CN104331717B

CN104331717B - The image classification method that a kind of integration characteristics dictionary structure is encoded with visual signature

Info

Publication number: CN104331717B
Application number: CN201410693888.1A
Authority: CN
Inventors: 杨育彬; 朱启海
Original assignee: Nanjing University
Current assignee: Nanjing University
Priority date: 2014-11-26
Filing date: 2014-11-26
Publication date: 2017-10-17
Anticipated expiration: 2034-11-26
Also published as: CN104331717A

Abstract

The invention discloses an image classification method integrating feature dictionary structure and visual feature coding, which comprises the following steps: visual feature extraction; feature dictionary learning; visual feature coding; spatial fusion of feature coding; training and classification. The invention can obtain more accurate image feature representation and improve the accuracy of image classification. Furthermore, by integrating the structural information in the feature dictionary into the visual feature encoding process, a more discriminative representation of image features is obtained, thus enabling more efficient classification of images. The invention realizes efficient and accurate image classification, and thus has high use value.

Description

An Image Classification Method Integrating Feature Dictionary Structure and Visual Feature Coding

技术领域technical field

本发明涉及图像分类领域，特别是基于码书模型(Bag-of-Words，BoW)的一种整合特征字典结构与视觉特征编码的图像分类方法The present invention relates to the field of image classification, especially an image classification method based on a codebook model (Bag-of-Words, BoW) that integrates feature dictionary structure and visual feature encoding

背景技术Background technique

随着信息技术的不断飞速发展，各个领域每天都在以惊人的速度产生各种类型的数据，包括文字、图像、视频、音乐等。在丰富多彩的数据信息中，图像因其表现直观生动、内容丰富、信息量大，以及存储与传输方便，备受青睐，并已经成为二十一世纪最重要的信息载体之一。特别是随着照相机、手机、平板等具有拍照功能的移动设备的日益普及，以及社交网络的兴起，人们获取图像的方式越来越多，也进一步促使图像数据急剧增长，快速准确地查找所需图像和高效地管理却因此变得越来越困难。人们迫切希望计算机能帮助人类，对互联网中海量图像所蕴含的语义进行分析，并充分理解图像所表达的内容，从而更有效地对图像进行管理、分类标注，或检索感兴趣的图像。With the continuous rapid development of information technology, various fields are generating various types of data at an alarming rate every day, including text, images, videos, music, etc. Among the rich and colorful data information, images are popular because of their intuitive and vivid performance, rich content, large amount of information, and convenient storage and transmission, and have become one of the most important information carriers in the 21st century. Especially with the increasing popularity of mobile devices with camera functions such as cameras, mobile phones, and tablets, and the rise of social networks, people have more and more ways to obtain images, which further promotes the rapid growth of image data, and quickly and accurately find the required information. Images and efficient management therefore become more and more difficult. People urgently hope that computers can help humans analyze the semantics contained in the massive images on the Internet, and fully understand the content expressed by the images, so as to manage, classify and label images more effectively, or retrieve images of interest.

图像分类作为计算机理解图像最主要的基础技术之一，已经受到了学术界和工业界各研究机构的广泛研究，并在国内外各权威期刊和重要学术会议上作为重要主题，是计算机视觉领域一个极重要的研究课题。图像分类是指按照一定的分类准则将图像智能化地分到一组已有定义类别中的过程，包括物体识别、场景语义分类、行为识别等。图像分类已经成为研究图像语义理解的重要技术手段。科学研究人员已经渐渐意识到以上问题的重要性并不断深入分析。近几年，码书模型为图像高层语义表示带来新的启发，以码书模型为关键技术的图像分类已取得了一定成果，但是仍有许多研究点尚未涉及，仍有巨大的突破空间。基于码书模型的图像分类方法的研究，已经成为当前人工智能、计算机视觉、机器学习和数据挖掘等诸多交叉领域中前沿性的热点，对积极推进社会信息化起到重要作用。在创造了无可替代的社会价值的同时，该领域仍有许多关键技术问题尚未解决，仍有许多功能实现需要进一步完善，因此，如何利用码书模型，更有效地理解和描述图像高层语义，以更灵活地实现图像分类的研究，具有深远的意义。Image classification, as one of the most important basic technologies for computer understanding of images, has been widely studied by various research institutions in academia and industry, and has been used as an important topic in authoritative journals and important academic conferences at home and abroad. It is an important topic in the field of computer vision. very important research topic. Image classification refers to the process of intelligently classifying images into a set of defined categories according to certain classification criteria, including object recognition, scene semantic classification, behavior recognition, etc. Image classification has become an important technical means to study image semantic understanding. Scientific researchers have gradually realized the importance of the above problems and continue to analyze them in depth. In recent years, the codebook model has brought new inspiration to the high-level semantic representation of images. Image classification with the codebook model as the key technology has achieved certain results, but there are still many research points that have not been covered, and there is still huge room for breakthroughs. The research on the image classification method based on the codebook model has become a cutting-edge hotspot in many intersecting fields such as artificial intelligence, computer vision, machine learning and data mining, and plays an important role in actively promoting social informatization. While creating irreplaceable social value, there are still many key technical issues in this field that have not been resolved, and there are still many functions that need to be further improved. Therefore, how to use the codebook model to understand and describe the high-level semantics of images more effectively, The research of realizing image classification more flexibly has far-reaching significance.

发明内容Contents of the invention

发明目的：本发明所要解决的技术问题是针对现有技术的不足，提供一种整合特征字典结构与视觉特征编码的图像分类方法，利用特征字典中视觉单词的分布信息辅助视觉特征编码，以使编码结果更具有判别性，从而提高图像分类的准确率。Purpose of the invention: the technical problem to be solved by this invention is to provide a kind of image classification method integrating feature dictionary structure and visual feature coding, and use the distribution information of visual words in the feature dictionary to assist visual feature coding, so as to make The encoding result is more discriminative, thus improving the accuracy of image classification.

为了解决上述技术问题，本发明公开了一种整合特征字典结构与视觉特征编码的图像分类方法，包含如下步骤：In order to solve the above-mentioned technical problems, the present invention discloses an image classification method integrating feature dictionary structure and visual feature coding, which includes the following steps:

步骤1，提取图像的视觉特征：对每幅图像进行局部采样，得到一组区域块，提取每块区域的视觉特征，得到每幅图像对应的视觉特征集合，称所有图像的视觉特征集合的整体为所有图像的视觉特征集，记为集合X；Step 1, extract the visual features of the image: perform local sampling on each image to obtain a set of area blocks, extract the visual features of each area, and obtain the set of visual features corresponding to each image, which is called the overall set of visual features of all images is the visual feature set of all images, denoted as set X;

步骤2，特征字典学习：以集合X为输入，使用特征字典学习方法，得到由一组具有代表性的视觉单词组成的特征字典；Step 2, feature dictionary learning: take the set X as input, and use the feature dictionary learning method to obtain a feature dictionary composed of a group of representative visual words;

步骤3，视觉特征编码：将每幅图像的每个视觉特征表示成视觉单词的线性组合，每个视觉单词对应一个系数，称这组系数为视觉特征的编码；Step 3, visual feature coding: express each visual feature of each image as a linear combination of visual words, each visual word corresponds to a coefficient, and this group of coefficients is called the coding of visual features;

步骤4，视觉特征编码的空间汇合：以每幅图像的所有视觉特征的编码为输入，使用统计方法，将每幅图像表示为一个向量，该向量就是对应图像的图像特征表示；Step 4, spatial fusion of visual feature encoding: take the encoding of all visual features of each image as input, and use statistical methods to represent each image as a vector, which is the image feature representation of the corresponding image;

步骤5，将步骤4得到的每幅图像的编码作为输入，使用分类模型进行训练和分类，得到分类结果。In step 5, the encoding of each image obtained in step 4 is used as an input, and the classification model is used for training and classification to obtain a classification result.

步骤1具体包括如下步骤：Step 1 specifically includes the following steps:

对每幅图像I进行局部采样，采用等步长的方式做密集采样，得到若干大小相同的区域块，对每个区域块提取一个视觉特征，使用视觉特征提取方法得到表示该局部块一个视觉特征，视觉特征提取方法包括：方向梯度直方图(Histogram of Oriented Gradient,HOG)，尺度不变特征变换(Scale-invariant feature transform，SIFT)等。得到图像I的视觉特征集合LFS_I，最终得到所有图像的视觉特征集合的整体X＝[x₁,x₂,…,x_N]∈R^d×N，其中，d表示视觉特征的维度，其大小由视觉特征提取技术决定，N表示所有图像的视觉特征的总数，x_i表示第i个视觉特征，i取值1～N。Local sampling is performed on each image I, and dense sampling is performed by means of equal steps to obtain several regional blocks of the same size, and a visual feature is extracted for each regional block, and a visual feature representing the local block is obtained by using the visual feature extraction method , Visual feature extraction methods include: Histogram of Oriented Gradient (HOG), Scale-invariant feature transform (Scale-invariant feature transform, SIFT) and so on. Get the visual feature set LFS _I of the image I, and finally get the overall X=[x ₁ ,x ₂ ,…,x _N ]∈R ^d×N of the visual feature set of all images, where d represents the dimension of the visual feature, and its The size is determined by the visual feature extraction technology, N represents the total number of visual features of all images, x _i represents the i-th visual feature, and i takes a value from 1 to N.

步骤2具体包括如下步骤：Step 2 specifically includes the following steps:

以集合X为输入，使用特征字典学习方法，得到一组具有代表性的视觉单词组成的特征字典，将该特征字典记为：B＝[b₁,b₂,…,b_M]∈R^d×M，其中M为视觉单词的个数；b_j是一个维度d的列向量，表示第j个视觉单词，j取值1～M。常用的特征字典学习方法包括：k-means，K-SVD等。Take the set X as input, and use the feature dictionary learning method to obtain a feature dictionary composed of a group of representative visual words. The feature dictionary is recorded as: B=[b ₁ ,b ₂ ,…,b _M ]∈R ^{d ×M} , where M is the number of visual words; b _j is a column vector of dimension d, representing the jth visual word, and j takes a value from 1 to M. Commonly used feature dictionary learning methods include: k-means, K-SVD, etc.

步骤3具体包括如下步骤：Step 3 specifically includes the following steps:

本步骤逐一对集合X中的每个视觉特征编码，对于视觉特征x_i，其编码过程如下：In this step, each visual feature in the set X is encoded one by one. For the visual feature x _i , the encoding process is as follows:

首先，从特征字典B中选出x_i的p个最近邻的视觉单词，即与视觉特征x_i的距离最小的p个视觉单词，记这p个视觉单词组成的特征字典为B_i，p取值1～M，i取值1～N，。First, select the p nearest neighbor visual words of x _i from the feature dictionary B, that is, the p visual words with the smallest distance to the visual feature x _i , record the feature dictionary composed of these p visual words as B _i , p The value is 1~M, and i takes the value 1~N.

其次，求出特征字典B_i中各视觉单词之间的距离所表示的矩阵D_i和计算视觉特征x_i到特征字典B_i的各视觉单词的距离表示的列向量d_i，i取值1～N。矩阵D_i的第m行s列的元素为B_i中对应视觉单词之间的距离，m，s＝1,2,…,p；d_i的第n个分量d_in表示视觉特征x_i与B_i中第n个视觉单词之间的距离，n＝1,2,…,p。距离计算公式为：σ是一个平滑参数，控制权重的下降速度，σ>0。dist(x_i,B_i)＝[dist(x_i,b_i1),dist(x_i,b_i2),…,dist(x_i,b_ip)]^T，b_il表示Bi的第l个视觉单词，l＝1,2,…,p；每个分量表示视觉特征x_i与视觉单词b_il之间的距离；max(dist(x_i,B_i))表示向量dist(x_i,B_i)的最大分量，从而使d_i中分量的值域为(0,1]。在计算一个视觉单词与其他视觉单词之间的距离时，也使用同样的策略。为加快D_i的求解速度，一次性求出B中各视觉单词之间的距离表示的矩阵D。则D_i就是D的子矩阵，通过直接索引D即可获得不同的D_i，i＝1,2,…,N。Secondly, obtain the matrix D _i represented by the distance between the visual words in the feature dictionary B _i and the column vector d _i represented by the distance between the visual feature x _i and the visual words in the feature dictionary B _i , i takes the value 1 ~N. The elements of the mth row and s column of the matrix D _i are the distances between the corresponding visual words in B _i , m, s=1, 2,..., p; the nth component d _in of d _i represents the visual feature x _i and The distance between the nth visual word in Bi, _n =1,2,...,p. The distance calculation formula is: σ is a smoothing parameter that controls the descending speed of the weight, σ>0. dist( _xi ,B _i )＝[dist( _xi ,b _i1 ),dist( _xi ,b _i2 ),…,dist( _xi ,b _ip )] ^T , b _il represents the lth vision of Bi words, l=1,2,…,p; each component Indicates the distance between the visual feature x _i and the visual word b _il ; max(dist( _xi ,B _i )) indicates the maximum component of the vector dist( _xi ,B _i ), so that the value range of the component in d _i is (0,1]. When calculating the distance between a visual word and other visual words, the same strategy is used. In order to speed up the solution of D _i , the distance between each visual word in B is calculated at one time. Matrix D. Then D _i is a sub-matrix of D, and different D _i can be obtained by directly indexing D, i=1, 2,...,N.

第三，以x_i，d_i，D_i，B_i和两个参数λ和β为输入，λ,β≥0，最小化下式，得到x_i在B_i上的编码 Third, take x _i , d _i , D _i , B _i and two parameters λ and β as input, λ, β≥0, minimize the following formula to get the encoding of _xi on B _i

约束条件： Restrictions:

其中表示点积，即两个向量对应的分量相乘得到一个新向量；求解得到x_i在这p个视觉单词的编码结果 in Represents the dot product, that is, the components corresponding to the two vectors are multiplied to obtain a new vector; solve to obtain the encoding result of _xi in the p visual words

最后，对编码中的分量排序，得到k个最大的编码系数及其对应的k个视觉单词构成的特征字典k＝1,2,…,p，则视觉特征x_i的编码z_i是一个M维的向量，向量中与对应的分量为其余分量都置为0。Finally, to encode The components in are sorted to get the k largest coding coefficients and its corresponding feature dictionary composed of k visual words k=1,2,...,p, then the coding z _i of the visual feature x _i is an M-dimensional vector, and the vector is the same as The corresponding components are The rest of the components are set to 0.

步骤5具体包括如下步骤：Step 5 specifically includes the following steps:

考虑每幅图像中的各视觉特征的空间统计信息，用三层的空间金字塔匹配模型(Spatial Pyramid Matching,SPM)，将一幅图像I的所有视觉特征的编码作为输入，结合最大汇合技术，则该空间金字塔输出一个维度为(2⁰+2²+2⁴)*M的向量，该向量即为I的图像特征表示。Consider the spatial statistical information of each visual feature in each image, use a three-layer spatial pyramid matching model (Spatial Pyramid Matching, SPM), take the encoding of all visual features of an image I as input, and combine the maximum convergence technology, then The spatial pyramid outputs a vector with a dimension of (2 ⁰ +2 ² +2 ⁴ )*M, which is the image feature representation of I.

步骤6具体包括如下步骤：Step 6 specifically includes the following steps:

在得到各图像的图像特征表示之后，就可以将它们用于训练和分类。将所有图像的图像特征表示所构成的集合分为训练集和测试集两部分，训练集用于训练分类模型，用训练好的模型对测试集分类。通常选用支持向量机(Support Vector Machine,SVM)作为分类器模型。After obtaining image feature representations for each image, they can be used for training and classification. The set formed by the image feature representation of all images is divided into two parts: training set and test set. The training set is used to train the classification model, and the trained model is used to classify the test set. Usually support vector machine (Support Vector Machine, SVM) is used as the classifier model.

本发明针对图像分类领域中的图像视觉特征编码方法，本发明具有如下特征：1)本发明在对视觉特征编码时，不仅考虑了视觉特征和视觉单词之间的关系，还考虑了视觉单词之间关系对视觉特征编码的影响；2)本发明求得的视觉特征编码是一个解析解，不需要迭代优化函数，因此本发明所述的视觉特征编码方法是快速的。The present invention is aimed at the image visual feature encoding method in the field of image classification. The present invention has the following characteristics: 1) when the present invention encodes visual features, it not only considers the relationship between visual features and visual words, but also considers the relationship between visual words. 2) the visual feature coding obtained by the present invention is an analytical solution, and does not need an iterative optimization function, so the visual feature coding method of the present invention is fast.

有益效果：本发明充分考虑了特征字典中视觉单词的分布这一结构信息，将该信息用于视觉特征的编码，使得视觉特征的编码更能反映特征字典中的视觉单词的分布。因此，图像的图像特征表示具有很强的判别性，从而提升图像分类的准确率。Beneficial effects: the present invention fully considers the structural information of the distribution of visual words in the feature dictionary, and uses this information for coding of visual features, so that the coding of visual features can better reflect the distribution of visual words in the feature dictionary. Therefore, the image feature representation of the image is highly discriminative, thereby improving the accuracy of image classification.

附图说明Description of drawings

下面结合附图和具体实施方式对本发明做更进一步的具体说明，本发明的上述和/或其他方面的优点将会变得更加清楚。The advantages of the above and/or other aspects of the present invention will become clearer as the present invention will be further described in detail in conjunction with the accompanying drawings and specific embodiments.

图1为本发明流程图。Fig. 1 is the flow chart of the present invention.

图2为视觉特征提取示意图Figure 2 is a schematic diagram of visual feature extraction

图3为对一个视觉特征编码的流程图Figure 3 is a flow chart of encoding a visual feature

图4为三层空间金字塔结构示意图。Fig. 4 is a schematic diagram of a three-layer space pyramid structure.

具体实施方式detailed description

如图1所示，本发明公开了一种整合特征字典结构与视觉特征编码的图像分类方法，包含如下步骤：As shown in Figure 1, the present invention discloses an image classification method integrating feature dictionary structure and visual feature coding, which includes the following steps:

步骤1，提取图像的视觉特征：对每幅图像进行局部采样，得到一组区域块，提取每块区域的视觉特征，得到每幅图像对应的视觉特征集合，将所有图像的视觉特征集合的整体记为集合X；Step 1, extract the visual features of the image: perform local sampling on each image to obtain a set of area blocks, extract the visual features of each area, obtain the visual feature set corresponding to each image, and combine the visual feature sets of all images as a whole denoted as set X;

步骤3，视觉特征编码：将每幅图像视觉特征表示成视觉单词的线性组合，每个视觉单词对应一个系数，得到视觉特征编码集合；Step 3, visual feature encoding: express the visual features of each image as a linear combination of visual words, each visual word corresponds to a coefficient, and obtain a visual feature encoding set;

步骤4，视觉特征编码的空间汇合：以每幅图像的所有视觉特的编码为输入，使用统计方法，将每幅图像表示为一个向量，该向量就是对应图像的图像特征表示；Step 4, spatial fusion of visual feature encoding: take all visual feature encodings of each image as input, and use statistical methods to represent each image as a vector, which is the image feature representation of the corresponding image;

1、步骤1包括如下步骤：1. Step 1 includes the following steps:

如图2所示，对于一幅图像I，通常采用等步长密集采样的方式从I抽取若干大小相等的区域块，并对每个区域块提取一个视觉特征，这里的视觉特征是一个d维向量。常用的视觉特征提取方法包括：方向梯度直方图(Histogram of Oriented Gradient,HOG)，尺度不变特征变换(Scale-invariant feature transform，SIFT)等。最终得到所有图像的视觉特征集合的整体X＝[x₁,x₂,…,x_N]∈R^d×N，其中，d表示视觉特征的维度，N表示所有图像的视觉特征的总数，x_i表示第i个视觉特征，i取值1～N。X被用于步骤2作为输入，以学习得到特征字典。As shown in Figure 2, for an image I, a number of equal-sized regional blocks are usually extracted from I by means of equal-step dense sampling, and a visual feature is extracted for each regional block, where the visual feature is a d-dimensional vector. Commonly used visual feature extraction methods include: Histogram of Oriented Gradient (HOG), Scale-invariant feature transform (SIFT), etc. Finally, the overall X=[x ₁ ,x ₂ ,…,x _N ]∈R ^d×N of the visual feature set of all images is obtained, where d represents the dimension of visual features, N represents the total number of visual features of all images, x _i represents the i-th visual feature, and i takes a value from 1 to N. X is used as input in step 2 to learn a feature dictionary.

2、步骤2包括如下步骤：2. Step 2 includes the following steps:

在本步骤以集合X为输入，使用特征字典学习方法得到M个d维的视觉单词构成的特征字典B＝[b₁,b₂,…,b_M]∈R^d×M，其中M为视觉单词的个数；b_j是一个维度d的列向量，表示第j个视觉单词，j取值1～M。以k-means方法为例，使用k-means将集合X聚为M个类，每个类中心就是一个视觉单词。In this step, the set X is used as input, and the feature dictionary B=[b ₁ ,b ₂ ,…,b _M ]∈R ^d×M composed of M d-dimensional visual words is obtained by using the feature dictionary learning method, where M is the visual The number of words; b _j is a column vector of dimension d, representing the jth visual word, and j takes a value from 1 to M. Taking the k-means method as an example, use k-means to cluster the set X into M classes, and the center of each class is a visual word.

3、步骤3包括如下步骤：3. Step 3 includes the following steps:

本步骤逐一对集合X中的每个视觉特征编码。This step encodes each visual feature in the set X one by one.

如图3所示的流程图描述了一个视觉特征的编码过程，针对视觉特征x_i，选取视觉特征x_i的由步骤2得到的特征字典B中的p个最近邻的视觉单词，即与视觉特征x_i的距离最小的p个视觉单词，p取值1～M，记这p个视觉单词组成的特征字典为B_i，i取值1～N，求出特征字典B_i中各视觉单词之间的距离所表示的矩阵D_i，矩阵D_i的第m行s列的元素为B_i中对应视觉单词之间的距离，m，s＝1,2,…,p，再计算视觉特征x_i到特征字典B_i的各视觉单词的距离表示的列向量d_i，d_i的第n个分量d_in表示视觉特征x_i与B_i中第n个视觉单词之间的距离，n＝1,2,…,p；以x_i，d_i，D_i，B_i和两个参数λ与β为输入，λ,β≥0，最小化下式，得到x_i在B_i上的编码 The flow chart shown in Figure 3 describes a visual feature encoding process. For the visual feature x _i , select the p nearest neighbor visual words in the feature dictionary B obtained in step 2 of the visual feature x _i , that is, The p visual words with the smallest distance between features x _i , p takes a value of 1 to M, and the feature dictionary composed of these p visual words is B _i , i takes a value of 1 to N, and finds each visual word in the feature dictionary B _i The matrix D _i represented by the distance between, the element of the mth row s column of the matrix D _i is the distance between the corresponding visual words in B _i , m, s=1,2,...,p, and then calculate the visual features The column vector d _i represented by the distance between x _i and each visual word in the feature dictionary B _i , the nth component d _in of d _i represents the distance between the visual feature x _i and the nth visual word in B _i , n= 1,2,...,p; take x _i , d _i , D _i , B _i and two parameters λ and β as input, λ, β≥0, minimize the following formula to get the encoding of _xi on B _i

约束条件： Restrictions:

其中表示点积，即两个向量对应的分量相乘得到一个新向量；求解得到x_i在这p个视觉单词的编码结果最后对编码中的分量排序，得到k个最大的编码系数及其对应的k个视觉单词k＝1,2,…,p，则x_i的编码z_i是一个M维的向量，向量中与对应的分量为其余分量都置为0。in Represents the dot product, that is, the components corresponding to the two vectors are multiplied to obtain a new vector; solve to obtain the encoding result of _xi in the p visual words final pair encoding The components in are sorted to get the k largest coding coefficients and its corresponding k visual words k=1,2,...,p, then the encoding z _i of x _i is an M-dimensional vector, in which the The corresponding components are The rest of the components are set to 0.

视觉特征x_i在B上的具体编码方法如下：The specific encoding method of visual feature x _i on B is as follows:

输入：图像视觉特征x_i，特征字典B＝[b₁,b₂,…,b_M]∈R^d×M，M为B中的视觉单词数以及x_i在B上的编码的维度。x_i的最近邻单词个数p，参数k,λ和β。Input: image visual feature _xi , feature dictionary B=[b ₁ ,b ₂ ,…,b _M ]∈R ^d×M , M is the number of visual words in B and the dimension of encoding of _xi on B. The number p of nearest neighbor words of x _i , parameters k, λ and β.

编码过程：Encoding process:

1)计算视觉特征x_i与所有视觉单词的距离所表示的M维的向量d′_i；1) Calculate the M-dimensional vector d′ _i represented by the distance between the visual feature x _i and all visual words;

2)对d′_′中分量按升序排序，并选出p个距离最小的视觉单词构成的集合B_i，及对应的距离d_i；2) Sort the components in d′ _′ in ascending order, and select a set B _i composed of p visual words with the smallest distance, and the corresponding distance d _i ;

3)求出B_i中各视觉单词之间的距离所表示的矩阵D_i；3) Find the matrix D _{i represented by the distance between each visual word in B i} _;

4)根据如下式子求出编码 4) Calculate the code according to the following formula

Ψ＝(x_i1^T-B_i)^T(x_i1^T-B_i)Ψ＝(x _i 1 ^T -B _i ) ^T (x _i 1 ^T -B _i )

Θ＝Ψ+λ*diag²(d_i)+βD_i Θ＝Ψ+λ*diag ² (d _i )+βD _i

α＝-(1^TΘ^-11)α＝-(1 ^T Θ ^-1 1)

其中diag(d_i)表示对角向量是di的对角矩阵。此处1表示分量全为1的列向量；where diag(d _i ) means that the diagonal vector is the diagonal matrix of di. Here 1 means a column vector whose components are all 1;

5)对中的分量按降序排序，得到k个最大的编码系数及其对应的k个视觉单词构成的特征字典则x_i的编码z_i是一个M维的向量，向量中与对应的分量为其余分量都置为0。使用式子z_i＝(1^Tz_i)^-1z_i归一化z_i；5) yes The components in are sorted in descending order to get the k largest coding coefficients and its corresponding feature dictionary composed of k visual words Then the encoding z _i of x _i is an M-dimensional vector, and the vector is the same as The corresponding components are The rest of the components are set to 0. Normalize z _i using the formula z _i =(1 ^T z _i ) ^-1 z _i ;

输出：视觉特征x_i的编码z_i。Output: Encoding _{zi of visual feature xi} _.

4、步骤4包括如下步骤：4. Step 4 includes the following steps:

如图4所示为一个三层的空间金字塔匹配模型，在得到一幅图像的所有视觉特征编码后，采用空间金字塔匹配模型(Spatial Pyramid Matching,SPM)，结合最大汇合(MaxPooling)这一空间汇合技术，以一幅图像的所有视觉特征编码为输入，得到一个向量，该向量就是这幅图像的图像特征表示。具体操作为：以图像中心为原点，使用不同尺度，递归地划分为若干子区域，例如图4中使用三层的空间金字塔匹配模型，一共有2⁰+2²+2⁴＝21个子区域。对于第a个区域，a取值1～21，使用最大汇合技术得到该区域的编码该式子表示这个图像子区域一共有t个视觉特征；a_t表示该区域的第h个视觉特征的编码，h取值1～t；z′_a是一个维度和z_ah相同的列向量，即其维度为M，它的第q个分量是矩阵的对应行的最大值，即q取值1～M。进一步将z′_q归一化，例如使用2范数归一化得z′_q＝z′_q/||z′_q||₂。最后将各子区域的编码依次拼接，得到该图像的图像特征表示。As shown in Figure 4, it is a three-layer spatial pyramid matching model. After obtaining all the visual feature encodings of an image, the spatial pyramid matching model (Spatial Pyramid Matching, SPM) is used, combined with the spatial convergence of MaxPooling. The technology uses all the visual feature codes of an image as input to obtain a vector, which is the image feature representation of the image. The specific operation is: take the center of the image as the origin, use different scales, and recursively divide it into several sub-regions. For example, in Figure 4, a three-layer spatial pyramid matching model is used, and there are a total of 2 ⁰ +2 ² +2 ⁴ =21 sub-regions. For the a-th area, a takes the value of 1 to 21, and the code of this area is obtained by using the maximum convergence technique This formula indicates that there are t visual features in this image sub-region; a _t represents the encoding of the h-th visual feature in this region, and h takes a value from 1 to t; z′ _a is a column vector with the same dimension as z _ah , That is, its dimension is M, and its qth component is a matrix The maximum value of the corresponding row of q takes a value from 1 to M. Further normalize z' _q , for example, use 2-norm normalization to get z' _q = z' _q /||z' _q || ₂ . Finally, the codes of each sub-region are sequentially spliced to obtain the image feature representation of the image.

5、步骤5包括如下步骤：5. Step 5 includes the following steps:

在得到所有图像的图像特征表示后，用作为训练集的图像的图像特征表示训练SVM分类模型，再使用训练好的SVM模型对用作测试集的图像的图像特征表示分类。After obtaining the image feature representations of all images, use the image feature representations of the images in the training set to train the SVM classification model, and then use the trained SVM model to classify the image feature representations of the images in the test set.

实施例1Example 1

本实施例包括以下部分：This embodiment includes the following parts:

1、首先将图像缩小到不超过300×300的尺寸，并转化为灰度图，然后采用密集采样策略，从图像中抽取16×16像素的图像块，每隔6像素抽取一次，对每个图像块提取一个SIFT特征。因此一幅图像可能包含成百上千个特征，取决于提取特征时的图像块大小和间隔大小。1. First reduce the image to a size of no more than 300×300, and convert it into a grayscale image, and then use a dense sampling strategy to extract image blocks of 16×16 pixels from the image, every 6 pixels, for each Image patches extract a SIFT feature. Therefore, an image may contain hundreds or thousands of features, depending on the image block size and spacing when extracting features.

2、首先使用k-均值(k-means)将所有图像视觉特征聚为M个簇，每个簇中心就代表一个视觉单词。设定最近邻视觉单词个数p，密集近邻视觉单词个数k，距离平滑参数σ，正则化参数λ和β。对每个视觉特征编码。2. First use k-means to gather all image visual features into M clusters, and the center of each cluster represents a visual word. Set the number of nearest neighbor visual words p, the number of dense nearest neighbor visual words k, the distance smoothing parameter σ, and the regularization parameters λ and β. Encode each visual feature.

3、使用空间金子匹配模型和最大汇合技术，将每幅图像的所有视觉特征编码汇合为一个向量作为该图像的图像特征表示。并使用支持向量机模型对图像进行训练和分类。3. Use the spatial golden matching model and the maximum fusion technology to combine all the visual feature codes of each image into a vector as the image feature representation of the image. And use a support vector machine model to train and classify the images.

实施例2Example 2

对图像提取维度为128的视觉特征，特征字典的大小及视觉单词的的数量设置为1024。分别将p和k设置为10和5。其它的参数设置还包括：λ＝10^-4，β＝10^-4。使用3层的空间金子塔匹配及最大汇合技术。得到每个图像的21504维的图像特征表示。使用作为训练集的图像的图像特征表示训练SVM分类模型，并用训练好的模型对作为测试集的图像的图像特征表示分类，得到最终的分类结果。For the image extraction dimension of 128 visual features, the size of the feature dictionary and the number of visual words are set to 1024. Set p and k to 10 and 5, respectively. Other parameter settings also include: λ=10 ^-4 , β=10 ^-4 . Use 3 layers of spatial pyramid matching and maximum convergence technology. A 21504-dimensional image feature representation is obtained for each image. Use the image feature representation of the image as the training set to train the SVM classification model, and use the trained model to classify the image feature representation of the image as the test set to obtain the final classification result.

本发明提供了一种整合特征字典结构与视觉特征编码的图像分类方法，具体实现该技术方案的方法和途径很多，以上所述仅是本发明的优选实施方式，应当指出，对于本技术领域的普通技术人员来说，在不脱离本发明原理的前提下，还可以做出若干改进和润饰，这些改进和润饰也应视为本发明的保护范围。本实施例中未明确的各组成部分均可用现有技术加以实现。The present invention provides an image classification method that integrates feature dictionary structure and visual feature coding. There are many methods and approaches to realize this technical solution. The above description is only a preferred embodiment of the present invention. Those of ordinary skill may make some improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as the protection scope of the present invention. All components that are not specified in this embodiment can be realized by existing technologies.

Claims

1. the image classification method that a kind of integration characteristics dictionary structure is encoded with visual signature, it is characterised in that including following step Suddenly：

Step 1, the visual signature of image is extracted：Local sampling is carried out to each image, one group of region unit is obtained, every piece of area is extracted The visual signature in domain, obtains the corresponding visual signature set of each image, claims the visual signature set of all images generally The visual signature collection of all images, is designated as set X；

Step 2, characteristics dictionary learns：Using set X as input, using characteristics dictionary learning method, obtain that there is representative by one group Property vision word composition characteristics dictionary；

Step 3, visual signature is encoded：Each visual signature of each image is expressed as the linear combination of vision word, each Vision word one coefficient of correspondence, the coefficient is called the coding of visual signature；

Step 4, the space of visual signature coding is converged：Input is encoded to all visual signatures of each image, system is used Meter method, a vector is expressed as by each image, and the vector is exactly the image feature representation of correspondence image；

Step 5, the coding of each image step 4 obtained is trained and classified using disaggregated model as input, is obtained Classification results；

Step 1 comprises the following steps：

Local sampling is carried out for image I, sampling every time obtains a region unit, and each region unit extracts a visual signature, Obtain image I visual signature set LFS_I, finally give the visual signature set X=[x of all images₁,x₂,…,x_N]∈R^d ^×N, wherein, d represents the dimension of visual signature, and N represents the sum of the visual signature of all images, x_iI-th of visual signature is represented, 1~N of i values；

Step 2 comprises the following steps：

Using set X as input, using characteristics dictionary learning method, the spy being made up of one group of representative vision word is obtained Dictionary is levied, this feature dictionary is designated as：B=[b₁,b₂,…,b_j,…,b_M]∈R^d×M, wherein M is the number of vision word；b_jIt is One dimension d column vector, represents j-th of vision word, 1~M of j values；

Step 3 comprises the following steps：

For visual signature x_i, choose visual signature x_iThe characteristics dictionary B obtained by step 2 in p arest neighbors vision list Word, i.e., with visual signature x_iThe minimum p vision word of distance, p 1~M of value remember the feature that this p vision word is constituted Dictionary is B_i, i 1~N of value obtain characteristics dictionary B_iIn matrix D represented by the distance between each vision word_i, matrix D_i's The element of m rows s row is characterized dictionary B_iIn the distance between m-th and s-th vision word, m, s=1,2 ..., p；Count again Calculate visual signature x_iTo characteristics dictionary B_iEach vision word the column vector d that represents of distance_i, d_iN-th of component d_inExpression is regarded Feel feature x_iWith B_iIn the distance between n-th vision word, n=1,2 ..., p, with x_i, d_i, D_i, B_iIt is with β with two parameter lambdas Input, λ, β >=0 minimizes following formula, obtains x_iIn B_iOn coding

Constraints：

WhereinRepresent that the corresponding component of the vector of dot product, i.e., two is multiplied and obtain a new vector；Solution obtains x_iRegarded at this p Feel the coding result of wordFinally to codingIn component sequence, obtain k maximum code coefficientAnd its it is corresponding The characteristics dictionary that k vision word is constitutedK=1,2 ..., p, then visual signature x_iCoding z_iIt is the vector of a M dimension, In vector withCorresponding component isRemaining component is all set to 0.

2. according to the method described in claim 1, it is characterised in that step 5 comprises the following steps：Matched using spatial pyramid Model, a vector is merged into as the image feature representation of the image using the coding of all visual signatures of each image.

3. method according to claim 2, it is characterised in that step 6 comprises the following steps：The image for obtaining all images is special Levy after the set for representing constituted, the set be divided into training set and test set two parts, training set is used for train classification models, Test set is classified with the model trained.