CN112132004A

CN112132004A - A fine-grained image recognition method based on multi-view feature fusion

Info

Publication number: CN112132004A
Application number: CN202010992253.7A
Authority: CN
Inventors: 黄伟锋; 张甜; 常东良; 马占宇; 柳斐; 王丹; 刘念
Original assignee: South To North Water Transfer Middle Route Information Technology Co ltd; Beijing University of Posts and Telecommunications
Current assignee: South To North Water Transfer Middle Route Information Technology Co ltd; Beijing University of Posts and Telecommunications
Priority date: 2020-09-21
Filing date: 2020-09-21
Publication date: 2020-12-25
Anticipated expiration: 2040-09-21
Also published as: CN112132004B

Abstract

A fine-grained image recognition method based on multi-view feature fusion, relates to the technical field of image processing, and solves the problem that existing fine-grained image recognition methods ignore the detailed information of images and have poor adaptability to visual differences between images, and the introduced The loss function is complex, which increases the number of parameters of the model, including the bilinear feature extraction step, the suppression branch learning step, the similar comparison module learning step, the center loss calculation step and the model optimization loss function calculation step; the invention introduces the suppression branch, By suppressing the most salient regions in the image, the network is forced to look for subtle discriminative features between confusing classes. A similar comparison learning module is introduced to fuse the feature vectors of similar samples to increase the interactive information of different images under the same category. A center loss function is also introduced to minimize the distance between features and corresponding class centers, making the learned features more discriminative. Improves the accuracy of fine-grained image recognition.

Description

A fine-grained image recognition method based on multi-view feature fusion

技术领域technical field

本发明涉及图像处理技术领域，具体涉及一种基于多视角特征融合的细粒度图像识别方法。The invention relates to the technical field of image processing, in particular to a fine-grained image recognition method based on multi-view feature fusion.

背景技术Background technique

细粒度图像分类是在区分出基本类别的基础上，进行更精细的子类划分，例如鸟类或犬类等。因此该问题需要捕获细微的类间差异，充分挖掘图像的判别性特征。Fine-grained image classification is based on distinguishing basic categories, and then performs more fine-grained sub-categories, such as birds or dogs. Therefore, this problem needs to capture subtle inter-class differences and fully exploit the discriminative features of images.

细粒度物体在现实生活中广泛存在，与之相对应的细粒度图像识别是计算机视觉识别中的一个重要研究课题。当前细粒度图像识别主要存在以下三个方面的挑战：(1)由于姿势、背景和拍摄角度的不同，同一类别外表看来可能具有较大的差异。(2)不同类别由于属于同一父类，它们之间的差异仅存在于一些细微的区域中，例如鸟的喙和尾巴等。(3)对细粒度图像的收集、标注费时费力。如图1所示。Fine-grained objects exist widely in real life, and the corresponding fine-grained image recognition is an important research topic in computer vision recognition. The current fine-grained image recognition mainly has the following three challenges: (1) Due to the different poses, backgrounds and shooting angles, the appearance of the same category may appear to be quite different. (2) Since different categories belong to the same parent category, the differences between them only exist in some subtle areas, such as the beak and tail of birds. (3) The collection and labeling of fine-grained images is time-consuming and labor-intensive. As shown in Figure 1.

现有的方法主要通过以下三个方面达到识别的目的：(1)基于定位-分类网络进行细粒度图像识别。(2)通过开发用于细粒度识别的强大深度模型来直接学习更具判别力的表征。(3)结合图像的全局特征与局部特征实现图像的细粒度分类。The existing methods mainly achieve the purpose of recognition through the following three aspects: (1) Fine-grained image recognition based on localization-classification network. (2) Directly learn more discriminative representations by developing powerful deep models for fine-grained recognition. (3) Combine the global features and local features of images to achieve fine-grained classification of images.

现有技术1，双线性池化细粒度图像分类(Bilinear pooling)，通过预训练的孪生卷积神经网络(convolutional neural networks)提取特征，并在特征的各个通道层面进行双线性池化，得到特征的高阶表示，从而增强特征的判别能力。该方法得益于通过一种新的池化方式，实现了细粒度图像识别准确率的提升。Prior art 1, bilinear pooling fine-grained image classification (Bilinear pooling), extracts features through pre-trained twin convolutional neural networks (convolutional neural networks), and performs bilinear pooling at each channel level of the features, A high-order representation of the feature is obtained, thereby enhancing the discriminative ability of the feature. This method benefits from a new pooling method, which improves the accuracy of fine-grained image recognition.

该方法提出了一种新的双线性池化方式，但在细粒度图像类别间关系、模型参数量、细节区域的数量等方面没有针对细粒度图像识别进行有效的设计。即没有考虑到细粒度图像蕴含多种细节信息，类间差异小，类内差异大等因素的影响。This method proposes a new bilinear pooling method, but there is no effective design for fine-grained image recognition in terms of the relationship between fine-grained image categories, the amount of model parameters, and the number of detail regions. That is, it does not take into account the influence of factors such as fine-grained images containing a variety of detailed information, small differences between classes, and large differences within classes.

现有技术2，多注意多分类约束网络(Multi-Attention Multi-ClassConstraint)，该方法通过一次压缩-多次扩展(one-squeeze multi-excitation)模块提取输入图像的多个注意(attention)区域，然后引入度量学习(metric learning)，采用triplet loss和softmax loss来训练网络，将同类特征的相同注意拉近，将不同注意或不同类的特征推开，从而加强了部件之间的关系，实现了细粒度图像识别准确率的提升。Prior art 2, Multi-Attention Multi-Class Constraint, this method extracts multiple attention regions of the input image through a one-squeeze multi-excitation module, Then metric learning is introduced, triplet loss and softmax loss are used to train the network, the same attention of the same features is drawn closer, and the features of different attention or different classes are pushed apart, thereby strengthening the relationship between components and realizing Improved accuracy of fine-grained image recognition.

该方法主要利用度量学习来改善特征空间中的样本分布，因此，它对于挖掘一对图像之间视觉差异的适应能力较差。而且引入的损失函数较为复杂，需要构建大量样本对，极大地增加了模型的参数量。This method mainly utilizes metric learning to improve the sample distribution in the feature space, so it is less adaptable for mining visual differences between a pair of images. Moreover, the introduced loss function is relatively complex, and a large number of sample pairs need to be constructed, which greatly increases the number of parameters of the model.

发明内容SUMMARY OF THE INVENTION

本发明为解决现有细粒度图像识别方法忽略了图像的细节信息以及存在图像之间视觉差异的适应能力较差，且引入的损失函数复杂，增加了模型的参数量等问题，提供一种基于多视角特征融合的细粒度图像识别方法。In order to solve the problems that the existing fine-grained image recognition method ignores the detailed information of the image and the visual difference between the images has poor adaptability, the introduced loss function is complex, and the parameter quantity of the model is increased, the present invention provides a method based on A fine-grained image recognition method based on multi-view feature fusion.

一种基于多视角特征融合的细粒度图像识别方法，该方法由以下步骤实现：A fine-grained image recognition method based on multi-view feature fusion, the method is realized by the following steps:

步骤一、双线性特征提取；Step 1. Bilinear feature extraction;

将原始图像输入至双线性特征提取网络，将不同卷积层输出的特征图进行融合，获得双线性特征向量；所述特征提取网络采用在数据集ImageNet预训练的网络结构；The original image is input into the bilinear feature extraction network, and the feature maps output by different convolutional layers are fused to obtain bilinear feature vectors; the feature extraction network adopts the network structure pre-trained in the dataset ImageNet;

步骤二、抑制分支学习，具体过程为：Step 2: Suppress branch learning, the specific process is:

步骤二一、根据步骤一所述特征提取网络不同卷积层输出的特征图的大小及阈值生成注意图；Step 21: Generate an attention map according to the size and threshold of the feature maps output by different convolutional layers of the feature extraction network described in step 1;

步骤二二、根据步骤二一所述的注意图生成抑制掩码，将所述抑制掩码覆盖到所述原始图像上生成局部区域被掩蔽的抑制图像；Step 22, generating a suppression mask according to the attention map described in step 21, and overlaying the suppression mask on the original image to generate a suppressed image in which the local area is masked;

步骤二三、将步骤二所述的抑制图像采用步骤一进行双线性特征提取，获得双线性特征向量，将所述双线性特征向量输入至全连接层，获得预测的类别概率值，并对该预测的类别概率值计算多分类交叉熵；Step 2 and 3: Use step 1 to perform bilinear feature extraction on the suppressed image described in step 2, obtain bilinear feature vector, input the bilinear feature vector to the fully connected layer, and obtain the predicted class probability value, And calculate the multi-class cross entropy for the predicted class probability value;

步骤三、同类对比模块学习；Step 3. Similar comparison module learning;

步骤三一、随机选择与原始图像同一类别下其他N张图像作为正样本图像；Step 31: Randomly select other N images in the same category as the original image as positive sample images;

步骤三二、将目标图像与步骤三一所述的正样本图像送入步骤一所述的特征提取网络进行双线性特征向量融合，获得融合特征综合了同类别下多张图像的双线性特征向量；Step 32: The target image and the positive sample image described in step 31 are sent to the feature extraction network described in step 1 for bilinear feature vector fusion, and the fusion feature is obtained. Feature vector;

步骤三三、将步骤三二获得的同类别下多张图像的双线性特征向量求平均，获得一个融合的特征向量，将融合后的特征向量输入全连接层，获得预测的概率，对得到的同类别的预测概率计算多分类交叉熵；Step 33: Average the bilinear eigenvectors of multiple images of the same category obtained in step 32 to obtain a fused feature vector, and input the fused feature vector into the fully connected layer to obtain the predicted probability. Calculate the multi-class cross entropy of the predicted probability of the same category;

步骤四、中心损失函数L_C计算；Step 4: Calculate the central loss function _LC ;

令v_i为第i个样本的双线性特征，c_i为样本i对应类别所有样本的平均特征，即类中心，N为当前批量的样本个数，则中心损失函数L_C的公式如下：Let v _i be the bilinear feature of the ith sample, c _i be the average feature of all samples in the corresponding category of sample i, that is, the class center, and N be the number of samples in the current batch, then the formula of the center loss function L _C is as follows:

步骤五、模型优化损失函数计算；Step 5. Model optimization loss function calculation;

将原始图像的双线性特征向量的交叉熵损失函数，抑制图像的双线性特征向量的交叉熵损失函数，融合特征的交叉熵损失函数以及中心损失函数加权求和，获得模型优化的损失函数。The cross-entropy loss function of the bilinear feature vector of the original image, the cross-entropy loss function of the bilinear feature vector of the suppressed image, the cross-entropy loss function of the fusion feature and the central loss function are weighted and summed to obtain the loss function of the model optimization. .

本发明的有益效果：本发明综合考虑了细粒度图像类内差异大，类间差异小，背景噪声影响大等因素，引入一个抑制分支，通过抑制图像中最显著的区域，从而迫使网络寻找易混淆类别间微妙的判别性特征。还引入了一个同类对比学习模块，将同类样本的特征向量进行融合，从而增加了同一类别下不同图像的交互信息。同时，引入了中心损失函数，最小化特征与对应类中心之间的距离，使学到的特征更具判别性。Beneficial effects of the present invention: The present invention comprehensively considers factors such as large intra-class differences, small inter-class differences, and large background noise in fine-grained images, and introduces a suppression branch, which suppresses the most significant area in the image, thereby forcing the network to find easy-to-use Confuse subtle discriminative features between classes. A class comparison learning module is also introduced to fuse the feature vectors of samples of the same class, thereby increasing the interactive information of different images under the same class. At the same time, a center loss function is introduced to minimize the distance between the feature and the corresponding class center, making the learned features more discriminative.

综合以上几点，本发明在判断过程综合利用了全局特征和局部特征，在多个细粒度图像分类任务上取得了明显的性能提升，相比现有的方法更具鲁棒性，且易于实际部署。提升了细粒度图像识别的准确率。Based on the above points, the present invention comprehensively utilizes global features and local features in the judgment process, and achieves obvious performance improvement in multiple fine-grained image classification tasks, which is more robust and easy to practice compared with existing methods. deploy. Improves the accuracy of fine-grained image recognition.

附图说明Description of drawings

图1为中图1a、图1b、图1c和图1d均为现有的4组细粒度图像示意图；Fig. 1 is a schematic diagram of four existing groups of fine-grained images in Fig. 1a, Fig. 1b, Fig. 1c and Fig. 1d;

图2为本发明所述的一种基于多视角特征融合的细粒度图像识别方法中双线性特征提取的示意图；2 is a schematic diagram of bilinear feature extraction in a fine-grained image recognition method based on multi-view feature fusion according to the present invention;

图3为本发明所述的一种基于多视角特征融合的细粒度图像识别方法中同类对比学习的示意图；FIG. 3 is a schematic diagram of similar comparison learning in a fine-grained image recognition method based on multi-view feature fusion according to the present invention;

图4为本发明所述的一种基于多视角特征融合的细粒度图像识别方法中模型优化损失函数计算的示意图；4 is a schematic diagram of model optimization loss function calculation in a fine-grained image recognition method based on multi-view feature fusion according to the present invention;

图5为本发明所述的一种基于多视角特征融合的细粒度图像识别方法获得的特征可视化效果图。FIG. 5 is a feature visualization effect diagram obtained by a fine-grained image recognition method based on multi-view feature fusion according to the present invention.

具体实施方式Detailed ways

结合图2至图5说明本实施方式，一种基于多视角特征融合的细粒度图像识别方法，该方法由以下步骤实现：The present embodiment, a fine-grained image recognition method based on multi-view feature fusion, is described with reference to FIGS. 2 to 5 , and the method is implemented by the following steps:

步骤一、双线性特征提取步骤：采用在ImageNet上的预训练的ResNet-50网络结构，输入大小固定的原始图像，将不同卷积层输出的特征图进行融合，得到双线性特征向量。Step 1. Bilinear feature extraction step: Using the pre-trained ResNet-50 network structure on ImageNet, input the original image with a fixed size, and fuse the feature maps output by different convolutional layers to obtain bilinear feature vectors.

结合图2，在特征提取步骤中，使用在数据集ImageNet预训练的网络作为特征提取的基础网络，常用的图像分类网络如VGGNet、GoogLeNet、ResNet等，可对其进行微调(fine-tune)使模型适应特定的任务。具体来说，将原始图像输入至特征提取网络，得到最后两个卷积层输出的特征图(feature map)，分别记为

其中D1、D2表示两种特征的通道数，H和W分别表示特征图的高度和宽度。为了解决融合后特征维数过高的问题，同时保证生成的特征向量中所包含的特征信息足够多，只随机抽取F₂中n个通道的特征图与F₁进行融合。记特征图F₁与F₂中各个位置沿通道的特征向量为

两个特征向量相乘后可得到双线性矩阵

将特征图中各个位置对应的双线性矩阵进行相加，并将矩阵展为一个向量，即为双线性向量

其中D＝D1×D2。该双线性向量提供了比线性模型更强的特征表示。Combined with Figure 2, in the feature extraction step, the network pre-trained in the dataset ImageNet is used as the basic network for feature extraction. Commonly used image classification networks such as VGGNet, GoogLeNet, ResNet, etc., can be fine-tuned (fine-tune). Models are adapted to specific tasks. Specifically, the original image is input to the feature extraction network, and the feature maps output by the last two convolutional layers are obtained, which are recorded as

Among them, D1 and D2 represent the number of channels of the two features, and H and W represent the height and width of the feature map, respectively. In order to solve the problem that the feature dimension is too high after fusion, and at the same time ensure that the generated feature vector contains enough feature information, only the feature maps of _n channels in _F2 are randomly extracted and fused with F1. The feature vectors of each position along the channel in the feature maps F ₁ and F ₂ are written as

The bilinear matrix can be obtained by multiplying the two eigenvectors

Add the bilinear matrices corresponding to each position in the feature map, and expand the matrix into a vector, which is a bilinear vector

where D=D1×D2. This bilinear vector provides a stronger feature representation than the linear model.

步骤二、抑制分支学习步骤：Step 2. Suppression branch learning steps:

A、注意图生成步骤：根据特征图的大小及阈值生成注意图。A. Attention map generation step: generate an attention map according to the size and threshold of the feature map.

B、图像抑制步骤:根据注意图生成抑制掩码，覆盖到原始图像上生成局部区域被掩蔽的抑制图像。B. Image suppression step: generate a suppression mask according to the attention map, and overlay it on the original image to generate a suppressed image with masked local areas.

C、多分类交叉熵计算步骤：抑制图像经由步骤一得到双线性特征向量，再输入至全连接层得到预测概率值，对得到的类别预测值计算多分类交叉熵。C. Multi-classification cross-entropy calculation step: The suppression image obtains a bilinear feature vector through step 1, and then inputs it to the fully connected layer to obtain the predicted probability value, and calculates the multi-classification cross-entropy for the obtained class prediction value.

在抑制分支学习步骤中，包含以下三个方面：In the suppression branch learning step, the following three aspects are included:

步骤A对特征提取网络中卷积层输出的特征图

的各个通道求平均值p_d，按该平均值进行排序，选择top-5的值计算熵：Step A: The feature map output by the convolutional layer in the feature extraction network

Calculate the average value of each channel p _d , sort by this average value, and select the top-5 value to calculate the entropy:

通过比较熵和阈值δ的大小来构建注意图A：Construct the attention map A by comparing the magnitude of entropy and threshold δ:

步骤B将注意图放大至原始图像大小，计算其平均值m，以m*θ作为阈值，将注意图中大于阈值的元素设置为0，将其他元素设置为1，从而获得抑制掩码M：Step B enlarges the attention map to the original image size, calculates its average value m, uses m*θ as the threshold, sets the elements in the attention map larger than the threshold value to 0, and sets other elements to 1, thereby obtaining the suppression mask M:

计算所述注意图的平均值m，设定范围在0～1之间的阈值θ，Calculate the average value m of the attention map, and set a threshold θ ranging from 0 to 1,

步骤C将抑制掩码覆盖到原始图像，从而得到局部区域被掩蔽的抑制图像：Step C covers the suppression mask to the original image, so as to obtain the suppressed image with the masked local area:

I_s(x,y)＝I(x,y)*M(x,y)Is (x,y)=I( _x ,y)*M(x,y)

式中，I(x,y)为原始图像中I中(x,y)位置的值。In the formula, I(x,y) is the value of the (x,y) position in I in the original image.

由于抑制了图像最显著的区域，从而分散了注意力，迫使神经网络从其他区域学到判别性信息。还可以降低网络对训练样本的依赖，防止过拟合，进一步提高了模型的鲁棒性。Distraction by suppressing the most salient regions of the image forces the neural network to learn discriminative information from other regions. It can also reduce the network's dependence on training samples, prevent overfitting, and further improve the robustness of the model.

步骤三、同类对比模块学习步骤：Step 3. Learning steps for similar comparison modules:

A、图像采样步骤：随机选择同一类别下其他N张图像作为正样本。A. Image sampling step: randomly select other N images in the same category as positive samples.

B、特征融合步骤:将目标图像和随机采样的正样本图像经步骤一得到的双线性特征向量进行融合，得到的融合特征综合了同一类别下多张图像的特征信息。B. Feature fusion step: fuse the target image and the randomly sampled positive sample image through the bilinear feature vector obtained in step 1, and the obtained fusion feature integrates the feature information of multiple images under the same category.

C、融合特征损失函数计算：将融合后的特征向量直接输入全连接层得到预测概率，对得到的类别预测值计算多分类交叉熵。C. Calculation of fusion feature loss function: The fused feature vector is directly input into the fully connected layer to obtain the prediction probability, and the multi-class cross entropy is calculated for the obtained category prediction value.

结合图3，步骤A随机选择与输入图像属于同一类别的N张图像，都送入步骤一的双线性特征提取网络。Referring to Figure 3, step A randomly selects N images belonging to the same category as the input image, and sends them to the bilinear feature extraction network in step 1.

步骤B对步骤A输出的多张同类图像的双线性特征向量求平均，得到一个融合的特征向量：Step B averages the bilinear feature vectors of multiple images of the same type output in step A to obtain a fused feature vector:

式中，j为特征向量的位置，V(j)为特征向量在第j个位置的值，T为选取的正样本数量；V_r(j)为第r个正样本在在第j个位置的值；In the formula, j is the position of the feature vector, V(j) is the value of the feature vector at the jth position, T is the number of positive samples selected; V _r (j) is the rth positive sample at the jth position. the value of;

步骤四、中心损失计算步骤；Step 4: Calculate the center loss;

A、类中心生成步骤：在训练过程中，不断更新网络学到的各个类别中心的特征向量。A. Class center generation step: During the training process, the feature vector of each class center learned by the network is continuously updated.

B、中心损失计算步骤：以每个输入图像得到的双线性特征向量与类中心向量之前的距离作为中心损失，在训练过程中不断优化。B. Center loss calculation step: The distance between the bilinear feature vector obtained from each input image and the class center vector is used as the center loss, which is continuously optimized during the training process.

本实施方式中，对每个类别都计算一个特征向量作为对应类别的类中心，这个特征向量随着训练的进行会不断的更新。通过惩罚每个样本的双线性特征向量和对应类别样本中心的偏移，使得同一类别的样本尽量聚合在一起，省去了复杂的样本对构造过程。令v_i为第i个样本的双线性特征，c_i为样本i对应类别所有样本的平均特征，即类中心，N为当前批量的样本个数，公式如下：In this embodiment, a feature vector is calculated for each category as the class center of the corresponding category, and this feature vector will be continuously updated as the training progresses. By penalizing the offset of the bilinear feature vector of each sample and the center of the corresponding class of samples, the samples of the same class are aggregated together as much as possible, eliminating the complex sample pair construction process. Let vi be the bilinear feature of the _ith sample, ci be the average feature of all samples of the corresponding category of sample _i , namely the class center, N is the number of samples in the current batch, the formula is as follows:

步骤五、模型优化损失函数计算步骤：Step 5. Model optimization loss function calculation steps:

将原始图像的双线性特征的交叉熵损失函数，抑制图像的双线性特征的交叉熵损失函数，融合特征的交叉熵损失函数以及中心损失函数加权求和，即可得到模型优化的损失函数。The weighted summation of the cross-entropy loss function of the bilinear feature of the original image, the cross-entropy loss function of the suppressed bilinear feature of the image, the cross-entropy loss function of the fusion feature, and the central loss function can obtain the optimized loss function of the model .

结合图4，记原始图像的双线性特征向量的交叉熵损失为L_CE1，抑制图像的双线性特征向量的交叉熵损失为L_CE2，融合特征的交叉熵损失为L_CE3,中心损失为L_C，将这些损失函数加权求和，即可得到模型优化的损失函数L：Referring to Figure 4, the cross-entropy loss of the bilinear feature vector of the original image is L _CE1 , the cross-entropy loss of the bi-linear feature vector of the suppressed image is L _CE2 , the cross-entropy loss of the fusion feature is L _CE3 , and the center loss is L _C , the weighted summation of these loss functions, the loss function L optimized by the model can be obtained:

式中，λ为中心损失函数的权重。where λ is the weight of the central loss function.

结合图5说明本实施方式，图5中，第一行为数据集中随机选取的原始图像，第二行为原始图像输入的全局分支获得的类激活图，第三行为抑制分支获得的类激活图。可以看出，在全局分支中，网络学习到了图像最显著的区域，如鸟的喙、汽车前照灯等，在抑制分支中，网络学习到了有利于细粒度分类的微妙的特征，如鸟的躯干、车轮等。多视角结合，使得网络模型的判断依据更为全面，既能显著性区域，又能微妙地捕捉到细粒度特征。This embodiment is described with reference to FIG. 5 . In FIG. 5 , the first row is a randomly selected original image in the data set, the second row is a class activation map obtained by the global branch input from the original image, and the third row is a class activation map obtained by the suppression branch. It can be seen that in the global branch, the network learns the most salient regions of the image, such as bird's beak, car headlights, etc., and in the suppression branch, the network learns subtle features that are conducive to fine-grained classification, such as bird's beak, car headlights, etc. Torso, wheels, etc. The combination of multiple perspectives makes the judgment basis of the network model more comprehensive, which can capture both salient regions and fine-grained features subtly.

本实施方式所述的细粒度图像识别方法，引入了一种新的数据增强方式，通过注意图引导来抑制图像中的部件区域，从而分散注意力，使网络学习到更多互补区域特征信息。引入了一个同类对比模块，融合了来自同一类别多张图像的特征信息，使同一类别图像的表示在嵌入空间尽可能接近，从而提高了分类性能。The fine-grained image recognition method described in this embodiment introduces a new data enhancement method, which suppresses component regions in the image through attention map guidance, thereby distracting attention and enabling the network to learn more complementary region feature information. A homogenous comparison module is introduced, which fuses feature information from multiple images of the same category to make the representations of images of the same category as close as possible in the embedding space, thereby improving the classification performance.

以上所述实施例的各技术特征可以进行任意的组合，为使描述简洁，未对上述实施例中的各个技术特征所有可能的组合都进行描述，然而，只要这些技术特征的组合不存在矛盾，都应当认为是本说明书记载的范围。The technical features of the above-described embodiments can be combined arbitrarily. For the sake of brevity, all possible combinations of the technical features in the above-described embodiments are not described. However, as long as there is no contradiction between the combinations of these technical features, All should be regarded as the scope described in this specification.

以上所述实施例仅表达了本发明的几种实施方式，其描述较为具体和详细，但并不能因此而理解为对发明专利范围的限制。应当指出的是，对于本领域的普通技术人员来说，在不脱离本发明构思的前提下，还可以做出若干变形和改进，这些都属于本发明的保护范围。因此，本发明专利的保护范围应以所附权利要求为准。The above-mentioned embodiments only represent several embodiments of the present invention, and the descriptions thereof are specific and detailed, but should not be construed as a limitation on the scope of the invention patent. It should be pointed out that for those skilled in the art, without departing from the concept of the present invention, several modifications and improvements can be made, which all belong to the protection scope of the present invention. Therefore, the protection scope of the patent of the present invention shall be subject to the appended claims.

Claims

1. a fine-grained image recognition method based on multi-view feature fusion, is characterized in that: the method is realized by the following steps:

Step 1. Bilinear feature extraction;

The original image is input into the bilinear feature extraction network, and the feature maps output by different convolutional layers are fused to obtain bilinear feature vectors; the feature extraction network adopts the network structure pre-trained in the dataset ImageNet;

Step 2: Suppress branch learning, the specific process is:

Step 21: Generate an attention map according to the size and threshold of the feature maps output by different convolutional layers of the feature extraction network described in step 1;

Step 22, generating a suppression mask according to the attention map described in step 21, and overlaying the suppression mask on the original image to generate a suppressed image in which the local area is masked;

Step 2 and 3: Use step 1 to perform bilinear feature extraction on the suppressed image described in step 2, obtain bilinear feature vector, input the bilinear feature vector to the fully connected layer, and obtain the predicted class probability value, And calculate the multi-class cross entropy for the predicted class probability value;

Step 3. Similar comparison module learning;

Step 31: Randomly select other N images in the same category as the original image as positive sample images;

Step 32: The target image and the positive sample image described in step 31 are sent to the feature extraction network described in step 1 for bilinear feature vector fusion, and the fusion feature is obtained. Feature vector;

Step 33: Average the bilinear eigenvectors of multiple images of the same category obtained in step 32 to obtain a fused feature vector, and input the fused feature vector into the fully connected layer to obtain the predicted probability. Calculate the multi-class cross entropy of the predicted probability of the same category;

Step 4: Calculate the central loss function _LC ;

Let v _i be the bilinear feature of the ith sample, c _i be the average feature of all samples in the corresponding category of sample i, that is, the class center, and N be the number of samples in the current batch, then the formula of the center loss function L _C is as follows:

Step 5. Model optimization loss function calculation;

The cross-entropy loss function of the bilinear feature vector of the original image, the cross-entropy loss function of the bilinear feature vector of the suppressed image, the cross-entropy loss function of the fusion feature, and the central loss function are weighted and summed to obtain the loss function of the model optimization. .

2. a kind of fine-grained image recognition method based on multi-view feature fusion according to claim 1, is characterized in that: in step 21, the specific process of generating attention map is:

Feature map output from the last convolutional layer in the feature extraction network

Calculate the average value p _d of each channel of , where D is the number of channels of the feature, H and W are the height and width of the feature map, respectively; according to the average value, the formula for obtaining the entropy E is:

Construct an attention map A by comparing the magnitude of entropy and threshold δ:

In the formula, F _k is the two-dimensional feature map corresponding to each channel sorted by channel;

In step 22, the specific process of generating the suppression mask is as follows:

Enlarge the attention map in step 21 to the original image size, calculate the average value m of the attention map, set a threshold θ between 0 and 1, take m*θ as the threshold, and set the attention map larger than the threshold Elements of m*θ are set to 0, other elements are set to 1, and the suppression mask M is obtained:

In the formula, A(x, y) is the value of the (x, y) position in the attention map A;

Covering the suppression mask to the original image to obtain the suppressed image Is ( _x , y) where the local area is masked;

Is (x,y)=I( _x ,y)*M(x,y)

In the formula, I(x,y) is the value of the (x,y) position in I in the original image.

3. a kind of fine-grained image recognition method based on multi-view feature fusion according to claim 1 is characterized in that: in step 33, the bilinear feature vectors of multiple images under the same category are averaged to obtain a The fused feature vector is expressed as:

In the formula, j is the position of the feature vector, V(j) is the value of the feature vector at the jth position, T is the number of positive samples selected; V _r (j) is the rth positive sample at the jth position. the value of;

4. a kind of fine-grained image recognition method based on multi-view feature fusion according to claim 1, is characterized in that: in step 5, the cross entropy loss function of the bilinear feature vector of the original image is L _CE1 , suppressing the image The cross-entropy loss function of the bilinear feature vector is L _CE2 , the cross-entropy loss function of the fusion feature is L _CE3 , and the central loss function is L _C , which are weighted and summed to obtain the model-optimized loss function L, and finally realize Fine-grained image recognition; expressed as:

In the formula, λ is the weight of the central loss function.