CN113159067A - Fine-grained image identification method and device based on multi-grained local feature soft association aggregation - Google Patents

Fine-grained image identification method and device based on multi-grained local feature soft association aggregation

Info

Publication number
CN113159067A
CN113159067A (application CN202110392237.9A)
Authority
CN
China
Prior art keywords
grained
output
path
features
fine
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110392237.9A
Other languages
Chinese (zh)
Inventor
孔建磊
金学波
王小艺
苏婷立
白玉廷
王宏兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Technology and Business University
Original Assignee
Beijing Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Technology and Business University filed Critical Beijing Technology and Business University
Priority to CN202110392237.9A
Publication of CN113159067A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features
    • G06V10/44 - Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/23 - Clustering techniques
    • G06F18/232 - Non-hierarchical techniques
    • G06F18/2321 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 - Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent

Abstract

The invention provides a fine-grained image identification method based on multi-grained local feature soft association aggregation, which comprises the following steps: carrying out enhanced data preprocessing on an image to be identified to obtain an enhanced image; extracting features of the enhanced image to obtain coarse-grained features, fine-grained features and medium-grained features respectively; and performing soft association aggregation on the coarse-grained features, the fine-grained features and the medium-grained features to obtain an image identification result of the image to be identified. According to the method, the multi-granularity local features are extracted by using a multi-stream parallel hybrid network, different dimensional features are effectively fused by using a soft correlation feature aggregation mode, parameter redundancy is eliminated, information complementation is realized, uniform probability description which finally represents fine-granularity identification is formed, and the identification precision and timeliness are improved. Experimental results show that the method performs well in the aspect of classification precision, and can be combined with other models to generate better results in the future.

Description

Fine-grained image identification method and device based on multi-grained local feature soft association aggregation
Technical Field
The present application relates to the field of image processing, and in particular, to a fine-grained image identification method and apparatus based on multi-grained local feature soft association aggregation.
Background
Deep learning is an important research branch of machine learning. In recent years, neural networks with deep structures have achieved major theoretical and methodological breakthroughs in image pattern recognition and have found mature applications in surveillance systems, intelligent robots, video analysis, and other areas. Compared with traditional machine learning, a deep learning network adjusts its representations hierarchically through a layer-by-layer progressive structure, so recognition no longer needs manually designed pattern features, realizing an end-to-end training and learning process. This approach, in which the network itself designs features of different levels from the data and extracts them layer by layer, has made breakthrough progress in theory, and the deep learning open-source platforms vigorously developed by enterprises, university research communities, and others have made deep learning models such as convolutional neural networks and recurrent neural networks basic modules for solving a wide range of problems. The convolutional neural network (CNN) is applied especially widely in computer vision: it can autonomously learn the implicit relations from image pixel features through low-level features and high-level abstract features to the final categories, is better at capturing the rich information content of data, avoids a complex manual design process, and has achieved great success in a series of large-scale, open, real-world recognition tasks.
However, image classification in complex real-world environments remains a very challenging task. Each type of real object is a dynamic visual target with complex characteristics, fine-grained features, and numerous subclasses, and environmental factors, background interference, and changes in device pose turn the task into a dynamic fine-grained visual classification (FGVC) problem, which raises the technical difficulty considerably. Current deep learning methods can obtain only static, coarse-grained feature descriptions and cannot reflect the subtle inter-class differences and dynamically varying intra-class fine-grained characteristics contained in large amounts of data. This directly makes existing deep transfer network models difficult to match to practical application systems and prevents research on fine-grained image recognition from progressing smoothly.
At present, there are two main types of fine-grained image classification methods: strongly supervised learning methods and weakly supervised learning methods. Strongly supervised methods heavily affect algorithm speed and rely on additional location annotations and expensive manual labeling, which makes them difficult to apply universally in practice. In recent years, part of the research has tended to improve the fine-grained feature mining capability of traditional deep transfer learning with weakly supervised information. These methods optimize the end-to-end feature encoding and adaptive local perception of coarse-grained deep networks by means of attention mechanisms, high-order covariance operation theory, multi-branch filtering combination theory, and the like; limited cognitive resources are allocated according to the internal associations of the data and the task requirements, key parts can be extracted automatically without additional supervision information, and fine-grained representations with better discrimination can be learned directly. In terms of technical performance, attention mechanisms and multi-branch filtering generate a large number of candidate local regions, so their classification accuracy is relatively high, but recognition speed hits a bottleneck and can hardly meet real-time requirements in complex scenes; high-order covariance operations need no candidate regions and are superior in speed, but their detection accuracy is insufficient and their parameter count is too high. Overall, existing fine-grained recognition methods are not suited to complex-scene applications and are difficult to match with actual system platforms and deployment on intelligent terminals.
Disclosure of Invention
In order to solve one of the above technical problems, the present invention provides a fine-grained image identification method and apparatus based on multi-grained local feature soft association aggregation.
The first aspect of the embodiments of the present invention provides a fine-grained image identification method based on multi-grained local feature soft association aggregation, where the method includes:
carrying out enhanced data preprocessing on an image to be identified to obtain an enhanced image;
extracting features of the enhanced image to obtain coarse-grained features, fine-grained features and medium-grained features respectively;
and performing soft association aggregation on the coarse-grained features, the fine-grained features and the medium-grained features to obtain an image identification result of the image to be identified.
Preferably, the preprocessing of the enhanced data on the image to be recognized to obtain the enhanced image includes:
and cutting and zooming, randomly turning, randomly rotating and changing the saturation and brightness of the picture to be identified to obtain an enhanced image.
Preferably, the process of respectively obtaining coarse-grained features, fine-grained features and medium-grained features by performing feature extraction on the enhanced image includes:
constructing a backbone network;
processing the enhanced image through the backbone network and outputting a multidimensional vector;
and respectively inputting the multi-dimensional vectors into a coarse-granularity feature extractor, a fine-granularity feature extractor and a medium-granularity feature extractor to obtain coarse-granularity features, fine-granularity features and medium-granularity features.
Preferably, the backbone network comprises an input module, four CSP Stage modules and a pooling layer;
the process of outputting the multidimensional vector after processing the enhanced image through the backbone network comprises:
inputting the enhanced image, after it passes through the input layer, into a first CSP Stage module to obtain a first output, wherein each CSP Stage module comprises a path b and a path c; the input of the first CSP Stage module passes through path b to obtain the output of path b, path b comprising a convolutional layer and a plurality of residual blocks, each residual block comprising three convolutional layers, the result of passing a residual block's input through its three convolutional layers being added to that input to give the residual block's output, and the output of the residual blocks being the output of path b; the input of the first CSP Stage module passes through path c to obtain the output of path c, path c comprising a convolutional layer, the output of that convolutional layer being the output of path c; and the output of path b and the output of path c are spliced, the spliced result being input into a convolutional layer to obtain the first output;
inputting the first output to a second CSP Stage module, where the first output passes through a down-sampling layer to obtain the output of path a, and the output of path a serves as the input of paths b and c in the second CSP Stage module to obtain a second output;
inputting the second output to a third CSP Stage module, where the second output passes through a down-sampling layer to obtain the output of path a, and the output of path a serves as the input of paths b and c in the third CSP Stage module to obtain a third output;
and inputting the third output to a fourth CSP Stage module, where the third output passes through a down-sampling layer to obtain the output of path a, and the output of path a serves as the input of paths b and c in the fourth CSP Stage module to obtain the multidimensional vector.
Preferably, the process of inputting the multidimensional vector to a coarse-grained feature extractor to obtain coarse-grained features comprises:
and adding the multidimensional vectors to calculate the average value to obtain coarse granularity characteristics.
Preferably, the process of inputting the multidimensional vector to a fine-grained feature extractor to obtain fine-grained features comprises:
performing attention mechanism processing on the multidimensional vector to obtain an attention diagram corresponding to a specific part in the enhanced image;
and performing product calculation on the multidimensional vector and the vector corresponding to the attention map, and extracting features through convolution to obtain fine-grained features.
Preferably, the process of inputting the multidimensional vector to a medium-granularity feature extractor to obtain medium-granularity features comprises:
performing dimensionality reduction processing on the multidimensional vector to obtain a dimensionality reduction vector;
carrying out covariance matrix calculation on the dimensionality reduction vector, and carrying out pre-normalization processing on a calculation result to obtain a processing vector;
performing post-compensation after performing multiple times of Newton-Schulz iterative computation on the processing vector to obtain a compensation result;
obtaining a fusion vector by aggregating the coarse-grained characteristic, the fine-grained characteristic and the compensation result;
and carrying out normalization and convolution calculation on the fusion vector to obtain medium granularity characteristics.
A second aspect of the embodiments of the present invention provides a fine-grained image recognition apparatus based on multi-grained local feature soft association aggregation, where the apparatus includes a processor configured with operating instructions executable by the processor to perform the method steps according to the first aspect of the embodiments of the present invention.
A third aspect of embodiments of the present invention provides a computer-readable storage medium, which includes a computer program and which, when run on an electronic device, causes the electronic device to perform the method steps as described in the first aspect of embodiments of the present invention.
A fourth aspect of embodiments of the present invention provides a chip, coupled to a memory, for executing a computer program stored in the memory to perform the method steps according to the first aspect of embodiments of the present invention.
The invention has the following beneficial effects: the invention provides a fine-grained image identification method based on multi-grained local feature soft-associated aggregation, which is characterized in that multi-grained local features are extracted by using a multi-stream parallel hybrid network, different dimensional features are effectively fused by using a soft-associated feature aggregation mode, parameter redundancy is eliminated, information complementation is realized, uniform probability description for finally representing fine-grained identification is formed, and identification precision and timeliness are improved. Experimental results show that the method performs well in the aspect of classification precision, and can be combined with other models to generate better results in the future.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a flowchart of a fine-grained image identification method based on multi-grained local feature soft association aggregation according to embodiment 1 of the present invention;
fig. 2 is a schematic diagram of enhanced data preprocessing performed on an image to be recognized according to embodiment 1 of the present invention;
fig. 3 is a schematic structural diagram of a backbone network according to embodiment 1 of the present invention;
fig. 4 is a schematic structural diagram of a bilinear attention pooling module according to embodiment 1 of the present invention;
FIG. 5 is a schematic structural diagram of the iSQRT-COV module according to embodiment 1 of the present invention;
FIG. 6 is a comparison chart of the extraction results of different granularity feature modules according to embodiment 1 of the present invention;
fig. 7 is a schematic diagram of a NetVLAD feature aggregation module according to embodiment 1 of the present invention;
FIG. 8 is a diagram illustrating the per-class accuracy on CUB-200-2011;
FIG. 9 is a comparison graph of the classification results on CUB-200-2011, class 102;
FIG. 10 is a schematic diagram of the accuracy of each class of Stanford Cars;
FIG. 11 is a comparison of the results of classification of Stanford Cars class 24;
FIG. 12 is a schematic diagram of the accuracy of each class of Stanford Dogs;
FIG. 13 is a comparison of the results of classification by Stanford Dogs class 1.
Detailed Description
In order to make the technical solutions and advantages of the embodiments of the present application more apparent, the following further detailed description of the exemplary embodiments of the present application with reference to the accompanying drawings makes it clear that the described embodiments are only a part of the embodiments of the present application, and are not exhaustive of all embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Embodiment 1
As shown in fig. 1, the present embodiment provides a fine-grained image recognition method based on multi-grained local feature soft association aggregation, where the method includes:
s101, performing enhanced data preprocessing on an image to be identified to obtain an enhanced image;
s102, extracting features of the enhanced image to obtain coarse-grained features, fine-grained features and medium-grained features respectively;
s103, performing soft association aggregation on the coarse-grained features, the fine-grained features and the medium-grained features to obtain an image identification result of the image to be identified.
Specifically, in this embodiment, data is enhanced through several image preprocessing steps, which expand the dataset and increase the generalization ability of the model. These steps can include geometric transformations such as random cropping and scaling, with the scaled images kept at a consistent size (e.g., 224 × 224) to counter differences in shooting angle; random horizontal and vertical flipping to increase image diversity; random rotation by 90°, 180°, or 270° to improve deformation adaptability; changes of picture saturation and brightness to simulate illumination variation; random addition of noise; and so on. Example results of these methods are shown in fig. 2; a sketch of such a pipeline is given below.
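As a concrete illustration, a minimal preprocessing pipeline in the spirit of the augmentations above might look as follows in PyTorch/torchvision; the parameter values are illustrative assumptions, not specified by the patent:

import torchvision.transforms as T

# Illustrative augmentation pipeline; all magnitudes are assumed values.
augment = T.Compose([
    T.RandomResizedCrop(224),                       # random crop, rescale to 224x224
    T.RandomHorizontalFlip(),                       # random horizontal flip
    T.RandomVerticalFlip(),                         # random vertical flip
    T.RandomChoice([                                # random 90/180/270 degree rotation
        T.RandomRotation((90, 90)),
        T.RandomRotation((180, 180)),
        T.RandomRotation((270, 270)),
    ]),
    T.ColorJitter(brightness=0.4, saturation=0.4),  # vary brightness and saturation
    T.ToTensor(),
])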
The mixup enhancement method is additionally applied to the data:
$$\tilde{x} = \lambda x_a + (1 - \lambda) x_b$$
$$\tilde{y} = \lambda y_a + (1 - \lambda) y_b$$
where $x_a$ and $x_b$ are the original input images, $y_a$ and $y_b$ are the labels corresponding to the images, and $\lambda$ is a random number in the range $[0, 1]$. The enhanced image and the new label are finally obtained.
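A minimal sketch of the mixup step follows; drawing $\lambda$ from a Beta distribution is an assumption, since the text only states that $\lambda$ lies in $[0, 1]$:

import torch

def mixup(x_a, y_a, x_b, y_b, alpha=1.0):
    # Mixup enhancement: blend two images and their (one-hot) labels.
    # Beta(alpha, alpha) sampling is an assumed choice for lambda in [0, 1].
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    x = lam * x_a + (1.0 - lam) * x_b
    y = lam * y_a + (1.0 - lam) * y_b
    return x, y

x, y = mixup(torch.rand(3, 224, 224), torch.tensor([1.0, 0.0]),
             torch.rand(3, 224, 224), torch.tensor([0.0, 1.0]))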
After the enhanced image is obtained, a cross-level multi-stream feature extractor is constructed. It comprises four branches, all adopting the same CSPResNeXt50 as the feature extraction backbone. As shown in fig. 3, the basic CSPResNeXt50 network consists of one input layer, four CSP Stage modules, and a final pooling layer. The input layer contains a convolutional layer with a 7 × 7 kernel, stride 2, and 64 output channels, followed by a max pooling layer with a 2 × 2 kernel and stride 2. In the first CSP Stage module, path b first passes through a convolutional layer with a 1 × 1 kernel, stride 1, and 64 output channels, and then through several residual blocks. Each residual block also contains two paths: path 1 passes through 3 convolutional layers with kernel sizes 1 × 1, 3 × 3, and 1 × 1 and 128 output channels, while path 2 carries the block input, which is added directly to the result of path 1 to give the block output. The remaining residual blocks follow the same computation, producing the output of path b. Path c likewise passes through a convolutional layer with a 1 × 1 kernel, stride 1, and 64 output channels; its output is then spliced directly with the output of path b and passed through a 1 × 1 convolutional layer with 256 output channels. The remaining three CSP Stage modules differ from the first in that the input first passes through a down-sampling layer with a 3 × 3 kernel to obtain the output of path a, which then passes through path b and path c respectively; the rest is the same as in the first CSP Stage module, and the output dimension doubles after each CSP Stage module. The formulas in each CSP Stage module are as follows:
$$x_a = f_{3\times3}(x)$$
$$x_b = f_b^{1\times1}(x_a), \qquad x_c = f_c^{1\times1}(x_a)$$
$$g(x_b)_i = f_{Res}\big(g(x_b)_{i-1}\big)$$
$$x_{out} = f^{1\times1}\big(g(x_b)_n \oplus x_c\big)$$
where $x$ is the input of the CSP Stage module, $f_{3\times3}(\cdot)$ denotes the down-sampling layer and $x_a$ its output; $f_b^{1\times1}(\cdot)$ and $f_c^{1\times1}(\cdot)$ denote the convolutional layers traversed by paths b and c, with $x_b$ and $x_c$ their respective outputs; $g(x_b)_0$ denotes the input of the 0th residual block, $g(x_b)_i$ and $g(x_b)_{i-1}$ the output and input of the $i$th residual block, and $f_{Res}(\cdot)$ the residual computation on its input; $\oplus$ indicates splicing in the channel dimension. In fig. 3, the numbers of residual blocks in the four CSP Stage modules are $n = 3, 3, 5, 2$. The backbone network finally outputs a 2048-dimensional vector $F$:
$$F = H_{csp}(X_{in})$$
where $H_{csp}(\cdot)$ denotes all the layers of the backbone network and $X_{in}$ is the input to the backbone network. The activation function in the network is Mish, whose expression is:
$$f(x) = x \cdot \tanh\big(\ln(1 + e^x)\big)$$
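To make the CSP Stage structure concrete, the following PyTorch sketch implements the Mish activation, a bottleneck residual block, and one CSP Stage with paths a, b, and c as described above; channel counts and layer details are simplified assumptions rather than the exact CSPResNeXt50 configuration:

import torch
import torch.nn as nn

class Mish(nn.Module):
    # Mish activation: f(x) = x * tanh(ln(1 + exp(x)))
    def forward(self, x):
        return x * torch.tanh(nn.functional.softplus(x))

class ResidualBlock(nn.Module):
    # Bottleneck block: 1x1 -> 3x3 -> 1x1 convolutions; the block input
    # is added to the convolution output.
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 1), Mish(),
            nn.Conv2d(channels, channels, 3, padding=1), Mish(),
            nn.Conv2d(channels, channels, 1),
        )

    def forward(self, x):
        return x + self.body(x)

class CSPStage(nn.Module):
    # One CSP Stage: optional down-sampling path a, then path b
    # (1x1 conv + n residual blocks) and path c (1x1 conv), whose outputs
    # are concatenated in the channel dimension and fused by a 1x1 conv.
    def __init__(self, in_ch, mid_ch, out_ch, n_blocks, downsample=False):
        super().__init__()
        self.path_a = (nn.Conv2d(in_ch, in_ch, 3, stride=2, padding=1)
                       if downsample else nn.Identity())
        self.path_b = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1),
            *[ResidualBlock(mid_ch) for _ in range(n_blocks)])
        self.path_c = nn.Conv2d(in_ch, mid_ch, 1)
        self.fuse = nn.Conv2d(2 * mid_ch, out_ch, 1)

    def forward(self, x):
        x = self.path_a(x)
        return self.fuse(torch.cat([self.path_b(x), self.path_c(x)], dim=1))

stage = CSPStage(64, 64, 256, n_blocks=3)
out = stage(torch.randn(1, 64, 56, 56))   # -> (1, 256, 56, 56)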
after the backbone network is constructed, the multidimensional vector output by the backbone network is used as input to be extracted in a feature extractor, so that coarse-grained features, fine-grained features and medium-grained features are obtained respectively.
In this embodiment, global average pooling (GAP) is employed as the coarse-grained feature extractor. GAP averages all pixel values of a feature map into a single value, so that this value represents the corresponding feature map. Replacing the fully connected layer with global average pooling avoids the excessive parameters a fully connected layer would bring, effectively reduces computation, and mitigates overfitting. Global average pooling regularizes the whole network structurally to prevent overfitting, removes the black-box character of the fully connected layer by giving each channel a direct class meaning, and can effectively extract sufficiently good coarse-grained features.
$$f = P_{GAP}(F)$$
where $P_{GAP}$ denotes global average pooling and $F$ is the feature extracted by the backbone network; after this step, the coarse-grained representation $f$ of the image is obtained.
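In code, the coarse-grained extractor is just a spatial mean; a minimal sketch:

import torch

def global_average_pooling(feature_map: torch.Tensor) -> torch.Tensor:
    # Average each channel's feature map to a single value:
    # (batch, channels, height, width) -> (batch, channels)
    return feature_map.mean(dim=(2, 3))

f = global_average_pooling(torch.randn(8, 2048, 7, 7))   # -> (8, 2048)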
In this embodiment, a bilinear attention pooling (BAP) module is used to construct the fine-grained feature extractor. BAP is an attention pooling layer for extracting fine-grained features: by applying an attention mechanism to the feature vectors, each attention map can be directed to a specific part of the object. A feature map of the attended local object is generated by multiplying the attention map with the feature map, and features are then extracted by convolution. For example, 64 attention maps may be generated, allowing 64 different parts of a feature to be attended to. Each feature map $F_i$ ($i = 1, 2, \dots, N$) is element-wise multiplied with each attention map $a_k$ to obtain M local feature maps:
$$F_m = a_k \odot F_i$$
where $k = 1, 2, \dots, 64$ and $F_m$ denotes the local feature maps corresponding to all attention maps of one feature map. After the local feature maps are obtained, the local features are fed into a feature extraction function $g(\cdot)$ to obtain a 64-dimensional vector:
$$f_m = g(F_m)$$
After the N feature maps are processed in turn, an $M \times N$-dimensional vector is generated; finally, one convolutional layer extracts the most discriminative local feature $F_2$:
$$F_2 = P_{BAP}(F) = f^{1\times1}\big(\big[g(a_1 \odot F),\, g(a_2 \odot F),\, \dots,\, g(a_M \odot F)\big]\big)$$
where $P_{BAP}(\cdot)$ represents the overall feature extraction, $F$ is the feature extracted by the backbone network, $a_k$ is the $k$th attention map, $\odot$ denotes element-wise multiplication of two feature maps, and $g(\cdot)$ is the feature extraction function; this embodiment uses global average pooling in the network. In the experiments, $M = 64$ attention maps are used to better attend to fine-grained features; $P_{BAP}(\cdot)$ extracts 131072 features, which are reduced to a 2048-dimensional $F_2$ as the image feature. Finally, the modules are stacked to obtain different local feature vectors representing the fine-grained image; a comparison of coarse-grained and fine-grained features extracted by different modules is shown in fig. 4.
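The following sketch illustrates the BAP computation under stated assumptions: the attention maps come from a hypothetical 1 × 1 convolution, and a linear projection stands in for the final reduction convolution:

import torch
import torch.nn as nn

class BilinearAttentionPooling(nn.Module):
    # Sketch of BAP: M attention maps are produced, each is element-wise
    # multiplied with the feature maps, each part feature is pooled by
    # global average pooling, and a final projection reduces the M*C
    # part features to a 2048-d vector. Layer choices are assumptions.
    def __init__(self, in_channels=2048, num_attention=64, out_dim=2048):
        super().__init__()
        self.attention = nn.Conv2d(in_channels, num_attention, kernel_size=1)
        self.reduce = nn.Linear(num_attention * in_channels, out_dim)

    def forward(self, features):                       # (B, C, H, W)
        maps = torch.relu(self.attention(features))    # (B, M, H, W)
        # g(a_k * F): multiply and average over spatial positions
        parts = torch.einsum('bmhw,bchw->bmc', maps, features)
        parts = parts / (features.shape[2] * features.shape[3])
        return self.reduce(parts.flatten(1))            # 64*2048=131072 -> 2048

bap = BilinearAttentionPooling()
f2 = bap(torch.randn(2, 2048, 7, 7))                    # -> (2, 2048)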
iSQRT-COV is a second-order covariance pooling layer that can be combined with other backbone networks; it uses an iterative matrix square root algorithm to train global covariance matrix pooling quickly, end to end. Second-order covariance pooling replaces traditional first-order pooling such as global max or global average pooling; it attends to the correlations between different channels, improves the CNN's ability to model complex features, and better captures fine-grained information. Concretely, the pooling layer is a meta-layer with a cyclically nested directed graph, composed of three consecutive layers. The first performs pre-normalization, dividing the covariance matrix by its trace or Frobenius norm to guarantee convergence of the subsequent Newton-Schulz iteration. The second performs a fixed number of Newton-Schulz iterations of the coupled matrix equations to compute an appropriate square root of the matrix. The third performs post-compensation: because the first layer greatly rescales the input data, the third layer must multiply by the trace of the square-root matrix. The specific structure is shown in fig. 5. Before entering the covariance pooling layer, the feature channels are first reduced in dimension to extract representative features; the feature obtained after dimension reduction has size 128 × 7 × 7 and is reshaped into a 128 × 49 tensor. The covariance matrix of this tensor is then computed and fed into the meta-layer. In the meta-layer, pre-normalization first ensures convergence of the subsequent iteration, multiple Newton-Schulz iterations then approximate the matrix square root, and post-compensation finally yields the result $F_1$:
$$F_1 = P_{iSQRT}\big(F, \{\Sigma_F, N, Y, Z\}\big)$$
where the function $P_{iSQRT}$ represents the full computation of iSQRT-COV, $F$ is the feature extracted by the backbone network, $\Sigma_F$ is the covariance matrix of the input features $F$, $N$ is the number of iterations, and $Y$ and $Z$ are intermediate variables of the iteration used to compute the backward gradient. In this network the covariance matrix $\Sigma_F$ is $128 \times 128$, and the resulting output $F_1$ is an 8256-dimensional vector used as the image feature.
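The core of iSQRT-COV, pre-normalization followed by coupled Newton-Schulz iterations and trace compensation, can be sketched as follows; the iteration count and the covariance construction are illustrative assumptions:

import torch

def isqrt_cov(sigma, num_iters=5):
    # Approximate matrix square root via coupled Newton-Schulz iterations:
    # pre-normalize by the trace, iterate, then post-compensate by the
    # square root of the trace.
    n = sigma.shape[0]
    I = torch.eye(n, dtype=sigma.dtype)
    tr = sigma.diagonal().sum()
    A = sigma / tr                         # pre-normalization (convergence)
    Y, Z = A, I
    for _ in range(num_iters):             # coupled Newton-Schulz iterations
        T = 0.5 * (3.0 * I - Z @ Y)
        Y, Z = Y @ T, T @ Z
    return Y * torch.sqrt(tr)              # post-compensation

# Example: 128 channels, 49 spatial positions, as in the text
X = torch.randn(128, 49)
X = X - X.mean(dim=1, keepdim=True)
sigma = X @ X.t() / 49                     # 128 x 128 covariance matrix
sqrt_sigma = isqrt_cov(sigma)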
Through the above process, after features are extracted by the three branch networks, 3 groups of features representing the image are obtained. Different feature extractors have different preferences in feature extraction, as shown in fig. 6. To classify better and reduce the model's overfitting, NetVLAD is used to aggregate these features and extract the medium-grained feature, as shown in fig. 7.
NetVLAD captures aggregated local feature information over the image. First, the three groups of input features are clustered with the k-means algorithm to obtain k cluster centers, denoted $C_k$, and the feature vector space is subdivided according to these cluster centers. The residual distance of each feature to its local cluster center is computed, and the residual sum between each local feature and its corresponding cluster center is stored. The features of each image are quantized, each local feature is aggregated at its nearest cluster center, and after quantization the feature space is divided into several unit subspaces. Features of the same cluster subspace are aggregated into one overall feature representation:
$$V(j, i) = \sum_{l} a_i(F_l)\big(F_l(j) - c_i(j)\big)$$
where $j$ indexes the current feature descriptor and $c_i(j)$ denotes the $j$th component of the cluster center $c_i$ close to $F_l$. During training the $c_i$ are updatable parameters, which makes the VLAD layer more flexible. In conventional VLAD, however, the membership $a_i(x_l)$ is a discontinuous value of 1 or 0 satisfying $\sum_i a_i(x_l) = 1$; this discrete expression cannot be back-propagated through. Therefore, the membership of features to cluster centers in the traditional clustering strategy is rewritten in a soft-association form; after this soft-association processing, NetVLAD realizes residual statistics between the feature distribution and the cluster centers. The weight assignment is:
$$a_i(F_l) = \frac{e^{w_i^{\top} F_l + b_i}}{\sum_{i'} e^{w_{i'}^{\top} F_l + b_{i'}}}$$
where $l = \{1, 2, 3\}$ corresponds to the outputs of the multi-stream feature extraction layer, $i$ denotes the $i$th cluster center and $i'$ ranges over the cluster centers in the normalization; $w_i, b_i$ and $w_{i'}, b_{i'}$ are the updatable parameters of the corresponding training stage; and $a_i(F_l)$ is the membership of feature $F_l$ to the $i$th cluster center. The final vector is obtained by combining the feature sums of all components through this soft weight distribution:
$$F'(j, i) = \sum_{l=1}^{3} a_i(F_l)\big(F_l(j) - c_i(j)\big)$$
and carrying out normalization and convolution operation on the vector F' to obtain a final output aggregation vector V of the NetVLAD layer, wherein the vector is the medium-granularity characteristic of the original image.
After the fused feature is obtained, it is the final feature produced for the input image by the feature extraction network and is used for subsequent classification. Label smoothing is used to reduce the model's overfitting:
$$\hat{y} = (1 - \varepsilon)\, y + \varepsilon\, u$$
where $y$ denotes the sample label, $\varepsilon$ is a smoothing factor, and $u$ is the class score distribution. Label smoothing pushes the post-softmax classification probabilities of the neural network toward the correct class and as far as possible away from the wrong classes, thereby improving classification performance. The final loss function of this embodiment is therefore:
$$L = \lambda L_{fuse} + (1 - \lambda) L_{feature}, \qquad L_{\ast} = -\sum_{c=1}^{W} \hat{y}_c \log P_c$$
where $L_{fuse}$ is the loss function of the probability fusion module, $L_{feature}$ is the loss function of the multi-stream feature extraction layer, $\lambda \in [0, 1]$ is a weighting factor balancing the losses of the two modules, $W$ is the number of classes, $\hat{y}_c$ is the smoothed label, and $P_c$ is the probability that the sample belongs to class $c$. In implementation, this loss function is used to optimize the whole network structure and improve fine-grained classification performance. Through the above operations, the proposed fusion model obtains an overall representation of the prediction score from a decision-level perspective; in effect, a joint posterior probability is obtained by integrating multiple prior probabilities from each component model of the multi-stream feature extraction layer and the clustering layer.
Embodiment 2
Corresponding to embodiment 1, this embodiment proposes an apparatus for fine-grained image recognition based on multi-grained local feature soft association aggregation, where the apparatus includes a processor configured with processor-executable operating instructions to perform the following steps:
carrying out enhanced data preprocessing on an image to be identified to obtain an enhanced image;
extracting features of the enhanced image to obtain coarse-grained features, fine-grained features and medium-grained features respectively;
and performing soft association aggregation on the coarse-grained features, the fine-grained features and the medium-grained features to obtain an image identification result of the image to be identified.
The specific working principle of the apparatus of this embodiment can be found in the description of embodiment 1 and is not repeated here. In this embodiment, multi-granularity local features are extracted with a multi-stream parallel hybrid network, features of different dimensions are effectively fused through soft-association feature aggregation, parameter redundancy is eliminated, information complementation is realized, a unified probability description finally representing fine-grained recognition is formed, and recognition accuracy and timeliness are improved. Experimental results show that the method performs well in classification accuracy and can be combined with other models to produce better results in the future.
The practical application effect of the method provided by the invention is illustrated by three specific experiments on CUB-200-2011 (Caltech-UCSD Birds), Stanford Cars, and Stanford Dogs. Detailed statistics, including the number of classes and the data splits, are summarized in table 1.
TABLE 1
[Table 1 is reproduced as an image in the original publication.]
Experiment 1
The experiments were first performed on the CUB-200-2011 dataset, and accuracy (the ratio between the number of correctly classified images and the number of test images) was used to evaluate performance. The overall accuracy of the constructed model, compared with several previous methods, is shown in table 2 below.
TABLE 2
[Table 2 is reproduced as an image in the original publication.]
As shown in table 2, the probabilistic fusion decision model clearly outperforms current state-of-the-art models, reaching 91.2% accuracy. In contrast, the best strongly supervised method (SPDA-CNN), which uses training bounding boxes to classify fine-grained data, achieves only 85.2% accuracy, 6% lower than the proposed fusion model, which requires only image-level labels yet performs even better than fusion models needing additional information. For weakly supervised training without bounding boxes, the DFLNet and GMNet models achieve 87.5% and 86.3% accuracy respectively, which also demonstrates the effectiveness of the proposed framework.
For the individual component models of the multi-stream feature extractor, the CSPResNeXt50 model reaches 86.6% accuracy relying only on coarse image-level labels. Practice shows that the deep structure of CSPResNeXt50 extracts impressive and effective feature maps suitable for further attention to local information. Similarly, BAP reaches 88.8% and iSQRT-COV 87.2%, showing that these additional operations obtain rich local feature representations of discriminative object parts and improve accuracy on the fine-grained visual classification task. After the fusion processing of the invention, the proposed probability fusion module improves accuracy by 2.4-4.6% over the prior single models. Even when only two models are combined, the fusion module still plays an important role, mining the complementary characteristics of different models to obtain higher accuracy. The results show that combining BAP and iSQRT-COV reaches 90.3% accuracy, and combining BAP and DFL reaches 89.7%; both are superior to any single component model and to other weakly supervised learning methods. The decision-level probability fusion module can exploit the mixed-granularity information of multiple CNNs using only image-level labels, and the end-to-end inner-and-outer-ring fusion module effectively improves the overall accuracy of the fine-grained visual classification problem.
Based on the obtained results, the per-subclass accuracy of the three models provided by the multi-stream fine-grained extraction module before fusion was further analyzed on the CUB-200-2011 dataset. The quantitative analysis of each model is shown in fig. 8. Although the models' recognition abilities differ across classes, the accuracy curve after cross-level fusion (the thickest line) is relatively flat and stable. This means the fusion strategy can mine small inter-class differences that the individual models miss, improving the recognition rate of different subclasses and further balancing the uncertainty between models. The fused model combines complementary feature mappings and has strong inter-class discrimination on images of different fine-grained types. For example, DFL has an accuracy of 16.7% on category 58 and 66.7% on category 66. After module fusion, the accuracy of category 58 improves to 53.3%, and the corresponding accuracy of category 66 reaches 83.3%. The probability fusion module genuinely reduces the recognition differences of single models across classes, thereby improving overall accuracy.
In contrast, the accuracy differences between models within the same category are significant. For example, on category 102, the species Sayornis, the accuracy of DFL is 80.3%, while BAP and iSQRT-COV reach 86.7% and 90.2% respectively. The results show that the fusion technique effectively suppresses intra-class variation between different images and samples of the same subclass. Although interference factors such as position or illumination changes limit the accuracy gains of fusion in some categories, the fusion accuracy on category 102 reaches a remarkable 100%. The error-complementation process on category 102 is analyzed in detail in fig. 9: for 20 pictures of the Sayornis class, bold frames indicate pictures that a model predicted incorrectly as other classes, and the rest are correctly predicted. All three component models made some errors; only the model of the invention is completely correct. This proves that cross-stage fusion can reasonably select domain information from the various feature maps extracted by each model, enhance the controllability of inter-class and intra-class variables, and improve the overall performance of the fine-grained visual classification task.
Experiment 2
This experiment performs FGVC on the Stanford Cars dataset. Table 3 lists the final accuracies of the PFDM-Net model and some of the latest competitors.
TABLE 3
[Table 3 is reproduced as an image in the original publication.]
As shown above, the accuracies of the different models are higher overall than on CUB-200-2011. The main reason is that more images are available for training and testing while there are fewer categories, which satisfies the needs of deep learning training and maximizes performance. Furthermore, the differences between vehicles are pronounced compared with birds in complex environmental backgrounds, so better accuracy results are easier to obtain. Even so, the invention still obtains the best performance, 95.2% on the accuracy index, 1.3% higher than the next best result of 93.9%. The model adapts to different datasets and tasks, has good generalization, and provides an effective approach for other real fine-grained visual classification tasks with more complex backgrounds.
By contrast, the individual component models in the multi-stream architecture also obtain good results. The accuracy of DFL is 92.9%, close to the 93.1% of SPDA-CNN with its abundant part annotations, while BAP reaches 93.6% and iSQRT-COV 93.3%. Comparison shows that the accuracy of the two combined models is still much higher than the previous models, improving to 94.1% (combining BAP and iSQRT-COV) and 93.9% (combining BAP and DFL). Both results are comparable or superior to other single weakly supervised models.
As shown in fig. 10, each model has different preferences in identifying different similar classes. The fusion model, however, can balance the strengths and weaknesses of each component model and thus achieve better local performance within the same category. For example, on category 70, labeled "Chevrolet Express 2007", the accuracy of iSQRT-COV is 58%, that of BAP is 51.4%, and that of DFL only 40.0%, lower even than a 50% random guess. The fusion algorithm, combining iteration with gradient back-propagation optimization, improves the accuracy of category 70 to 69.6%. Analysis of category 24, the car named "Audi TT RS Coupe 2012", shows similar results. As shown in fig. 11, DFL incorrectly identifies 9 samples as other classes, such as Tesla and BMW models; BAP and iSQRT-COV perform similarly poorly on this Audi prediction, with 5 and 4 false images respectively (outlined in bold boxes). The results show that the model of the invention can integrate the extraction capabilities of its components and better classify car types at fine granularity.
Experiment 3
Comparable experiments demonstrate the classification results on Stanford Dogs, listed in table 4. R-CNN is one of the early works in this area; it extracts traditional features for different parts of the image and performs supervised alignment of the targets. It achieves only 79.8% accuracy on this dataset, showing that the part extraction process matters for fine-grained classification. RA-CNN designs an unsupervised part discovery method by selecting salient parts and reaches 87.3% accuracy. Currently, GMNet achieves the best prior result with 88.1% accuracy, while the three component models DFL, BAP, and iSQRT-COV reach 84.9%, 87.5%, and 88.1% respectively. On this dataset, the result of the invention surpasses all the most advanced strongly and weakly supervised learning methods, reaching 89.6% accuracy, an improvement of at least 1.5%.
TABLE 4
[Table 4 is reproduced as an image in the original publication.]
FIG. 12 shows the per-class recognition curves of PFDM-Net and its component models. The red curve, corresponding to the model of the invention, is relatively stable without large fluctuations, indicating a better ability to distinguish subclasses. Although the fusion model performs well overall on Stanford Dogs, which has more training data, the results do not match those on the datasets above. This is because the number of images per class is very unbalanced, and the biological morphology of pet dogs varies greatly and changes over the growth cycle. Worse, the backgrounds in this dataset are more complex, filled with vehicles, people, and goods, which enlarges intra-class variation and makes fine-grained recognition more difficult.
FIG. 13 illustrates category 1 of the Stanford Dogs dataset. For 20 randomly chosen pictures, the model of the invention outperforms the other methods, with only two errors, indicated by red boxes. In one wrong picture, a dog with a black-and-white pattern is clearly different from the other samples; in the other, the dog is hidden from view by its owner, which defeats all the networks. Such special cases occur frequently in the Stanford Dogs dataset and limit further improvement of the model in local and global accuracy.
Result analysis: as shown by experiments 1, 2, and 3, the multi-level feature extraction and the algorithm that performs probability fusion on the features are genuinely innovative and practical. The accuracy of the method reaches 91.2% on CUB-200-2011, 95.2% on Stanford Cars, and 89.6% on Stanford Dogs, superior to the most advanced fine-grained visual classification methods. The invention effectively improves the performance of the fine-grained visual classification task and has definite research significance and engineering application value.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (10)

1. A fine-grained image identification method based on multi-grained local feature soft association aggregation is characterized by comprising the following steps:
carrying out enhanced data preprocessing on an image to be identified to obtain an enhanced image;
extracting features of the enhanced image to obtain coarse-grained features, fine-grained features and medium-grained features respectively;
and performing soft association aggregation on the coarse-grained features, the fine-grained features and the medium-grained features to obtain an image identification result of the image to be identified.
2. The method according to claim 1, wherein the pre-processing of the enhanced data is performed on the image to be recognized, and the obtaining of the enhanced image comprises:
and cutting and zooming, randomly turning, randomly rotating and changing the saturation and brightness of the picture to be identified to obtain an enhanced image.
3. The method according to claim 1, wherein the step of performing feature extraction on the enhanced image to obtain coarse-grained features, fine-grained features and medium-grained features respectively comprises:
constructing a backbone network;
processing the enhanced image through the backbone network and outputting a multidimensional vector;
and respectively inputting the multi-dimensional vectors into a coarse-granularity feature extractor, a fine-granularity feature extractor and a medium-granularity feature extractor to obtain coarse-granularity features, fine-granularity features and medium-granularity features.
4. The method of claim 3, wherein the backbone network comprises one input module, four CSP Stage modules, and one pooling layer;
the process of outputting the multidimensional vector after processing the enhanced image through the backbone network comprises:
inputting the enhanced image, after it passes through the input layer, into a first CSP Stage module to obtain a first output, wherein each CSP Stage module comprises a path b and a path c; the input of the first CSP Stage module passes through path b to obtain the output of path b, path b comprising a convolutional layer and a plurality of residual blocks, each residual block comprising three convolutional layers, the result of passing a residual block's input through its three convolutional layers being added to that input to give the residual block's output, and the output of the residual blocks being the output of path b; the input of the first CSP Stage module passes through path c to obtain the output of path c, path c comprising a convolutional layer, the output of that convolutional layer being the output of path c; and the output of path b and the output of path c are spliced, the spliced result being input into a convolutional layer to obtain the first output;
inputting the first output to a second CSP Stage module, obtaining the output of a path a through a down-sampling layer by the first output, and obtaining a second output after taking the output of the path a as the input of a path b and a path c in the second CSP Stage module;
inputting the second output to a third CSP Stage module, wherein the second output passes through a down-sampling layer to obtain the output of a path a, and the output of the path a is used as the input of a path b and a path c in the third CSP Stage module to obtain a third output;
and inputting the third output to a fourth CSP Stage module, obtaining the output of a path a through a down-sampling layer by the third output, and obtaining the multidimensional vector after taking the output of the path a as the input of a path b and a path c in the fourth CSP Stage module.
5. The method of claim 3, wherein inputting the multidimensional vector to a coarse-grained feature extractor to obtain coarse-grained features comprises:
and adding the multidimensional vectors to calculate the average value to obtain coarse granularity characteristics.
6. The method of claim 3, wherein inputting the multidimensional vector to a fine-grained feature extractor to obtain fine-grained features comprises:
performing attention mechanism processing on the multidimensional vector to obtain an attention diagram corresponding to a specific part in the enhanced image;
and performing product calculation on the multidimensional vector and the vector corresponding to the attention map, and extracting features through convolution to obtain fine-grained features.
7. The method of claim 3, wherein inputting the multidimensional vector to a medium-granularity feature extractor to obtain medium-granularity features comprises:
performing dimensionality reduction processing on the multidimensional vector to obtain a dimensionality reduction vector;
carrying out covariance matrix calculation on the dimensionality reduction vector, and carrying out pre-normalization processing on a calculation result to obtain a processing vector;
performing post-compensation after performing multiple times of Newton-Schulz iterative computation on the processing vector to obtain a compensation result;
obtaining a fusion vector by aggregating the coarse-grained characteristic, the fine-grained characteristic and the compensation result;
and carrying out normalization and convolution calculation on the fusion vector to obtain medium granularity characteristics.
8. An apparatus of fine-grained image recognition based on multi-grained local feature soft association aggregation, the apparatus comprising a processor configured with processor-executable operating instructions to perform the method steps of any one of claims 1 to 7.
9. A computer-readable storage medium, characterized in that the computer-readable storage medium comprises a computer program which, when run on an electronic device, causes the electronic device to carry out the method steps of any of claims 1 to 7.
10. A chip, characterized in that the chip is coupled with a memory for executing a computer program stored in the memory for performing the method steps of any of claims 1 to 7.
CN202110392237.9A 2021-04-13 2021-04-13 Fine-grained image identification method and device based on multi-grained local feature soft association aggregation Pending CN113159067A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110392237.9A CN113159067A (en) 2021-04-13 2021-04-13 Fine-grained image identification method and device based on multi-grained local feature soft association aggregation


Publications (1)

Publication Number Publication Date
CN113159067A true CN113159067A (en) 2021-07-23

Family

ID=76890192

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110392237.9A Pending CN113159067A (en) 2021-04-13 2021-04-13 Fine-grained image identification method and device based on multi-grained local feature soft association aggregation

Country Status (1)

Country Link
CN (1) CN113159067A (en)


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Jianlei Kong et al., "Multi-stream hybrid architecture based on cross-level fusion strategy for fine-grained crop species recognition in precision agriculture", Computers and Electronics in Agriculture, 7 April 2021 (2021-04-07)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113688894A (en) * 2021-08-19 2021-11-23 匀熵科技(无锡)有限公司 Fine-grained image classification method fusing multi-grained features
CN113688894B (en) * 2021-08-19 2023-08-01 匀熵科技(无锡)有限公司 Fine granularity image classification method integrating multiple granularity features
CN114140700A (en) * 2021-12-01 2022-03-04 西安电子科技大学 Step-by-step heterogeneous image template matching method based on cascade network
CN115061765A (en) * 2022-06-20 2022-09-16 未鲲(上海)科技服务有限公司 Message card ejection method and device, electronic equipment and storage medium
CN115035389A (en) * 2022-08-10 2022-09-09 华东交通大学 Fine-grained image identification method and device based on reliability evaluation and iterative learning
CN115035389B (en) * 2022-08-10 2022-10-25 华东交通大学 Fine-grained image identification method and device based on reliability evaluation and iterative learning
CN117058507A (en) * 2023-08-17 2023-11-14 浙江航天润博测控技术有限公司 Fourier convolution-based visible light and infrared image multi-scale feature fusion method
CN117058507B (en) * 2023-08-17 2024-03-19 浙江航天润博测控技术有限公司 Fourier convolution-based visible light and infrared image multi-scale feature fusion method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination