CN114494786A - Fine-grained image classification method based on multilayer coordination convolutional neural network - Google Patents

Fine-grained image classification method based on multilayer coordination convolutional neural network

Info

Publication number
CN114494786A
CN114494786A (application number CN202210141309.7A)
Authority
CN
China
Prior art keywords
image
sub
classified
fine
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210141309.7A
Other languages
Chinese (zh)
Inventor
李鸿健
何明轩
段小林
何旭
罗炼
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University of Post and Telecommunications
Original Assignee
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University of Post and Telecommunications filed Critical Chongqing University of Post and Telecommunications
Priority to CN202210141309.7A
Publication of CN114494786A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention relates to the field of deep learning and the field of image classification, in particular to a fine-grained image classification method based on a multilayer coordination convolutional neural network. The invention realizes the positioning and feature extraction of key regions in a fine-grained image classification task. Different convolution layers are trained with multi-scale cutting and filling of the image, so that the features of the shallow network and the deep network are fused; at the same time, the integrity of the picture is destroyed by disturbing local regions, and the original positions are recorded by position coding to reduce the noise caused by the disturbance.

Description

Fine-grained image classification method based on multilayer coordination convolutional neural network
Technical Field
The invention relates to the field of deep learning and the field of image classification, in particular to a fine-grained image classification method based on a multilayer coordination convolutional neural network.
Background
Traditional classification tasks generally refer to coarse classification, such as distinguishing cats from dogs. Such tasks are relatively easy because the distinguishing features are large. Fine-grained classification, a subtask of image classification, mainly studies subclasses of the same class. Taking dogs as an example, the husky and the Alaskan malamute look similar because of their close bloodlines, and the differences are limited to a few local regions, such as the color of the eyes or the color and shape of the hair on the forehead. Fine-grained recognition therefore focuses on local features, and since a single picture contains many local features, learning the useful ones among them is a difficult problem. Especially when pictures are few, the wrong features are easily learned, which is overfitting to the training set. Most recent work centers on attention mechanisms, so that the network pays more attention to local key regions and the classification accuracy is improved.
Fine-grained classification methods mainly fall into two categories. One is the classification model based on strong supervision, which needs additional information such as manually labeled object bounding boxes and local part annotation boxes besides class labels. For example, Wei X S et al use manually labeled part annotation points and classification labels in the local positioning module of the Mask-CNN algorithm during training, and Part R-CNN and Pose Normalized CNN use additional manually labeled information such as object annotations and part annotation points when extracting discriminative regions. A large amount of manual labeling is very expensive, so classification methods based on weakly supervised learning are the mainstream trend. The other is the classification model based on weak supervision, which relies only on class labels and does not use additional part annotation information. For example, Ge et al proposed such a classification model in 2019. The model uses CAM to obtain the key region of the picture through a classification model, then iteratively generates more appropriate target proposals using CRF refinement and an object detection method, selects the most suitable proposals with a complementary model selection algorithm, and finally extracts and classifies features with an LSTM; since the only label is the classification label of the picture, the model is a weakly supervised algorithm. In order to learn key regions and features, in addition to using a standard basic classification network, Chen et al proposed a Destruction and Construction Learning (DCL) method to improve fine-grained image recognition accuracy. Although the extracted features have certain discriminative ability, effectively extracting features of the key regions when only a category label is available remains challenging.
Disclosure of Invention
Based on the problems in the prior art, the invention provides a fine-grained image classification method based on a multilayer coordination convolutional neural network. The technical scheme of the invention is as follows:
acquiring an image data set, and preprocessing an image to be classified in the image data set;
extracting image features of the image to be classified by adopting a convolutional neural network, inputting the image features into a positioning subnet, and acquiring a positioning key region by utilizing the positioning subnet to obtain a key region subgraph of the image to be classified;
carrying out multi-scale cutting and filling on the subgraph in the key region, and randomly exchanging each image block to obtain a plurality of groups of cutting and filling subgraphs with different scales;
carrying out position coding on the image blocks in each group of the cut filler subgraphs, and connecting the corresponding position coding feature graphs with the cut filler subgraphs according to the channels;
sequentially inputting the cut filler sub-graphs of different scales into a first classification model of a preset training sub-network respectively to obtain probability values of corresponding classes;
and inputting the probability values of the corresponding classes of the cut filler subgraphs with different scales into a second classification model of a preset training subnet, and obtaining a fine-grained classification result of the image to be classified after weighted average.
Further, the obtaining of the key region sub-image of the image to be classified by using the positioning sub-network includes:
summing the extracted image features according to channels, and performing bilinear upsampling on the obtained summed feature map to obtain a saliency map with the same size as the image to be classified;
selecting the saliency map according to a self-adaptive threshold value to obtain a mask matrix, and mapping the mask matrix to a corresponding image to be classified to obtain a concerned part;
and performing bilinear interpolation upsampling on the concerned part to obtain a concerned image with the same size as the image to be classified, namely a key region subgraph of the image to be classified.
Further, the selecting the saliency map according to the adaptive threshold includes calculating the adaptive threshold according to an average value and a hyper-parameter of the saliency map, and determining a mask matrix according to a size relationship between the adaptive threshold and a corresponding matrix element in the saliency map, that is, when the adaptive threshold is larger than the corresponding matrix element in the saliency map, the corresponding matrix element in the mask matrix is 1, otherwise, the corresponding matrix element is 0.
Further, the performing multi-scale cutting and filling on the sub-images of the key region, and randomly exchanging each image block to obtain a plurality of groups of cutting and filling sub-images with different scales includes:
respectively cutting the key region subgraph into different NxN sub-images;
filling each sub-image with 0 under each cutting scale to obtain the filled sub-image IP_n; the set of filled sub-images is I_pad = {IP_n | 0 ≤ n < N²};
Randomly splicing the sub-images which are filled with 0 in each cutting scale into a new image to be classified according to the spatial position of the image to be classified;
the new image to be classified under each cutting scale is downsampled to respectively obtain cutting filler sub-images with the same size as the original image to be classified:
wherein IP_n denotes the image obtained after the n-th sub-image is filled, n denotes the index of a sub-image of the key region subgraph (i.e. the attention image), and N denotes the side length of an image block; I_pad{·} denotes the set of filled sub-images.
Further, the position coding of the image blocks in each group of cut filler subgraphs comprises inputting the sequence of the image blocks into a sine and cosine position coding function PSC, respectively, to obtain the position codes of the corresponding image blocks; and according to the filling granularity and the cutting granularity of the cutting filling subgraph, carrying out quantization processing on the position codes to obtain a position code characteristic graph.
Further, the sequentially inputting the cut filler sub-images with different scales into a preset first classification model respectively to obtain probability values of corresponding categories comprises inputting each cut filler sub-image into different residual convolution modules respectively to obtain weighted feature maps with different levels; all the weighted feature maps except the last stage are downsampled, and the downsampled weighted feature maps and the next-stage weighted feature map are spliced and fused; and taking the weighted feature map of the first level and the feature map after other splicing fusion as multi-level feature maps, and obtaining probability values of corresponding categories through a linear layer and a softmax classifier respectively.
Further, the step of inputting the probability values of the corresponding categories of the cut filler subgraphs of different scales into a preset second classification model, and after weighted averaging, obtaining a fine-grained classification result of the image to be classified comprises inputting key region subgraphs into a residual convolution module, performing feature fusion on feature graphs of the key region subgraphs and multi-level feature graphs respectively to form multi-level feature vectors, and performing layered bilinear pooling on every two feature vectors to obtain multi-level feature fusion feature vectors; and respectively passing the feature vectors through a linear layer and a softmax classifier to obtain probability values of corresponding classes.
The preset first classification model and the preset second classification model are obtained after being trained respectively, the training process comprises the steps of calculating an output class value and a loss value of a real label through a cross entropy loss function, calculating the gradient of the output class value and the loss value of the real label, carrying out back propagation, and continuously updating parameters through a gradient descent method.
The invention has the following advantages and beneficial effects:
1. the positioning subnet is used for outputting a bounding box of a target, the training subnet is used for training network parameters, and the positioning subnet and the training subnet share the network parameters, so that the size of an input image is consistent with that of a convolution feature extractor, and the image processing time can be reduced;
3. according to the invention, when the whole model is trained, different scale pictures can be used for training different convolution layers, so that the robustness of the network model can be improved, and meanwhile, the resnet50 is independently adapted to each scale picture, so that the time cost can be saved;
4. the image data set adopted in the training process only needs to classify the labels, and does not need manually labeled bounding boxes, so that the calculation consumption can be effectively reduced.
5. In the whole model of the invention, in the training process, the original data is input, the result is output, the middle neural network is integrated from the input end to the output end, the independent training of middle parameters is saved, and the problem of module connection dependence is prevented, so the model is an end-to-end model.
6. The method is based on multi-scale cutting and filling, the integrity is damaged by disturbing the picture pixels, meanwhile, the position coding is added, the original pixel position information of the pixels is retained, and finally the multilayer convolution network characteristics are fused, so that the precision is improved compared with the existing method based on characteristic fusion.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of a network architecture of the present invention;
FIG. 3 is a feature fusion diagram of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
The invention aims to extract local detail features and overall semantic features of fine-grained images on a classification task by utilizing different layers of a convolutional neural network, different receptive fields and different training scales to cut and fill the images, and finally different features are fused to improve classification accuracy.
As shown in fig. 1, a fine-grained classification method based on convolutional neural network multi-layer coordination according to the present invention includes the following steps:
s1, acquiring an image data set, and preprocessing an image to be classified in the image data set;
in the embodiment of the present invention, the image dataset may be known training set data and test set data, or may be an unknown image dataset to be classified, if the image dataset is known training set data, a corresponding class label needs to be labeled to an image in the training set, for example, the class label of the image a in the training set is dog-husky, and if the image dataset is an unknown image to be classified or test set image, the class of the image needs to be predicted through a subsequent neural network model.
In the embodiment of the present invention, the preprocessing of the image to be classified may include conventional image preprocessing methods such as clipping, spatial transformation, denoising, image enhancement, and the like, and the present invention is not specifically limited thereto, and those skilled in the art may select a corresponding method according to actual situations.
S2, extracting the image features of the image to be classified by adopting a convolutional neural network, and acquiring a positioning key region by utilizing a positioning subnet to obtain a key region sub-image of the image to be classified;
in the embodiment of the invention, an original image to be classified is input into a convolutional neural network, and feature extraction can be performed on the last convolutional layer or the last two convolutional layers, wherein the convolutional neural network can be a conventional convolutional neural network, a cyclic convolutional neural network, a residual convolutional neural network and the like, and the Resnet50 model is preferably used for extracting the image features of the image to be classified.
In the embodiment of the present invention, the obtaining of the key region sub-image of the image to be classified by using the positioning sub-network includes:
s21, summing the extracted image features according to the channels, and performing bilinear upsampling on the obtained summed feature map to obtain a saliency map with the same size as the image to be classified;
In the embodiment of the present invention, because the convolutional neural network can extract the image features of the image to be classified, the image features in this embodiment can be input into an LCU (Locate, Crop and Upsample) module, which locates and crops the image and upsamples it to the size of the original image, so as to force the network to pay more attention to a local region; this module serves as the positioning subnet in this embodiment. The extracted final two layers of feature maps can be taken together and summed according to channels, so that the saliency map is more accurate. The summation formula is expressed as:
M(x, y) = Σ_{z=1}^{D} F_z(x, y) + Σ_{n=1}^{Z} G_n(x, y)
the formula for obtaining the saliency map is:
S(x,y)=g(M(x,y))
wherein D and Z respectively denote the number of feature map channels of the last two layers, F_z(x, y) is the z-th feature map of the penultimate layer, G_n(x, y) denotes the n-th feature map of the last layer, M(x, y) is the channel-wise sum of F_z(x, y) and G_n(x, y), g(·) is the bilinear interpolation of M(x, y), S(x, y) denotes the saliency map, x denotes a row of the two-dimensional matrix, and y denotes a column of the two-dimensional matrix.
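As an illustrative, non-limiting sketch, the channel-wise summation and bilinear upsampling of S21 can be written in PyTorch roughly as follows; the names saliency_map, feat_penult and feat_last are assumptions introduced here, and the two feature maps are assumed to share the same spatial size:

```python
import torch
import torch.nn.functional as F

def saliency_map(feat_penult, feat_last, out_size):
    """Channel-wise sum of the last two feature maps, upsampled to image size.

    feat_penult: (B, D, h, w) feature maps of the penultimate layer (F_z)
    feat_last:   (B, Z, h, w) feature maps of the last layer (G_n)
    out_size:    (H, W) spatial size of the image to be classified
    Both feature maps are assumed to have the same spatial size (h, w).
    """
    # M(x, y): sum both feature maps over their channel dimensions
    m = feat_penult.sum(dim=1, keepdim=True) + feat_last.sum(dim=1, keepdim=True)
    # S(x, y) = g(M(x, y)): bilinear interpolation up to the input image size
    s = F.interpolate(m, size=out_size, mode="bilinear", align_corners=False)
    return s.squeeze(1)  # (B, H, W) saliency map
```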
S22, selecting the saliency map according to the self-adaptive threshold value to obtain a mask matrix, and mapping the mask matrix to the corresponding image to be classified to obtain a concerned part;
in the embodiment of the present invention, a hyper-parameter α may be set first, and an adaptive threshold θ may be obtained according to the average value of the saliency map and the hyper-parameter, where the adaptive threshold θ has a formula as follows:
θ=(1-α)·avg(S(x,y))
where α represents a hyper-parameter of attention to the key region, and avg (S (x, y)) represents an average value on the saliency map S (x, y).
In the embodiment of the present invention, the above attention refers to the degree of attention to the key region, and the larger the value is, the larger the degree of enlarging the detected object can be understood to be, and therefore 0< α < 1.
In this embodiment, the saliency map S (x, y) is selected by using the obtained adaptive threshold θ to obtain a Mask matrix Mask (i, j), and the Mask matrix is mapped to the original image to be classified, so that the attention portion can be obtained. The calculation formula of the Mask matrix Mask (i, j) is as follows:
Mask(i, j) = 1, if θ > S(i, j); Mask(i, j) = 0, otherwise
wherein i represents the row coordinate of the Mask matrix, j represents the column coordinate of the Mask matrix, when the adaptive threshold θ is greater than the corresponding matrix element S (i, j) in the saliency map, the corresponding matrix element Mask (i, j) in the Mask matrix takes 1, otherwise takes 0.
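For illustration only, the adaptive threshold and mask of S22 can be sketched in PyTorch as below; the function name attention_mask is an assumption, and the comparison direction follows the text as written:

```python
import torch

def attention_mask(saliency, alpha):
    """Adaptive threshold theta = (1 - alpha) * avg(S) and the resulting mask.

    saliency: (H, W) saliency map S(x, y); alpha is the attention
    hyper-parameter with 0 < alpha < 1.
    Mask(i, j) = 1 where theta > S(i, j), else 0, as stated above.
    """
    theta = (1.0 - alpha) * saliency.mean()
    return (theta > saliency).float()
```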
And S23, performing bilinear interpolation upsampling on the concerned part to obtain a concerned image with the same size as the image to be classified, namely a key region subgraph of the image to be classified.
In the embodiment of the invention, bilinear interpolation upsampling is used for the concerned part determined by the mask matrix to obtain the concerned image with the same size as the input image to be classified; and cutting the attention image into N × N sub-images, wherein the attention image cutting formula is as follows:
I_sub = f_c(ψ(I), N) = {IS_n | 0 ≤ n < N²}
wherein i denotes a row coordinate and j a column coordinate of the mask matrix, ψ(I) denotes the attention image obtained by upsampling the attended part, N is a constant, f_c(·) denotes the operation of cutting the image ψ(I) into N × N sub-images, I_sub denotes the set of sub-images after cutting, and IS_n denotes the n-th sub-image.
The positioning subnet is used to obtain the key region: a saliency map is computed and used to crop out the target, which reduces the influence of environmental factors other than the target on the network. Meanwhile, fine-grained classification often has to distinguish very detailed parts of the image, so the upsampling process forces the network to pay attention to these detailed parts, thereby improving recognition accuracy.
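A minimal sketch of the cropping, upsampling and cutting described above is given below in PyTorch; crop_and_upsample and cut_into_blocks are hypothetical helper names, and the mask is assumed to be non-empty with image side lengths divisible by N:

```python
import torch
import torch.nn.functional as F

def crop_and_upsample(image, mask):
    """Map the mask onto the image, crop the bounding box of the attended part,
    and upsample it back to the original size (the attention image psi(I)).

    image: (C, H, W) image to be classified; mask: (H, W) 0/1 matrix.
    """
    ys, xs = torch.nonzero(mask, as_tuple=True)          # attended coordinates
    y0, y1 = ys.min().item(), ys.max().item() + 1
    x0, x1 = xs.min().item(), xs.max().item() + 1
    part = image[:, y0:y1, x0:x1]                         # attended part
    psi = F.interpolate(part.unsqueeze(0), size=image.shape[1:],
                        mode="bilinear", align_corners=False)
    return psi.squeeze(0)                                 # same size as the input

def cut_into_blocks(psi, n):
    """f_c(psi(I), N): cut the attention image into N x N sub-images IS_n."""
    c, h, w = psi.shape
    bh, bw = h // n, w // n
    return [psi[:, r * bh:(r + 1) * bh, s * bw:(s + 1) * bw]
            for r in range(n) for s in range(n)]
```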
S3, performing multi-scale cutting and filling on the key region subgraph, and randomly exchanging each image block to obtain a plurality of groups of cutting and filling subgraphs with different scales;
in the embodiment of the invention, the key region subgraph, namely the key image, needs to be cut and filled respectively by adopting different scales, and under each scale, the image blocks corresponding to the scale need to be exchanged randomly, so that the cut and filled subgraphs with different scales are obtained.
Specifically, in some embodiments of the present invention, four scales are taken as an example, that is, three scales are required in addition to the scale {1} of the original image to be classified, which is assumed to be {2,4,8}, then the cut-and-fill process of the present invention may include:
S31: cutting the image of interest into N × N sub-images, wherein N ∈ {2, 4, 8};
S32: filling each sub-image with 0 under each cutting scale to obtain the filled sub-image IP_n; the set of filled sub-images is I_pad = {IP_n | 0 ≤ n < N²};
For example, taking N = 4 as an example, the image of interest may be cut into 4 × 4 = 16 sub-images, each sub-image actually being an image block; the 16 sub-images are respectively filled with 0, that is, 0 is filled around each sub-image to separate the different image blocks, so as to focus on the fine-grained information of each image block.
S33: randomly splicing the sub-images which are filled with 0 in each cutting scale into a new image to be classified according to the spatial position of the image to be classified;
in this embodiment, according to the spatial position of the original image to be classified, the sub-images after 0 padding under each cutting scale may be randomly spliced into a new image to be classified, similarly, taking N equal to 4 as an example, the 16 image blocks are randomly scrambled, and the image to be classified is re-spliced according to the spatial position (4 × 4) of the original image to be classified, it can be found that, since the sub-images are padded with 0, each image block is expanded, and the combined size of the 16 image blocks will be larger than that of the original image to be classified.
S34: carrying out down-sampling on the new image to be classified under each cutting scale to respectively obtain cutting filler sub-images with the same size as the original image to be classified;
and downsampling the newly spliced images to be classified, removing redundant combined sizes and enabling the sizes of the new images to be classified to be consistent with those of the original images to be classified.
In the embodiment of the present invention, the focused image is segmented into N × N sub-images, and the focused image segmentation formula is:
I_sub = f_c(ψ(I), N) = {IS_n | 0 ≤ n < N²}
wherein i denotes a row coordinate and j a column coordinate of the mask matrix, ψ(I) denotes the attention image obtained by upsampling the attended part, N is a constant, f_c(·) denotes the operation of cutting the image ψ(I) into N × N sub-images, I_sub denotes the set of sub-images after cutting, and IS_n denotes the n-th sub-image.
In the embodiment of the invention, each sub-image in the cut sub-image set I_sub is filled with 0 of size P, the filled images are spliced according to their original spatial positions, and the spliced image is sampled to the same size as the original image; the formula for 0-filling a sub-image is:
IP_n = f_p(IS_n, P)
the formula of image stitching is:
I_new = f_s({IP_n | 0 ≤ n < N²})
wherein f_p(·) denotes filling the sub-image IS_n with 0 of size P, I_pad = {IP_n | 0 ≤ n < N², N = 2, 4, 8, …} denotes the set of padded sub-images, IP_n denotes the image obtained after the n-th sub-image is filled, and f_s denotes stitching all the filled sub-images into a new image of the same size as the original image, named the filled image.
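A rough PyTorch sketch of one scale of this cut-pad-shuffle-stitch-downsample pipeline is shown below; cut_pad_shuffle is a hypothetical name, and the side lengths of ψ(I) are assumed to be divisible by n:

```python
import random
import torch
import torch.nn.functional as F

def cut_pad_shuffle(psi, n, pad):
    """Cut psi(I) into n x n blocks, zero-pad each block (f_p), randomly
    exchange the blocks, stitch them back on the original grid (f_s) and
    downsample to the input size, giving one cut-and-fill sub-image.

    psi: (C, H, W) attention image; n in {2, 4, 8}; pad: fill size P.
    """
    c, h, w = psi.shape
    bh, bw = h // n, w // n
    blocks = [psi[:, r * bh:(r + 1) * bh, s * bw:(s + 1) * bw]
              for r in range(n) for s in range(n)]              # IS_n
    padded = [F.pad(b, (pad, pad, pad, pad)) for b in blocks]   # IP_n = f_p(IS_n, P)
    random.shuffle(padded)                                      # random block exchange
    rows = [torch.cat(padded[r * n:(r + 1) * n], dim=2) for r in range(n)]
    stitched = torch.cat(rows, dim=1)                           # f_s: larger stitched image
    out = F.interpolate(stitched.unsqueeze(0), size=(h, w),
                        mode="bilinear", align_corners=False)   # downsample to input size
    return out.squeeze(0)
```

Running this for n = 2, 4 and 8 would yield the groups of cut filler sub-images of different scales used in S5.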
S4, carrying out position coding on the image blocks in each group of cut filler subgraphs, and connecting the corresponding position coding feature graphs with the cut filler subgraphs according to channels;
in the embodiment of the invention, the position coding of the image blocks in each group of the cut filler subgraphs comprises the steps of respectively inputting the sequence of the image blocks into a sine and cosine position coding function PSC to obtain the position codes of the corresponding image blocks; and according to the filling granularity and the cutting granularity of the cutting filling subgraph, carrying out quantization processing on the position codes to obtain a position code characteristic graph.
The step of embedding the position code comprises:
s41, respectively inputting the sequence of the patch into a sine and cosine position coding function PSC to obtain a position code corresponding to the patch;
Pem=PSC(idx)
wherein the PSC can be expressed as:
P(t, 2i) = sin(ω_i · t), P(t, 2i + 1) = cos(ω_i · t)
where P denotes position coding, dim denotes the current input picture dimension, ω denotes angular frequency, and t denotes the position sequence. Where the angular frequency ω can be expressed as:
ω_i = 1 / 10000^(2i / dim)
s42, generating a position coding feature map; wherein the obtained position-coding feature map RE can be expressed as:
RE_xy[i][j] = P_xy, 0 ≤ x, y ≤ cut − 1, 0 ≤ i, j < N
wherein RE_xy[i][j] represents the position code at pixel coordinate (i, j) in the (x, y)-th image block; x is the abscissa of the image block in the cut filler subgraph and y is the ordinate of the image block in the cut filler subgraph, so (x, y) represents the arrangement position of the image block in the cut filler subgraph; i is the abscissa of a pixel point within the image block, j is the ordinate of that pixel point, and (i, j) represents the position of the pixel within the image block; P_xy represents the position code of the (x, y)-th image block, cut is the cutting granularity, and pad is the filling granularity, where x and y satisfy {0 ≤ x, y ≤ cut − 1, x, y ∈ N*}; N is the side length of each image block (patch), which can be expressed as:
N = width / cut
where width is the side length of the picture.
S43, connecting the position coding feature map RE with the cut filler subgraph according to the channel to obtain a feature map with position coding:
R = concat_c(RE, R_m)
wherein concat_c denotes channel-wise feature concatenation, m denotes the scale, and m = 1, 2, 4, 8, ….
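To make the embedding step concrete, a hedged PyTorch sketch follows; the helper names block_position_codes, position_feature_map and concat_position are assumptions, and broadcasting each block's code over its pixels is a reconstruction rather than the patent's exact quantization:

```python
import math
import torch

def block_position_codes(cut, dim):
    """Sine/cosine position code P_xy for each of the cut x cut blocks,
    indexed in row-major order t, with omega_i = 1 / 10000**(2i/dim)."""
    codes = torch.zeros(cut * cut, dim)
    for t in range(cut * cut):
        for i in range(0, dim, 2):
            omega = 1.0 / (10000 ** (i / dim))
            codes[t, i] = math.sin(omega * t)
            if i + 1 < dim:
                codes[t, i + 1] = math.cos(omega * t)
    return codes                                         # (cut*cut, dim)

def position_feature_map(cut, block_side):
    """RE: broadcast each block's (scalar) code over its pixels to form a
    one-channel position-encoding map aligned with the cut filler subgraph."""
    codes = block_position_codes(cut, dim=2)[:, 0]       # one value per block
    re = codes.view(cut, cut, 1, 1).expand(cut, cut, block_side, block_side)
    re = re.permute(0, 2, 1, 3).contiguous().view(1, cut * block_side,
                                                  cut * block_side)
    return re                                            # (1, H, W)

def concat_position(re, r_m):
    """R = concat_c(RE, R_m): concatenate along the channel dimension."""
    return torch.cat([re, r_m], dim=0)                   # (C+1, H, W)
```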
The invention destroys the integrity of the image by disturbing the pixel blocks of the picture and separating the blocks with 0 padding, which reduces the connection between the pixel blocks and forces the network to pay more attention to local regions rather than the whole. However, disturbing the pixel blocks distorts the original image, so pixel-level position coding is embedded and the original position information is retained as the picture is propagated through the network.
S5, sequentially inputting the cut filler sub-graphs of different scales into a first classification model of a preset training sub-network respectively to obtain probability values of corresponding classes;
in the embodiment of the invention, the first classification model is mainly used for classifying the cut filler subgraphs with different scales, the first classification model is mainly a plurality of residual convolution modules, and each cut filler subgraph in the invention can be respectively input into different residual convolution modules to obtain weighted feature graphs with different levels; all the weighted feature maps except the last stage are downsampled, and the downsampled weighted feature maps and the next-stage weighted feature map are spliced and fused; and taking the weighted feature map of the first level and the feature map after other splicing fusion as multi-level feature maps, and obtaining probability values of corresponding categories through a linear layer and a softmax classifier respectively.
In a preferred embodiment of the present invention, the first classification model of this embodiment adopts a resnet50_conv_x model, where resnet50_conv_x includes resnet50_conv_3 and resnet50_conv_4, respectively representing the first three convolution blocks and the first four convolution blocks of the resnet50 network.
Taking the first classification model in the above preferred embodiment as an example, in this embodiment, the process of respectively inputting the cut filler subgraphs of different scales into the preset first classification model to obtain the probability values of the corresponding classes may include:
S51: the cut filler subgraph R_8 from S3 is input into resnet50_conv_3 to obtain the weighted feature map F1; the cut filler subgraph R_4 from S3 is input into resnet50_conv_4 to obtain the weighted feature map F2; the cut filler subgraph R_2 from S3 is input into resnet50_conv_4 to obtain the weighted feature map F3;
S52, reducing the sizes of the feature graphs F1 and F2 by half through a convolution operation with a convolution kernel of 3 x 3 and a step size of 2 x 2 to obtain down-sampled feature graphs F1_ ds and F2_ ds;
s53, keeping the F1 feature unchanged, splicing and fusing the feature maps F1_ ds and F2 and F2_ ds and F3 through a channel to obtain fused feature maps F1_ F2 and F2_ F3;
S54, passing the feature maps F1, F1_F2 and F2_F3 through the linear layer and the softmax classifier respectively to obtain the probability values of the corresponding categories.
In order to meet the requirement of value range, a Relu activation function formula is required to be used in the last layer of the fully-connected network:
Relu=max(0,x)
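By way of a non-authoritative sketch, the multi-level fusion head of S52-S54 might look as follows in PyTorch; the channel sizes, the use of global average pooling before the linear layer, and the class name MultiLevelHead are assumptions not specified by the patent:

```python
import torch
import torch.nn as nn

class MultiLevelHead(nn.Module):
    """Downsample F1 and F2 with 3x3 stride-2 convolutions, concatenate them
    channel-wise with the next-level maps, then classify each branch with a
    linear layer and softmax (channel counts c1, c2, c3 are assumptions)."""

    def __init__(self, c1, c2, c3, num_classes):
        super().__init__()
        self.down1 = nn.Conv2d(c1, c1, kernel_size=3, stride=2, padding=1)
        self.down2 = nn.Conv2d(c2, c2, kernel_size=3, stride=2, padding=1)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc1 = nn.Linear(c1, num_classes)
        self.fc12 = nn.Linear(c1 + c2, num_classes)
        self.fc23 = nn.Linear(c2 + c3, num_classes)

    def forward(self, f1, f2, f3):
        f1_ds = self.down1(f1)                  # F1 downsampled to F2's size
        f2_ds = self.down2(f2)                  # F2 downsampled to F3's size
        f1_f2 = torch.cat([f1_ds, f2], dim=1)   # F1_F2
        f2_f3 = torch.cat([f2_ds, f3], dim=1)   # F2_F3
        logits = [self.fc1(self.pool(f1).flatten(1)),
                  self.fc12(self.pool(f1_f2).flatten(1)),
                  self.fc23(self.pool(f2_f3).flatten(1))]
        return [torch.softmax(z, dim=1) for z in logits]
```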
In the embodiment of the present invention, the preset first classification model may be an already trained first classification model or a first classification model that is yet to be trained; it is obtained after training, and the training process includes calculating the loss value between the output class value and the real label through a cross entropy loss function, calculating its gradient, performing back propagation, and continuously updating the parameters through the gradient descent method.
Training should use the gradient descent algorithm, and the loss function uses the logistic regression loss formula:

Loss = −(1/n) · Σ_{i=1}^{n} [ y_i · log(ŷ_i) + (1 − y_i) · log(1 − ŷ_i) ]

wherein n is the number of pictures in a batch, y is the correct label, ŷ is the output of the network, and r is the Relu function used in its last layer.
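For completeness, a minimal sketch of that loss and one update step is given below; the exact form of the patent's loss image is not reproduced, so this standard logistic-regression loss is an assumption consistent with the surrounding description:

```python
import torch

def logistic_loss(y_hat, y):
    """Loss = -(1/n) * sum(y*log(y_hat) + (1-y)*log(1-y_hat)) over a batch."""
    eps = 1e-7
    y_hat = y_hat.clamp(eps, 1 - eps)
    return -(y * y_hat.log() + (1 - y) * (1 - y_hat).log()).mean()

# One gradient-descent update step (the optimizer choice is an assumption):
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
# loss = logistic_loss(model(x), y); loss.backward(); optimizer.step()
```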
S6, inputting the probability values of the corresponding classes of the cut filler subgraphs with different scales into a second classification model of a preset training subnet, and obtaining a fine-grained classification result of the image to be classified after weighted averaging.
In the embodiment of the invention, key region subgraphs are input into a residual convolution module, the feature graphs of the key region subgraphs are respectively subjected to feature fusion with multi-level feature graphs to form multi-level feature vectors, and the feature vectors are subjected to layered bilinear pooling pairwise to obtain multi-level feature fusion feature vectors; and respectively passing the feature vectors through a linear layer and a softmax classifier to obtain probability values of corresponding classes.
In a preferred embodiment of the present invention, the second classification model of this embodiment adopts a resnet50_conv_x model, where resnet50_conv_x includes resnet50_conv_5, representing the first five convolution blocks of the resnet50 network.
Taking the second classification model in the above preferred embodiment as an example, in this embodiment, the process of inputting the probability values of the corresponding classes of the cut filler subgraphs with different scales into the preset second classification model, and obtaining the fine-grained classification result of the image to be classified after weighted averaging may include:
S61, inputting the original image R from S3 into resnet50_conv_5, and obtaining several groups of feature vectors F_a, F_b, F_c through feature fusion;
S62, setting the parameters α, β and γ as the feature weight parameters of F_a, F_b and F_c respectively, and performing layered bilinear pooling on the feature vectors pairwise to obtain the multi-layer feature fusion vector F_zbp;
Wherein the layered bilinear pooling structure is:
Z_HBP = HBP(F_a, F_b, F_c, …)
wherein HBP(F_a, F_b, F_c) is calculated as follows:
HBP(F_a, F_b, F_c) = P^T · concat(α · U^T F_a * V^T F_b, β · U^T F_a * S^T F_c, γ · V^T F_b * S^T F_c)
where P is the classification matrix, U, V and S are the projection matrices of the convolution features F_a, F_b and F_c, and α, β, γ are respectively the feature weight parameters of F_a, F_b and F_c.
S63, passing the feature vector F_zbp through the linear layer and the softmax classifier respectively to obtain the probability values of the corresponding categories.
In order to meet the requirement of value range, a Relu activation function formula is required to be used in the last layer of the fully-connected network:
Relu=max(0,x)
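A hedged sketch of the layered bilinear pooling used above is given below; representing the projections U, V, S as linear layers on pooled feature vectors, and the dimensions, are assumptions for illustration:

```python
import torch
import torch.nn as nn

class HierarchicalBilinearPooling(nn.Module):
    """HBP(F_a, F_b, F_c): project each feature with U, V, S, combine the
    projections pairwise by weighted element-wise products, concatenate and
    map with the classification matrix P (here a linear layer)."""

    def __init__(self, in_dim, proj_dim, out_dim, alpha=1.0, beta=1.0, gamma=1.0):
        super().__init__()
        self.U = nn.Linear(in_dim, proj_dim, bias=False)
        self.V = nn.Linear(in_dim, proj_dim, bias=False)
        self.S = nn.Linear(in_dim, proj_dim, bias=False)
        self.P = nn.Linear(3 * proj_dim, out_dim, bias=False)
        self.alpha, self.beta, self.gamma = alpha, beta, gamma

    def forward(self, fa, fb, fc):
        ua, vb, sc = self.U(fa), self.V(fb), self.S(fc)   # U^T F_a, V^T F_b, S^T F_c
        z = torch.cat([self.alpha * ua * vb,
                       self.beta * ua * sc,
                       self.gamma * vb * sc], dim=1)
        return self.P(z)                                  # Z_HBP
```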
In the embodiment of the present invention, the preset second classification model may likewise be an already trained model or a model that is yet to be trained; it is obtained after training, and the training process includes calculating the loss value between the output class value and the real label through a cross entropy loss function, calculating its gradient, performing back propagation, and continuously updating the parameters through the gradient descent method.
Training should use the gradient descent algorithm, and the loss function uses the logistic regression loss formula:

Loss = −(1/n) · Σ_{i=1}^{n} [ y_i · log(ŷ_i) + (1 − y_i) · log(1 − ŷ_i) ]

wherein n is the number of pictures in a batch, y is the correct label, ŷ is the output of the network, and r is the Relu function used in its last layer.
In a preferred embodiment of the present invention, the first classification model and the second classification model may use the same resnet50_conv_x model and share parameters. It can be seen that the accuracy of the first classification model affects the accuracy of the second classification model, that is, the classification results of the cut filler subgraphs of different scales affect the fine-grained classification result of the image to be classified. Therefore, the invention may also use scale-related weights for back propagation, which not only learns scale-related fine-grained information but also greatly alleviates problems encountered in image classification such as uneven image data quality and too small data volume.
Fig. 2 is a network structure diagram of the present invention, and as shown in fig. 2, in the embodiment of the present invention, an original picture is input into a network in the first step to obtain a feature map. And secondly, inputting the characteristic diagram into an LCU module, positioning key areas of the image, cutting and upsampling. And thirdly, cutting and filling the picture in multiple scales and carrying out position coding. And fourthly, respectively embedding position codes into the cut and filled pictures. And fifthly, respectively inputting the multi-scale correspondence into the convolutional layers for training to obtain loss values, and reversely propagating and updating the network parameters. And sixthly, fusing the feature vectors of different layers, inputting a softmax layer to obtain a predicted value, comparing the predicted value with a real class label to calculate a loss value, and reversely propagating to continuously update the network parameters. And finally, in the testing process, only an original image needs to be input into the trained network to obtain a characteristic vector, and a softmax layer is input to obtain a final prediction category.
Fig. 3 shows a feature fusion map of the present invention. As shown in fig. 3, the output P1 of stage3 (length and width 56) is passed through a convolution layer with a 3 × 3 kernel to obtain a 1024 × 28 × 28 feature map, and the feature maps output by P1 and stage4 are fused along the channel to obtain a fused feature of size 2048 × 28 × 28. The output P2 of stage4 is obtained through a convolution layer with a 3 × 3 kernel and a stride of 2, and the feature maps output by P2 and stage5 are fused along the channel to obtain a fused feature of size 3072 × 14 × 14. Finally, with a 3 × 3 convolution kernel and a stride of 2, a fused feature map of size 2048 × 7 × 7 is obtained as the downsampled output P3.
In the description of the present invention, it is to be understood that the terms "coaxial", "bottom", "one end", "top", "middle", "other end", "upper", "one side", "top", "inner", "outer", "front", "center", "both ends", and the like, indicate orientations or positional relationships based on those shown in the drawings, and are only for convenience of description and simplicity of description, and do not indicate or imply that the devices or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus, are not to be construed as limiting the present invention.
In the present invention, unless otherwise expressly stated or limited, the terms "mounted," "disposed," "connected," "fixed," "rotated," and the like are to be construed broadly, e.g., as meaning fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; the terms may be directly connected or indirectly connected through an intermediate, and may be communication between two elements or interaction relationship between two elements, unless otherwise specifically limited, and the specific meaning of the terms in the present invention will be understood by those skilled in the art according to specific situations.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (9)

1. A fine-grained image classification method based on a multilayer coordinated convolutional neural network is characterized by comprising the following steps:
acquiring an image data set, and preprocessing an image to be classified in the image data set;
extracting image features of the image to be classified by adopting a convolutional neural network, and acquiring a positioning key region by utilizing a positioning subnet to obtain a key region sub-image of the image to be classified;
carrying out multi-scale cutting and filling on the subgraph in the key region, and randomly exchanging each image block to obtain a plurality of groups of cutting and filling subgraphs with different scales;
carrying out position coding on the image blocks in each group of the cut filler subgraphs, and connecting the corresponding position coding feature graphs with the cut filler subgraphs according to the channels;
sequentially inputting the cut filler sub-graphs of different scales into a first classification model of a preset training sub-network respectively to obtain probability values of corresponding classes;
and inputting the probability values of the corresponding classes of the cut filler subgraphs with different scales into a second classification model of a preset training subnet, and obtaining a fine-grained classification result of the image to be classified after weighted average.
2. The fine-grained image classification method based on the multilayer coordination convolutional neural network as claimed in claim 1, wherein the step of acquiring a positioning key region by using a positioning subnet to obtain a key region subgraph of the image to be classified comprises the following steps:
summing the extracted image features according to channels, and performing bilinear upsampling on the obtained summed feature map to obtain a saliency map with the same size as the image to be classified;
selecting the saliency map according to a self-adaptive threshold value to obtain a mask matrix, and mapping the mask matrix to a corresponding image to be classified to obtain a concerned part;
and performing bilinear interpolation upsampling on the concerned part to obtain a concerned image with the same size as the image to be classified, namely a key region subgraph of the image to be classified.
3. The fine-grained image classification method based on the multilayer coordinated convolutional neural network as claimed in claim 2, wherein the selecting the saliency map according to the adaptive threshold comprises calculating the adaptive threshold according to the average value and the hyper-parameter of the saliency map, and determining a mask matrix according to the size relationship between the adaptive threshold and the corresponding matrix element in the saliency map, namely when the adaptive threshold is larger than the corresponding matrix element in the saliency map, the corresponding matrix element in the mask matrix is taken as 1, otherwise, the corresponding matrix element is taken as 0.
4. The fine-grained image classification method based on the multilayer coordinated convolutional neural network according to claim 3, wherein the adaptive threshold value is calculated by the formula:
θ=(1-α)·avg(S(x,y))
wherein θ represents an adaptive threshold; α represents a hyper-parameter of attention to the critical area, avg (S (x, y)) represents an average over the saliency map S (x, y), and (x, y) represents a matrix element.
5. The fine-grained image classification method based on the multilayer coordinated convolutional neural network according to claim 1, wherein the step of performing multi-scale cut filling on the sub-images of the key region and randomly exchanging each image block to obtain a plurality of groups of cut filled sub-images with different scales comprises the following steps:
respectively cutting the key region subgraph into different NxN sub-images;
for each cutting scaleFilling 0 in the sub-image to obtain the sub-image IP after fillingnThe filled sub-image set is Ipad{IPn|0≤n<N2};
Randomly splicing the sub-images which are filled with 0 in each cutting scale into a new image to be classified according to the spatial position of the image to be classified;
carrying out down-sampling on the new image to be classified under each cutting scale to respectively obtain cutting filler sub-images with the same size as the original image to be classified;
wherein IP_n denotes the image obtained after the n-th sub-image is filled, n denotes the index of a sub-image of the key region subgraph (i.e. the attention image), and N denotes the side length of an image block; I_pad{·} denotes the set of filled sub-images.
6. The fine-grained image classification method based on the multi-layer coordinated convolutional neural network as claimed in claim 1, wherein the position coding of the image blocks in each group of cut filler subgraphs comprises inputting the sequence of the image blocks into a sine and cosine position coding function (PSC) respectively to obtain the position codes of the corresponding image blocks; according to the filling granularity and the cutting granularity of the cutting filling subgraph, carrying out quantization processing on the position codes to obtain a position code characteristic graph; wherein, the obtained position coding characteristic diagram RE is expressed as:
RE_xy[i][j] = P_xy, 0 ≤ x, y ≤ cut − 1, 0 ≤ i, j < N
wherein RE_xy[i][j] represents the position code at pixel coordinate (i, j) in the (x, y)-th image block; x is the abscissa and y the ordinate of the image block in the cut filler subgraph; i is the abscissa and j the ordinate of the pixel point within the image block; P_xy represents the position code of the (x, y)-th image block, cut is the cutting granularity, pad is the filling granularity, and {0 ≤ x, y ≤ cut − 1, x, y ∈ N*}; N is the side length of each image block, and is expressed as:
N = width / cut
width is the side length of the picture.
7. The fine-grained image classification method based on the multilayer coordinated convolutional neural network as claimed in claim 1, wherein the cut filler sub-graphs of different scales are sequentially and respectively input into a first classification model of a preset training sub-network, and the obtaining of the probability value of the corresponding class comprises the steps that each cut filler sub-graph is respectively input into different residual convolution modules, so that weighted feature graphs of different levels are obtained; all the weighted feature maps except the last stage are downsampled, and the downsampled weighted feature maps and the next-stage weighted feature map are spliced and fused; and taking the weighted feature map of the first level and other spliced and fused feature maps as multi-level feature maps, and obtaining probability values of corresponding categories through a linear layer and a softmax classifier respectively.
8. The fine-grained image classification method based on the multilayer coordinated convolutional neural network as claimed in claim 7, wherein the step of inputting the probability values of the corresponding classes of the cut filler subgraphs of different scales into a second classification model of a preset training subnet, and obtaining the fine-grained classification result of the image to be classified after weighted averaging comprises the steps of inputting key region subgraphs into a residual convolution module, performing feature fusion on feature graphs of the key region subgraphs and multilevel feature graphs respectively to form multilevel feature vectors, and performing hierarchical bilinear pooling on every two feature vectors to obtain multilayer feature fusion feature vectors; and respectively passing the feature vectors through a linear layer and a softmax classifier to obtain probability values of corresponding classes.
9. The fine-grained image classification method based on the multilayer coordinated convolutional neural network as claimed in claim 1, wherein the preset first classification model and the preset second classification model are obtained after being trained respectively, and the training process comprises calculating an output class value and a loss value of a real label through a cross entropy loss function, calculating the gradient of the output class value and the loss value of the real label, performing back propagation, and continuously updating parameters through a gradient descent method.
CN202210141309.7A 2022-02-16 2022-02-16 Fine-grained image classification method based on multilayer coordination convolutional neural network Pending CN114494786A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210141309.7A CN114494786A (en) 2022-02-16 2022-02-16 Fine-grained image classification method based on multilayer coordination convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210141309.7A CN114494786A (en) 2022-02-16 2022-02-16 Fine-grained image classification method based on multilayer coordination convolutional neural network

Publications (1)

Publication Number Publication Date
CN114494786A true CN114494786A (en) 2022-05-13

Family

ID=81479998

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210141309.7A Pending CN114494786A (en) 2022-02-16 2022-02-16 Fine-grained image classification method based on multilayer coordination convolutional neural network

Country Status (1)

Country Link
CN (1) CN114494786A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117011718A (en) * 2023-10-08 2023-11-07 之江实验室 Plant leaf fine granularity identification method and system based on multiple loss fusion
CN117036869A (en) * 2023-10-08 2023-11-10 之江实验室 Model training method and device based on diversity and random strategy
CN117036869B (en) * 2023-10-08 2024-01-09 之江实验室 Model training method and device based on diversity and random strategy
CN117011718B (en) * 2023-10-08 2024-02-02 之江实验室 Plant leaf fine granularity identification method and system based on multiple loss fusion

Similar Documents

Publication Publication Date Title
CN111126453B (en) Fine-grained image classification method and system based on attention mechanism and cut filling
CN108256562B (en) Salient target detection method and system based on weak supervision time-space cascade neural network
CN111768432A (en) Moving target segmentation method and system based on twin deep neural network
CN112560831B (en) Pedestrian attribute identification method based on multi-scale space correction
CN113255659B (en) License plate correction detection and identification method based on MSAFF-yolk 3
US20240062426A1 (en) Processing images using self-attention based neural networks
CN116645592B (en) Crack detection method based on image processing and storage medium
CN114494786A (en) Fine-grained image classification method based on multilayer coordination convolutional neural network
CN115496928B (en) Multi-modal image feature matching method based on multi-feature matching
CN112861970B (en) Fine-grained image classification method based on feature fusion
CN117253154B (en) Container weak and small serial number target detection and identification method based on deep learning
CN116824307B (en) Image labeling method and device based on SAM model and related medium
CN114048822A (en) Attention mechanism feature fusion segmentation method for image
CN116129291A (en) Unmanned aerial vehicle animal husbandry-oriented image target recognition method and device
CN115546466A (en) Weak supervision image target positioning method based on multi-scale significant feature fusion
Zuo et al. A remote sensing image semantic segmentation method by combining deformable convolution with conditional random fields
CN114445620A (en) Target segmentation method for improving Mask R-CNN
CN112365451A (en) Method, device and equipment for determining image quality grade and computer readable medium
CN112232102A (en) Building target identification method and system based on deep neural network and multitask learning
CN116311218A (en) Noise plant point cloud semantic segmentation method and system based on self-attention feature fusion
CN114926691A (en) Insect pest intelligent identification method and system based on convolutional neural network
CN115205624A (en) Cross-dimension attention-convergence cloud and snow identification method and equipment and storage medium
CN114170460A (en) Multi-mode fusion-based artwork classification method and system
CN113450313A (en) Image significance visualization method based on regional contrast learning
CN116258970B (en) Geographic element identification method integrating remote sensing image and point cloud data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination