CN111652236A - Lightweight fine-grained image identification method for cross-layer feature interaction in weak supervision scene - Google Patents


Info

Publication number
CN111652236A
CN111652236A (application CN202010505152.2A)
Authority
CN
China
Prior art keywords
feature
convolution
layer
image
tensor
Prior art date
Legal status
Granted
Application number
CN202010505152.2A
Other languages
Chinese (zh)
Other versions
CN111652236B (en)
Inventor
李春国 (Li Chunguo)
刘杨 (Liu Yang)
杨哲 (Yang Zhe)
胡健 (Hu Jian)
杨绿溪 (Yang Luxi)
徐琴珍 (Xu Qinzhen)
Current Assignee
Southeast University
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University
Publication of CN111652236A
Application granted
Publication of CN111652236B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/32: Normalisation of the pattern dimensions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The method replaces conventional convolution with multi-layer aggregated group convolution to construct a novel residual module, which is embedded directly into a deep residual network framework to obtain a lightweight backbone. Interactions between features are then modeled by computationally efficient, low-rank approximate polynomial kernel pooling, which compresses the dimension of the feature description vector and reduces the storage footprint and computational cost of the classification fully-connected layer; at the same time, this pooling scheme gives a linear classifier discrimination ability equivalent to that of a high-order polynomial kernel classifier, markedly improving recognition accuracy. Finally, a cross-layer feature interaction network framework fuses diverse features, strengthening feature learning and expression while reducing the risk of overfitting. The comprehensive performance of this lightweight, cross-layer feature interaction method for fine-grained image recognition in weakly supervised scenes is currently at the leading level in recognition accuracy, computational complexity, and technical feasibility.

Description

Lightweight fine-grained image identification method for cross-layer feature interaction in weak supervision scene
Technical Field
The invention belongs to the field of computer vision. It concerns a fine-grained image recognition method that uses only weak supervision from image-level labels and combines low-rank approximate polynomial kernel pooling with a cross-layer feature interaction network framework; in particular, it relates to a lightweight fine-grained image recognition method with cross-layer feature interaction in weakly supervised scenes.
Background
With the rapid development of Internet technology, human society has entered the information age, and the total amount of data stored in networks as text, images, speech, video and other media has grown exponentially. Image data is vivid and intuitive, unrestricted by region or language, and has gradually become a mainstream information carrier with broad application prospects and practical research significance. Meanwhile, advances in parallel computing theory and hardware have made large-scale image processing feasible, spurring research across computer vision, including image recognition, object detection and semantic segmentation. Image recognition is a fundamental research topic in computer vision: an acquired image is preprocessed, feature information is extracted, and a classifier built on those features determines the category of the target in the image. In conventional image recognition the target categories are usually coarse-grained, such as pedestrians, cats and dogs, or vehicles. Such cross-species targets show significant appearance differences and no subordinate relations, so they are relatively easy to recognize. In many real applications, however, the target to be identified belongs to a fine-grained category, i.e. to one of several subclasses under a coarse-grained category, such as different varieties of flowers or different models of automobiles. Compared with coarse-grained recognition, targets of different subclasses in a fine-grained dataset look highly similar, while targets of the same subclass can differ markedly in appearance owing to pose, viewpoint, occlusion and other factors.
Deep-learning-based image recognition, which relies on massive data and artificial neural networks that autonomously learn high-level semantic image features, can describe image information from multiple angles and at multiple levels, is highly robust, and has drawn wide attention in academia and industry. Many deep learning models have been built and applied to fine-grained image recognition, yielding preliminary research results. According to the strength of the supervision used during training, deep-learning-based fine-grained image recognition divides into strongly supervised and weakly supervised approaches. Strongly supervised algorithms achieve high-accuracy recognition by introducing additional supervision and complex auxiliary detection models, but the manually annotated supervision is expensive to obtain, which limits deployment in large-scale real scenes. Weakly supervised fine-grained recognition, by contrast, accurately determines the target category using only image-level labels during training; its practicality and scalability have made it the mainstream direction of current fine-grained recognition research. The weakly supervised Bilinear CNN extracts image features with two mutually independent backbone networks and captures pairwise correlations between feature channels through a matrix outer product, obtaining second-order statistics of the convolution features so that a linear classifier has the same discrimination ability as a second-order polynomial kernel classifier (see T.-Y. Lin, A. RoyChowdhury, S. Maji. Bilinear CNN Models for Fine-Grained Visual Recognition, 2015). Improved B-CNN applies square-root normalization to the bilinear feature description matrix to compress the dynamic range of the feature values and further stabilizes the model with measures such as L2 regularization (see T.-Y. Lin, S. Maji. Improved Bilinear Pooling with CNNs, 2017). Boost-CNN borrows the idea of ensemble learning to combine several weak Bilinear CNN classifiers by boosting, solving a least-squares objective to determine the weight coefficient of each base learner and thereby build a strong classifier (see M. Moghimi et al. Boosted Convolutional Neural Networks, 2016). CBP fits a second-order polynomial kernel with two approximation algorithms, Random Maclaurin (RM) and Tensor Sketch (TS), so that an 8192-dimensional TS feature has the same expressive power as the 262K-dimensional bilinear feature (see Y. Gao et al. Compact Bilinear Pooling, 2016). However, because information is lost during forward propagation through a convolutional neural network, Bilinear CNN and its variants perform bilinear pooling only on the top convolutional activation of the deep network; features from a single convolution layer do not adequately describe the semantics of all key regions of an image, and taking them directly as the reference features can discard discriminative information that matters for fine-grained recognition.
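For concreteness, the pairwise channel-correlation (outer-product) pooling that Bilinear CNN performs can be sketched in a few lines of PyTorch; this is an illustrative sketch, not code from the cited papers:

```python
import torch
import torch.nn.functional as F

def bilinear_pool(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Bilinear pooling of two convolutional feature maps of shape (B, C, H, W).

    Returns a (B, Ca * Cb) descriptor of pairwise channel correlations,
    averaged over all spatial positions.
    """
    b, ca, h, w = feat_a.shape
    cb = feat_b.shape[1]
    a = feat_a.reshape(b, ca, h * w)
    c = feat_b.reshape(b, cb, h * w)
    phi = torch.bmm(a, c.transpose(1, 2)) / (h * w)  # (B, Ca, Cb) outer-product pool
    phi = phi.reshape(b, ca * cb)
    # signed square root plus L2 normalization, as in Improved B-CNN
    phi = torch.sign(phi) * torch.sqrt(phi.abs() + 1e-12)
    return F.normalize(phi, dim=1)
```

With two 512-channel feature maps this produces the 262,144-dimensional (262K) descriptor mentioned above, which is what motivates the compact approximations discussed next.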
In addition, bilinear pooling captures pairwise correlations between feature channels through the matrix outer product, which markedly improves recognition accuracy but inflates the feature description vector to 262K dimensions, linearly increasing the parameter count and computation of the fully-connected layer. Although CBP can reduce this dimensionality to some extent by fitting a second-order polynomial kernel with the low-dimensional random projections of the RM and TS algorithms, its computation involves Fourier transforms, which greatly increases the running time.
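A minimal sketch of the Tensor Sketch (TS) approximation used by CBP, written for second order; the hash construction follows the standard count-sketch recipe, and all names and dimensions are illustrative:

```python
import torch

class TensorSketch(torch.nn.Module):
    """Approximates inner products of C*C bilinear features with d-dim sketches."""

    def __init__(self, c: int, d: int):
        super().__init__()
        self.d = d
        for k in (1, 2):  # two independent hash/sign pairs
            self.register_buffer(f"h{k}", torch.randint(0, d, (c,)))
            self.register_buffer(f"s{k}", torch.randint(0, 2, (c,)).float() * 2 - 1)

    def count_sketch(self, x, h, s):
        # x: (N, C) -> (N, d); scatter signed features into hashed buckets
        out = x.new_zeros(x.shape[0], self.d)
        out.scatter_add_(1, h.expand(x.shape[0], -1), x * s)
        return out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # circular convolution of the two count sketches via FFT
        p1 = torch.fft.fft(self.count_sketch(x, self.h1, self.s1))
        p2 = torch.fft.fft(self.count_sketch(x, self.h2, self.s2))
        return torch.fft.ifft(p1 * p2).real
```

Each of the HW local descriptors is sketched independently and the results are pooled; with d = 8192 this reproduces the dimensionality quoted above while avoiding the explicit $C^2$ outer product. The FFT calls are also the source of the runtime overhead noted above.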
In summary, for weakly supervised fine-grained image recognition using only image-level label information, existing methods struggle to achieve high accuracy with low model parameter counts and computation. A lightweight, cross-layer feature interaction method for fine-grained image recognition that balances recognition accuracy against computational complexity is therefore needed.
Disclosure of Invention
To solve the above problems, the invention provides a lightweight fine-grained image recognition method with cross-layer feature interaction in weakly supervised scenes. The technical problem to be solved is to build a fine-grained recognition model using only image-level labels that attains high recognition accuracy while reducing the model's storage space and computational cost, making it suitable for large-scale real scenes. To this end, the method comprises the following steps:
(1) in the preprocessing stage, the original image of arbitrary size is scaled to 600 × 600 pixels, a 448 × 448 pixel region is cropped about the image center, the cropped region is normalized with mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225], and the normalized image is fed into the lightweight basic feature extraction network ResNet-G;
(2) the input image is passed through three different convolution layers of the lightweight backbone ResNet-G, producing the feature tensors $X \in \mathbb{R}^{H_1 \times W_1 \times C_1}$, $Y \in \mathbb{R}^{H_2 \times W_2 \times C_2}$ and $Z \in \mathbb{R}^{H_3 \times W_3 \times C_3}$, where $H_i$, $W_i$ and $C_i$ ($i = 1, 2, 3$) denote the height, width and number of channels of the convolution features;
(3) X, Y and Z are passed in parallel through the polynomial convolution module, i.e. three mutually independent 1 × 1 convolutions with stride $S_i$, $C_i$ input channels and $D$ output channels, producing the projection feature tensors $\tilde{X}, \tilde{Y}, \tilde{Z} \in \mathbb{R}^{H \times W \times D}$, where $H$ and $W$ denote the height and width of the projected features, $D$ denotes the projection dimension, and each stride $S_i$ is computed from the output height $H_i$ of the corresponding convolution layer and the projected height $H$;
(4) the interaction information between the projected features of the convolution layers is modeled by low-rank approximate polynomial kernel pooling; taking the polynomial kernel order r = 2 gives the linear classifier discrimination ability equivalent to a second-order polynomial kernel classifier:

$$\mathcal{F}_{XY} = \tilde{X} \odot \tilde{Y}, \qquad \mathcal{F}_{YZ} = \tilde{Y} \odot \tilde{Z}, \qquad \mathcal{F}_{XZ} = \tilde{X} \odot \tilde{Z}$$

where $\odot$ denotes the tensor dot product and $\mathcal{F}_{XY}$, $\mathcal{F}_{YZ}$, $\mathcal{F}_{XZ}$ denote the cross-layer second-order polynomial feature tensors;
(5) global average pooling aggregates the feature information of all spatial positions in each channel of the second-order polynomial feature tensors, further compressing the feature dimensionality into polynomial feature vectors:

$$f_{XY} = \frac{1}{HW} \sum_{s \in S} \mathcal{F}_{XY}(s), \qquad f_{YZ} = \frac{1}{HW} \sum_{s \in S} \mathcal{F}_{YZ}(s), \qquad f_{XZ} = \frac{1}{HW} \sum_{s \in S} \mathcal{F}_{XZ}(s)$$

where $f_{XY}$, $f_{YZ}$ and $f_{XZ}$ denote the cross-layer second-order polynomial feature vectors of the corresponding feature tensors, and $S = \{1, 2, \ldots, HW\}$ denotes the set of all spatial positions of the feature map;
(6) all cross-layer polynomial feature vectors are merged by feature concatenation, yielding the fine-grained image feature description vector:

$$f = [f_{XY};\, f_{YZ};\, f_{XZ}] \in \mathbb{R}^{3D};$$
(7) the image feature description vector is normalized element-wise by signed square-root normalization:

$$\hat{f} = \operatorname{sign}(f)\sqrt{|f|};$$
(8) the image feature description vector is then normalized by $L_2$ regularization:

$$\bar{f} = \hat{f} \,/\, \lVert \hat{f} \rVert_2;$$
(9) the normalized feature description vector $\bar{f}$ is fed into the classification fully-connected layer:

$$\theta = P\bar{f}$$

where $\theta \in \mathbb{R}^k$ denotes the output vector of the classification fully-connected layer, $P \in \mathbb{R}^{k \times 3D}$ denotes its weight parameter matrix, and $k$ denotes the number of target categories;
(10) the probability that the input image belongs to each category is computed with a softmax function:

$$\eta_i = \frac{\exp(\theta_i)}{\sum_{j=1}^{k} \exp(\theta_j)}, \qquad i = 1, 2, \ldots, k$$

where $\eta_i$ denotes the probability that the input image belongs to the i-th category.
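Steps (2) through (10) together form a small classification head on top of the backbone. The following PyTorch sketch summarizes them; the channel counts, strides and class count are illustrative assumptions rather than values fixed by the invention, and the softmax of step (10) is left to the caller:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerPolyHead(nn.Module):
    """Cross-layer second-order polynomial kernel pooling head, steps (2)-(10).

    Assumes three backbone feature maps whose strides bring them to a common
    H x W after the 1 x 1 projection convolutions.
    """

    def __init__(self, channels=(512, 1024, 2048), strides=(4, 2, 1),
                 proj_dim=8192, num_classes=200):
        super().__init__()
        self.proj = nn.ModuleList(
            nn.Conv2d(c, proj_dim, kernel_size=1, stride=s)
            for c, s in zip(channels, strides))
        self.fc = nn.Linear(3 * proj_dim, num_classes)  # matrix P of step (9)

    def forward(self, x, y, z):
        px, py, pz = (p(t) for p, t in zip(self.proj, (x, y, z)))  # step (3)
        feats = []
        for a, b in ((px, py), (py, pz), (px, pz)):  # cross-layer pairs, step (4)
            feats.append((a * b).mean(dim=(2, 3)))   # dot product + GAP, step (5)
        f = torch.cat(feats, dim=1)                            # step (6)
        f = torch.sign(f) * torch.sqrt(f.abs() + 1e-12)        # step (7)
        f = F.normalize(f, dim=1)                              # step (8)
        return self.fc(f)                                      # step (9)
```

Given the three feature maps x, y, z of step (2), `head(x, y, z).softmax(dim=1)` yields the per-category probabilities $\eta_i$ of step (10).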
As a further improvement of the invention, the lightweight basic feature extraction network ResNet-G used in step (1) is obtained by replacing the original Bottleneck residual module of the deep residual network ResNet-50 with a novel residual module based on multi-layer aggregated group convolution. The novel residual module splits feature extraction into a convolution branch and an identity-mapping branch; feature information propagates through the two branches in parallel and is finally merged by element-wise addition. Within the convolution branch, the module partially decouples the joint learning of spatial and channel relations performed by conventional convolution, simplifying the convolution operation. With the input and output dimension set to 256, the intermediate dimension set to 64, the number of groups set to g, and the number of output channels of each group's sub-convolution layer set to m, the convolution branch proceeds as follows (a code sketch follows these steps):
(1) a convolution layer with kernel size 1 × 1, 256 input channels and 64 output channels reduces the feature dimension;
(2) the dimension-reduced feature tensor is split into g groups along the channel axis, each group of sub-features holding $\frac{64}{g}$ channels, and the sub-feature groups are numbered;
(3) group 1 is copied: one copy is kept for the subsequent feature concatenation, the other is superposed onto group 2, and a 3 × 3 convolution with $\frac{64}{g}$ input channels and m output channels extracts further feature information;
(4) the output features are copied again: one copy is kept for the concatenation, the other is superposed onto group 3 and passed through a convolution layer with kernel size 3 × 3, $\frac{64}{g}$ input channels and m output channels;
(5) this continues in turn until all g groups have passed through a 3 × 3 convolution; the retained copy of each group is then concatenated with the convolution features of the g-th group, yielding 64-dimensional feature information;
(6) a 1 × 1 convolution with 64 input channels and 256 output channels restores the 64-dimensional features to the original dimension of 256.
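One plausible PyTorch realization of this convolution branch is sketched below, reading the "superposition" between groups as element-wise addition (which ties m to 64/g, as in the g = 4, m = 16 example of Fig. 2); class and variable names are illustrative:

```python
import torch
import torch.nn as nn

class MultiLayerGroupResidual(nn.Module):
    """Sketch of the novel residual module with multi-layer aggregated groups."""

    def __init__(self, channels: int = 256, width: int = 64, groups: int = 4):
        super().__init__()
        assert width % groups == 0
        self.gw = width // groups                      # channels per group, 64/g
        self.reduce = nn.Conv2d(channels, width, kernel_size=1)   # step (1)
        self.convs = nn.ModuleList(                    # one 3x3 conv per group 2..g
            nn.Conv2d(self.gw, self.gw, kernel_size=3, padding=1)
            for _ in range(groups - 1))
        self.expand = nn.Conv2d(width, channels, kernel_size=1)   # step (6)

    def forward(self, x):
        y = self.reduce(x)
        groups = torch.split(y, self.gw, dim=1)        # step (2)
        outs, prev = [groups[0]], groups[0]
        for conv, grp in zip(self.convs, groups[1:]):  # steps (3)-(5)
            prev = conv(prev + grp)                    # superpose copy, then 3x3 conv
            outs.append(prev)
        out = self.expand(torch.cat(outs, dim=1))      # cascade back, expand to 256
        return out + x                                 # identity-mapping branch
```

Stacking such modules in place of the Bottleneck blocks of ResNet-50 yields the ResNet-G backbone described above.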
As a further improvement of the invention, the low-rank approximate polynomial kernel pooling in step (4) fits the homogeneous polynomial kernel of a support vector machine via tensor decomposition, so that a linear classifier attains discrimination ability equivalent to a high-order polynomial kernel classifier. Specifically, to obtain the performance of an r-order polynomial kernel classifier, r mutually independent convolution layers, each with kernel size 1 × 1, C input channels and $D_r$ output channels, are first combined into an r-order polynomial convolution module. A preprocessed fine-grained image is fed to the lightweight backbone ResNet-G to obtain the convolution feature tensor $X \in \mathbb{R}^{H \times W \times C}$, which the r-order polynomial convolution module maps linearly into the set of projection feature tensors $\{\tilde{X}^{(1)}, \tilde{X}^{(2)}, \ldots, \tilde{X}^{(r)}\}$ with $\tilde{X}^{(i)} \in \mathbb{R}^{H \times W \times D_r}$. Finally, the r projection tensors are combined by the tensor dot product to obtain the high-order statistics of the features:

$$\mathcal{F} = \tilde{X}^{(1)} \odot \tilde{X}^{(2)} \odot \cdots \odot \tilde{X}^{(r)}$$

where $\odot$ denotes the tensor dot product and $\mathcal{F}$ denotes the approximate r-order statistics of the convolution features X output by the backbone network.
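A sketch of this general r-order polynomial convolution module, with the element-wise product standing in for the tensor dot product; names are illustrative:

```python
import torch
import torch.nn as nn

class PolyKernelPooling(nn.Module):
    """Low-rank approximation of an r-order homogeneous polynomial kernel."""

    def __init__(self, in_channels: int, proj_dim: int, order: int = 2):
        super().__init__()
        # r mutually independent 1 x 1 projection convolutions
        self.proj = nn.ModuleList(
            nn.Conv2d(in_channels, proj_dim, kernel_size=1)
            for _ in range(order))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.proj[0](x)
        for p in self.proj[1:]:
            out = out * p(x)            # combine the r projections element-wise
        return out.mean(dim=(2, 3))     # global average pooling -> (B, proj_dim)
```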
The beneficial effects of the disclosed lightweight fine-grained image recognition method with cross-layer feature interaction in weakly supervised scenes are as follows. Multi-layer aggregated group convolution replaces conventional convolution, partially decoupling the joint learning of spatial and channel relations of the feature information to build a novel residual module, which is embedded directly into the deep residual network ResNet-50 framework to lighten the backbone, simplifying convolution operations and reducing the backbone's parameter count. Low-rank approximate polynomial kernel pooling reduces the dimensionality of the image feature description vector, further compressing the storage space and computational cost of the classification fully-connected layer; it also gives the linear classifier discrimination ability equivalent to a high-order polynomial kernel classifier, effectively improving the model's capacity to fit complex feature distributions. Finally, the cross-layer feature interaction network framework fuses the interaction information of the backbone's layers, combining feature diversity to strengthen feature expression and learning, improving the generalization of the whole model and reducing the risk of overfitting; the comprehensive performance in recognition accuracy, computational complexity and technical feasibility is currently at the leading level.
Drawings
FIG. 1 is a schematic overall framework of the present invention;
FIG. 2 is a schematic diagram of a novel residual module according to the present invention;
FIG. 3 is a graph of the impact of polynomial order and projection dimension on identification accuracy in accordance with the present invention;
FIG. 4 shows visualizations of convolution-layer output features under the low-rank approximate polynomial kernel pooling of the present invention.
Detailed Description
The invention is described in further detail below with reference to the following detailed description and accompanying drawings:
the invention provides a light-weight fine-grained image recognition method for cross-layer feature interaction in a weak supervision scene, and solves the technical problem that only an image-level label is used for constructing a fine-grained recognition model, so that the storage space and the calculation cost of the model are reduced while higher recognition accuracy is obtained, and the method is suitable for large-scale real scenes.
As shown in fig. 1, the lightweight fine-grained image recognition method based on cross-layer feature interaction in a weakly supervised scene comprises the following steps:
step 1: in the preprocessing stage, an original image with any size is uniformly scaled to 600 × 600 pixels, a 448 × 448 pixel region is cut out on the basis of taking the center of the image as an origin, the cut region is normalized according to the mean value [0.485,0.456,0.406] and the standard deviation [0.229,0.224,0.225], and then the normalized image is input into a lightweight basic feature extraction network ResNet-G.
Step 2: the input image is passed through three different convolution layers of the lightweight backbone ResNet-G, producing the feature tensors $X \in \mathbb{R}^{H_1 \times W_1 \times C_1}$, $Y \in \mathbb{R}^{H_2 \times W_2 \times C_2}$ and $Z \in \mathbb{R}^{H_3 \times W_3 \times C_3}$, where $H_i$, $W_i$ and $C_i$ ($i = 1, 2, 3$) denote the height, width and number of channels of the convolution features.
Step 3: X, Y and Z are passed in parallel through the polynomial convolution module, i.e. three mutually independent 1 × 1 convolutions with stride $S_i$, $C_i$ input channels and $D$ output channels, producing the projection feature tensors $\tilde{X}, \tilde{Y}, \tilde{Z} \in \mathbb{R}^{H \times W \times D}$, where $H$ and $W$ denote the height and width of the projected features, $D$ denotes the projection dimension, and each stride $S_i$ is computed from the output height $H_i$ of the corresponding convolution layer and the projected height $H$.
Step 4: the interaction information between the projected features of the convolution layers is modeled by low-rank approximate polynomial kernel pooling; taking the polynomial kernel order r = 2 gives the linear classifier discrimination ability equivalent to a second-order polynomial kernel classifier:

$$\mathcal{F}_{XY} = \tilde{X} \odot \tilde{Y}, \qquad \mathcal{F}_{YZ} = \tilde{Y} \odot \tilde{Z}, \qquad \mathcal{F}_{XZ} = \tilde{X} \odot \tilde{Z}$$

where $\odot$ denotes the tensor dot product and $\mathcal{F}_{XY}$, $\mathcal{F}_{YZ}$, $\mathcal{F}_{XZ}$ denote the cross-layer second-order polynomial feature tensors.
Step 5: global average pooling aggregates the feature information of all spatial positions in each channel of the second-order polynomial feature tensors, further compressing the feature dimensionality into polynomial feature vectors:

$$f_{XY} = \frac{1}{HW} \sum_{s \in S} \mathcal{F}_{XY}(s), \qquad f_{YZ} = \frac{1}{HW} \sum_{s \in S} \mathcal{F}_{YZ}(s), \qquad f_{XZ} = \frac{1}{HW} \sum_{s \in S} \mathcal{F}_{XZ}(s)$$

where $f_{XY}$, $f_{YZ}$ and $f_{XZ}$ denote the cross-layer second-order polynomial feature vectors of the corresponding feature tensors, and $S = \{1, 2, \ldots, HW\}$ denotes the set of all spatial positions of the feature map.
Step 6: all cross-layer polynomial feature vectors are merged by feature concatenation, yielding the fine-grained image feature description vector $f = [f_{XY};\, f_{YZ};\, f_{XZ}] \in \mathbb{R}^{3D}$.
Step 7: the image feature description vector is normalized element-wise by signed square-root normalization, $\hat{f} = \operatorname{sign}(f)\sqrt{|f|}$.
Step 8: the image feature description vector is then normalized by $L_2$ regularization, $\bar{f} = \hat{f} \,/\, \lVert \hat{f} \rVert_2$.
Step 9: the normalized feature description vector $\bar{f}$ is fed into the classification fully-connected layer, $\theta = P\bar{f}$, where $\theta \in \mathbb{R}^k$ denotes the output vector of the classification fully-connected layer, $P \in \mathbb{R}^{k \times 3D}$ denotes its weight parameter matrix, and $k$ denotes the number of target categories.
Step 10: the probability that the input image belongs to each category is computed with a softmax function:

$$\eta_i = \frac{\exp(\theta_i)}{\sum_{j=1}^{k} \exp(\theta_j)}, \qquad i = 1, 2, \ldots, k$$

where $\eta_i$ denotes the probability that the input image belongs to the i-th category.
Fig. 1 gives the overall framework of the invention. The original image is preprocessed and fed into the lightweight backbone ResNet-G to extract feature information; the low-rank approximate polynomial kernel pooling selects the output activations of three different convolution layers and linearly maps them into three projection feature tensors of identical dimensions. The projected features are then cross-combined across layers, their interaction is measured by the tensor dot product, and global average pooling further compresses the feature dimension, yielding three D-dimensional cross-layer polynomial feature vectors. Next, all polynomial feature vector information is fused by feature concatenation into the fine-grained image feature description vector, which is normalized by element-wise signed square-root normalization and $L_2$ regularization. Finally, the normalized cross-layer polynomial feature description vector is fed into the classification fully-connected layer, and the image class probabilities are computed with a softmax function.
Fig. 2 is a schematic diagram of the novel residual module based on multi-layer aggregated group convolution, with the input and output dimensions set to 256, the number of groups g = 4, and m = 16 output channels per sub-convolution layer. The module splits feature extraction into a convolution branch and an identity-mapping branch; feature information propagates through the two branches in parallel and is finally merged by element-wise addition. In the convolution branch, a convolution layer with kernel size 1 × 1, 256 input channels and 64 output channels first reduces the dimension, and the reduced features are split into 4 groups of 16 channels each at the channel level. Group 1 is then copied: one copy is kept for the subsequent feature concatenation, the other is superposed onto group 2, and a 3 × 3 convolution with 16 input channels and 16 output channels extracts further feature information. The output features are copied again: one copy is kept for the concatenation, the other is superposed onto group 3 and passed through a convolution layer with kernel size 3 × 3, 16 input channels and 16 output channels. This continues until all 4 groups have passed through a 3 × 3 convolution, after which the retained copies are concatenated with the convolution features of the 4th group to give 64-dimensional feature information. Finally, a 1 × 1 convolution with 64 input channels and 256 output channels restores the features to their original dimension.
FIG. 3 shows the influence of the polynomial order r and the projection dimension $D_r$ of the low-rank approximate polynomial kernel pooling on recognition accuracy. The comparison experiments use the CUB-200-2011 fine-grained image dataset, with a ResNet-G backbone (g = 4 groups, m = 18 output channels per sub-convolution layer) as the image feature extractor and the output activation of convolution layer res5_c to approximately model the r-order homogeneous polynomial kernel. For an r-order polynomial convolution module, recognition accuracy improves as the projection dimension $D_r$ grows from 512 to 32768. In particular, for r = 2 the model reaches about 83.0% accuracy at $D_r$ = 512 versus about 86.3% at $D_r$ = 32768, a gain of 3.3%. However, raising $D_r$ from 8192 to 32768 only lifts the r = 2 model from 86.0% to 86.3%, a gain of merely 0.3%, with limited improvement in model performance, while the polynomial feature vector at $D_r$ = 32768 is four times the dimension of that at $D_r$ = 8192. Subsequent experiments therefore linearly map the original convolution features with a polynomial convolution module of 8192 output channels, balancing recognition accuracy against computational complexity. Furthermore, at $D_r$ = 2048 the model with polynomial kernel order r = 2 already reaches 84.9% accuracy, 1.8% above a linear SVM classifier, showing that the low-rank approximate polynomial kernel pooling effectively models fine-grained feature interactions and enriches the image feature description vector with discriminative information. As the polynomial order r increases from 2 to 4, the recognition accuracy of the higher-order polynomial kernel classifier actually drops, because interactions between low-order features are more efficient and reliable. A polynomial kernel of relatively low order therefore suffices to capture the interaction information of fine-grained images from the output feature tensor.
FIG. 4 visualizes convolution-layer output features under the low-rank approximate polynomial kernel pooling of the invention on the CUB-200-2011, FGVC Aircraft and Stanford Cars fine-grained image datasets. Each feature response map is obtained by averaging the feature information over all channels; the feature information of projection layers proj5_a, proj5_b and proj5_c is generated by passing the output activations of convolution layers res5_a, res5_b and res5_c of the backbone ResNet-G through the polynomial convolution module. As the figure shows, for all three fine-grained datasets the method ignores background interference and automatically localizes the semantically strong, discriminative local key regions of the image (the white areas), such as the head, wings and torso of birds in CUB-200-2011, the cockpit, engines and tail stabilizer of airplanes in FGVC Aircraft, and the bumper, headlights and wheels of cars in Stanford Cars. For a single test picture, convolution layers res5_a, res5_b and res5_c provide rough spatial position information of the target and contain some noise; the projection layers proj5_a, proj5_b and proj5_c refine this further, with a certain bias, localizing each key region of the target and extracting its features. The low-rank approximate polynomial kernel pooling then models the interaction between the feature information of different key parts, mining and capturing the latent relations between these local regions, and integrates multiple streams of cross-layer interaction information to perceive the image from the local to the whole, a process consistent with human cognition. The method thus autonomously localizes and perceives each key part of the target in a fine-grained image, which explains how subtle differences between targets of different classes are captured effectively and accurately without explicit detection of the target location, yielding better recognition accuracy.
Table 1 lists ResNet-G experimental results under different hyper-parameter settings and compares them with the ResNet-50 and ResNeXt networks. The parameter g denotes the number of groups of the 3 × 3 convolution layers in the novel residual module, and m denotes the number of output channels of each sub-convolution layer; the recognition model connects the feature tensor extracted by each backbone to a fully-connected layer after global average pooling, then computes the target class probabilities with a softmax function. The data show that ResNet-G effectively compresses model storage by introducing the novel residual module. Notably, although group convolution severs connections between the spatial positions and channels of the feature tensor, it does not necessarily weaken the network's feature extraction ability: the ResNet-G model with hyper-parameters g = 4 and m = 24 reaches 84.0% recognition accuracy, 1.8% above ResNet-50. The novel residual module also uses shortcut connections within the group convolution; these fuse multi-scale, multi-level feature information on the one hand, and on the other let each group gather all channel information of the preceding group, reducing the information loss caused by decoupling the spatial and channel relations. With the number of groups set to 4 and 18 output channels per sub-convolution layer, ResNet-G reaches 83.1% accuracy, 0.9% above ResNet-50, while its model storage occupies only 68.8% of ResNet-50's and its computation drops by nearly 30%; subsequent experiments and analyses therefore use this hyper-parameter setting to build the image basic feature extractor. ResNeXt compensates for the information loss of group convolution by widening the input and output channels of the middle 3 × 3 convolution layer, which increases the parameters and computation of the whole network. Under the same hyper-parameters, g = 4 and m = 24, ResNeXt's overall classification accuracy is 83.4%, 0.6% below ResNet-G, while its model storage and computation are 90.16 MB and 16.9 GFLOPs, respectively 8.0% and 15.0% more than ResNet-G. Fusing multi-scale information through shortcut connections inside the group convolution thus markedly strengthens feature expression without increasing convolution-layer parameters, further improving recognition accuracy. Meanwhile, the first group of feature tensors in the novel residual module enters the feature concatenation directly, without any convolution, which reduces the model's parameters and computation. Moreover, in terms of architecture and extensibility, ResNet-G builds the whole network by stacking novel residual modules with an identical topology, a process involving only two kinds of hyper-parameters.
By contrast, the currently mainstream Inception-series lightweight networks contain a large number of manually chosen hyper-parameters that must be tuned to the data distribution, increasing the design burden. In summary, the ResNet-G backbone based on the novel residual module performs remarkably in architecture, feature learning and computational complexity.
TABLE 1 Performance comparison of the ResNet, ResNeXt and ResNet-G backbone networks
(table available only as an image in the original publication)
Table 2 compares the classification-layer complexity of recognition models using the low-rank approximate polynomial kernel pooling of the invention against two other pooling schemes, where H, W, C and k denote the height, width and channel count of the feature tensor and the number of target categories, respectively. The numbers in brackets are typical values for the three pooling methods on the CUB-200-2011 fine-grained recognition task. The bilinear pooling of Bilinear CNN (B-CNN) captures correlations between feature channels, which raises the feature description vector dimension to $C^2$; for the k = 200 classification task, the parameters of the fully-connected layer occupy 200 MB of storage. Boost-CNN improves classification by integrating 9 B-CNNs; with each base learner outputting a $C^2$-dimensional feature vector, Boost-CNN generates $9C^2$-dimensional features during training. For CBP, simulation results show that an 8192-dimensional TS feature has the same representational ability as the 262K-dimensional bilinear feature, compressing the classification-layer parameters by nearly 96.5%; however, its pooling process involves the fast Fourier transform (FFT), which actually slows down its computation.
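The classification-layer sizes quoted here follow from simple arithmetic; the short sketch below reproduces the typical values for C = 512 feature channels and k = 200 classes, assuming float32 storage:

```python
# Parameter counts of the classification fully-connected layer.
C, k = 512, 200

bilinear = C * C * k          # B-CNN: 262,144-dim bilinear descriptor
boosted = 9 * bilinear        # Boost-CNN: ensemble of 9 base learners
compact = 8192 * k            # CBP: 8,192-dim Tensor Sketch feature

for name, params in [("B-CNN", bilinear), ("Boost-CNN", boosted), ("CBP", compact)]:
    print(f"{name}: {params / 1e6:.1f}M params, {params * 4 / 2**20:.1f} MB")
# B-CNN: 52.4M params, 200.0 MB; CBP keeps only a few percent of that.
```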
TABLE 2 Computational complexity comparison of multiple pooling schemes
(table available only as an image in the original publication)
Table 3 compares the lightweight cross-layer feature interaction fine-grained recognition method of the invention with mainstream fine-grained recognition methods. The data show that its comprehensive performance in recognition accuracy, model complexity and technical feasibility is currently at the leading level. Two-Level is a fine-grained recognition model based on weak supervision information that trains object-level and part-level classifiers using image-level labels and the spatial positions of target key parts; its parameter count reaches 138.4M, 2.05 times that of the invention, while its recognition accuracy on CUB-200-2011 is only 75.7%, 12.2% below the invention. PN-CNN, a strongly supervised fine-grained recognition model, adds a pose-alignment operation on top of Part-based R-CNN and attains 85.4% accuracy on CUB-200-2011. PN-CNN extracts features of the whole target region, the head region and the torso region with three mutually independent AlexNet networks and fuses the feature groups by concatenation, so that the final feature description vector carries information on both the target region and the local key regions. The three backbones have different parameters and each contains a separate fully-connected layer, raising the overall parameter count to 173.0M. Following the same idea, Mask-CNN also perceives the global and local feature information of the target with several mutually independent sub-networks; unlike PN-CNN, it first applies global average pooling and max pooling to the convolution features to reduce the feature dimension, then predicts the target category by concatenating the feature information of all sub-networks. This operation markedly reduces the parameters and computation of the fully-connected layer: with a VGG16 backbone, Mask-CNN reaches 85.7% accuracy, 0.3% above PN-CNN, with only 60.5M parameters, a compression of nearly 65.0%. The recognition accuracy of the invention is 87.9%, 0.6% above the ResNet-50-based Mask-CNN, yet its parameters are only 77.8% of Mask-CNN's. Moreover, Mask-CNN requires the spatial positions of key parts in addition to image-level category labels during training, so it trails the invention in recognition accuracy, computational complexity and technical feasibility alike. RA-CNN is a recurrent self-attention fine-grained recognition model composed of a three-branch network, each sub-network containing a classification module and an attention proposal network (APN) module.
RA-CNN progressively enlarges local regions through the APN so that the model gradually focuses on the target's key parts during training, obtaining recognition accuracies of 85.3%, 88.2% and 92.5% on the CUB-200-2011, FGVC Aircraft and Stanford Cars datasets, respectively 2.6%, 3.7% and 1.6% below the invention. Because RA-CNN contains a three-branch network trained serially, its parameters reach 429.0M, 6.36 times those of the invention. MA-CNN is a weakly supervised fine-grained recognition model built on a single backbone. It uses a channel grouping module to generate attention regions autonomously, extracts the corresponding feature information, and feeds each attention region into a separate fully-connected layer. With 4 attention regions, MA-CNN's parameters reach 144M, 2.14 times those of the invention. MA-CNN also trains the channel grouping module and the classification module alternately, a more cumbersome scheme that easily falls into local optima, whereas the invention updates its model parameters in an end-to-end training manner.
TABLE 3 Performance comparison of the present invention with classical fine-grained image recognition models
(table available only as an image in the original publication)
The above description is only a preferred embodiment of the present invention and is not intended to limit the invention in any way; any modification or equivalent variation made according to the technical spirit of the invention falls within the scope of the invention as claimed.

Claims (3)

1. A lightweight fine-grained image recognition method with cross-layer feature interaction in a weakly supervised scene, characterized by comprising the following steps:
(1) in the preprocessing stage, the original image of arbitrary size is scaled to 600 × 600 pixels, a 448 × 448 pixel region is cropped about the image center, the cropped region is normalized with mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225], and the normalized image is fed into the lightweight basic feature extraction network ResNet-G;
(2) the input image is passed through three different convolution layers of the lightweight backbone ResNet-G, producing the feature tensors $X \in \mathbb{R}^{H_1 \times W_1 \times C_1}$, $Y \in \mathbb{R}^{H_2 \times W_2 \times C_2}$ and $Z \in \mathbb{R}^{H_3 \times W_3 \times C_3}$, where $H_i$, $W_i$ and $C_i$ ($i = 1, 2, 3$) denote the height, width and number of channels of the convolution features;
(3) X, Y and Z are passed in parallel through the polynomial convolution module, i.e. three mutually independent 1 × 1 convolutions with stride $S_i$, $C_i$ input channels and $D$ output channels, producing the projection feature tensors $\tilde{X}, \tilde{Y}, \tilde{Z} \in \mathbb{R}^{H \times W \times D}$, where $H$ and $W$ denote the height and width of the projected features, $D$ denotes the projection dimension, and each stride $S_i$ is computed from the output height $H_i$ of the corresponding convolution layer and the projected height $H$;
(4) the interaction information between the projected features of the convolution layers is modeled by low-rank approximate polynomial kernel pooling; taking the polynomial kernel order r = 2 gives the linear classifier discrimination ability equivalent to a second-order polynomial kernel classifier:

$$\mathcal{F}_{XY} = \tilde{X} \odot \tilde{Y}, \qquad \mathcal{F}_{YZ} = \tilde{Y} \odot \tilde{Z}, \qquad \mathcal{F}_{XZ} = \tilde{X} \odot \tilde{Z}$$

where $\odot$ denotes the tensor dot product and $\mathcal{F}_{XY}$, $\mathcal{F}_{YZ}$, $\mathcal{F}_{XZ}$ denote the cross-layer second-order polynomial feature tensors;
(5) global average pooling aggregates the feature information of all spatial positions in each channel of the second-order polynomial feature tensors, further compressing the feature dimensionality into polynomial feature vectors:

$$f_{XY} = \frac{1}{HW} \sum_{s \in S} \mathcal{F}_{XY}(s), \qquad f_{YZ} = \frac{1}{HW} \sum_{s \in S} \mathcal{F}_{YZ}(s), \qquad f_{XZ} = \frac{1}{HW} \sum_{s \in S} \mathcal{F}_{XZ}(s)$$

where $f_{XY}$, $f_{YZ}$ and $f_{XZ}$ denote the cross-layer second-order polynomial feature vectors of the corresponding feature tensors, and $S = \{1, 2, \ldots, HW\}$ denotes the set of all spatial positions of the feature map;
(6) all cross-layer polynomial feature vectors are merged by feature concatenation, yielding the fine-grained image feature description vector:

$$f = [f_{XY};\, f_{YZ};\, f_{XZ}] \in \mathbb{R}^{3D};$$
(7) the image feature description vector is normalized element-wise by signed square-root normalization:

$$\hat{f} = \operatorname{sign}(f)\sqrt{|f|};$$
(8) the image feature description vector is then normalized by $L_2$ regularization:

$$\bar{f} = \hat{f} \,/\, \lVert \hat{f} \rVert_2;$$
(9) the normalized feature description vector $\bar{f}$ is fed into the classification fully-connected layer:

$$\theta = P\bar{f}$$

where $\theta \in \mathbb{R}^k$ denotes the output vector of the classification fully-connected layer, $P \in \mathbb{R}^{k \times 3D}$ denotes its weight parameter matrix, and $k$ denotes the number of target categories;
(10) the probability that the input image belongs to each category is computed with a softmax function:

$$\eta_i = \frac{\exp(\theta_i)}{\sum_{j=1}^{k} \exp(\theta_j)}, \qquad i = 1, 2, \ldots, k$$

where $\eta_i$ denotes the probability that the input image belongs to the i-th category.
2. The lightweight fine-grained image recognition method with cross-layer feature interaction in a weakly supervised scene according to claim 1, characterized in that the lightweight basic feature extraction network ResNet-G used in step (1) is obtained by replacing the original Bottleneck residual module of the deep residual network ResNet-50 with a novel residual module based on multi-layer aggregated group convolution. The novel residual module splits feature extraction into a convolution branch and an identity-mapping branch; feature information propagates through the two branches in parallel and is finally merged by element-wise addition. Within the convolution branch, the module partially decouples the joint learning of spatial and channel relations performed by conventional convolution, simplifying the convolution operation. With the input and output dimension set to 256, the intermediate dimension set to 64, the number of groups set to g, and the number of output channels of each group's sub-convolution layer set to m, the convolution branch proceeds as follows:
(1) a convolution layer with kernel size 1 × 1, 256 input channels and 64 output channels reduces the feature dimension;
(2) the dimension-reduced feature tensor is split into g groups along the channel axis, each group of sub-features holding $\frac{64}{g}$ channels, and the sub-feature groups are numbered;
(3) group 1 is copied: one copy is kept for the subsequent feature concatenation, the other is superposed onto group 2, and a 3 × 3 convolution with $\frac{64}{g}$ input channels and m output channels extracts further feature information;
(4) the output features are copied again: one copy is kept for the concatenation, the other is superposed onto group 3 and passed through a convolution layer with kernel size 3 × 3, $\frac{64}{g}$ input channels and m output channels;
(5) this continues in turn until all g groups have passed through a 3 × 3 convolution; the retained copy of each group is then concatenated with the convolution features of the g-th group, yielding 64-dimensional feature information;
(6) a 1 × 1 convolution with 64 input channels and 256 output channels restores the 64-dimensional features to the original dimension of 256.
3. The lightweight fine-grained image recognition method with cross-layer feature interaction in a weakly supervised scene according to claim 1, characterized in that the low-rank approximate polynomial kernel pooling in step (4) fits the homogeneous polynomial kernel of a support vector machine via tensor decomposition, so that a linear classifier attains discrimination ability equivalent to a high-order polynomial kernel classifier. Specifically, to obtain the performance of an r-order polynomial kernel classifier, r mutually independent convolution layers, each with kernel size 1 × 1, C input channels and $D_r$ output channels, are first combined into an r-order polynomial convolution module; a preprocessed fine-grained image is fed to the lightweight backbone ResNet-G to obtain the convolution feature tensor $X \in \mathbb{R}^{H \times W \times C}$, which the r-order polynomial convolution module maps linearly into the set of projection feature tensors $\{\tilde{X}^{(1)}, \tilde{X}^{(2)}, \ldots, \tilde{X}^{(r)}\}$ with $\tilde{X}^{(i)} \in \mathbb{R}^{H \times W \times D_r}$; finally, the r projection tensors are combined by the tensor dot product to obtain the high-order statistics of the features:

$$\mathcal{F} = \tilde{X}^{(1)} \odot \tilde{X}^{(2)} \odot \cdots \odot \tilde{X}^{(r)}$$

where $\odot$ denotes the tensor dot product and $\mathcal{F}$ denotes the approximate r-order statistics of the convolution features X output by the backbone network.
CN202010505152.2A 2020-04-21 2020-06-05 Lightweight fine-grained image identification method for cross-layer feature interaction in weak supervision scene Active CN111652236B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010317205 2020-04-21
CN2020103172058 2020-04-21

Publications (2)

Publication Number Publication Date
CN111652236A 2020-09-11
CN111652236B CN111652236B (en) 2022-04-29

Family

ID=72347337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010505152.2A Active CN111652236B (en) 2020-04-21 2020-06-05 Lightweight fine-grained image identification method for cross-layer feature interaction in weak supervision scene

Country Status (1)

Country Link
CN (1) CN111652236B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409384A (en) * 2018-09-30 2019-03-01 内蒙古科技大学 Image recognition method, device, medium and equipment based on fine-grained images
CN109359684A (en) * 2018-10-17 2019-02-19 苏州大学 Fine-grained model recognition method based on weakly supervised localization and subclass similarity measurement
CN110147841A (en) * 2019-05-22 2019-08-20 桂林电子科技大学 Fine-grained classification method based on weakly supervised and unsupervised part detection and segmentation
CN110378356A (en) * 2019-07-16 2019-10-25 北京中科研究院 Fine-grained image recognition method based on multi-objective Lagrangian regularization
CN110689091A (en) * 2019-10-18 2020-01-14 中国科学技术大学 Weak supervision fine-grained object classification method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李思瑶 et al., "Fine-grained image classification with multi-scale feature fusion" (多尺度特征融合的细粒度图像分类), Laser & Optoelectronics Progress (《激光与光电子学进展》) *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101547B (en) * 2020-09-14 2024-04-16 中国科学院上海微系统与信息技术研究所 Pruning method and device for network model, electronic equipment and storage medium
CN112101547A (en) * 2020-09-14 2020-12-18 中国科学院上海微系统与信息技术研究所 Pruning method and device for network model, electronic equipment and storage medium
CN112183602A (en) * 2020-09-22 2021-01-05 天津大学 Multi-layer feature fusion fine-grained image classification method with parallel convolution blocks
WO2022095584A1 (en) * 2020-11-06 2022-05-12 神思电子技术股份有限公司 Image recognition method based on stream convolution
CN112465118A (en) * 2020-11-26 2021-03-09 大连理工大学 Low-rank generation type countermeasure network construction method for medical image generation
CN112465118B (en) * 2020-11-26 2022-09-16 大连理工大学 Low-rank generation type countermeasure network construction method for medical image generation
CN112381061A (en) * 2020-12-04 2021-02-19 中国科学院大学 Facial expression recognition method and system
CN112381061B (en) * 2020-12-04 2022-07-12 中国科学院大学 Facial expression recognition method and system
CN112686242A (en) * 2020-12-29 2021-04-20 昆明理工大学 Fine-grained image classification method based on multilayer focusing attention network
CN112507982A (en) * 2021-02-02 2021-03-16 成都东方天呈智能科技有限公司 Cross-model conversion system and method for face feature codes
CN113222998B (en) * 2021-04-13 2022-05-31 天津大学 Semi-supervised image semantic segmentation method and device based on self-supervised low-rank network
CN113222998A (en) * 2021-04-13 2021-08-06 天津大学 Semi-supervised image semantic segmentation method and device based on self-supervised low-rank network
CN113240659B (en) * 2021-05-26 2022-02-25 广州天鹏计算机科技有限公司 Heart nuclear magnetic resonance image lesion structure extraction method based on deep learning
CN113240659A (en) * 2021-05-26 2021-08-10 广州天鹏计算机科技有限公司 Image feature extraction method based on deep learning
CN113327284A (en) * 2021-05-27 2021-08-31 北京百度网讯科技有限公司 Image recognition method and device, electronic equipment and storage medium
CN113327284B (en) * 2021-05-27 2022-08-26 北京百度网讯科技有限公司 Image recognition method and device, electronic equipment and storage medium
CN113441411A (en) * 2021-07-31 2021-09-28 北京五指术健康科技有限公司 Rubbish letter sorting equipment based on augmented reality
CN113343991B (en) * 2021-08-02 2023-06-09 四川新网银行股份有限公司 Weak supervision learning method with enhanced characteristics
CN113343991A (en) * 2021-08-02 2021-09-03 四川新网银行股份有限公司 Feature-enhanced weak supervised learning method
CN114745465A (en) * 2022-03-24 2022-07-12 马斌斌 Interactive noise self-prior sensing analysis system for smart phone
CN114898775A (en) * 2022-04-24 2022-08-12 中国科学院声学研究所南海研究站 Voice emotion recognition method and system based on cross-layer cross fusion
CN116503671A (en) * 2023-06-25 2023-07-28 电子科技大学 Image classification method based on residual network compression of effective rank tensor approximation
CN116503671B (en) * 2023-06-25 2023-08-29 电子科技大学 Image classification method based on residual network compression of effective rank tensor approximation

Also Published As

Publication number Publication date
CN111652236B (en) 2022-04-29

Similar Documents

Publication Publication Date Title
CN111652236B (en) Lightweight fine-grained image identification method for cross-layer feature interaction in weak supervision scene
Yu et al. Hierarchical bilinear pooling for fine-grained visual recognition
Wang et al. Deep learning algorithms with applications to video analytics for a smart city: A survey
CN106599797B (en) A kind of infrared face recognition method based on local parallel neural network
Ji et al. 3D convolutional neural networks for human action recognition
CN112307995B (en) Semi-supervised pedestrian re-identification method based on feature decoupling learning
US10776628B2 (en) Video action localization from proposal-attention
US11223782B2 (en) Video processing using a spectral decomposition layer
US11270425B2 (en) Coordinate estimation on n-spheres with spherical regression
CN111460980A (en) Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion
CN113159067A (en) Fine-grained image identification method and device based on multi-grained local feature soft association aggregation
CN114596589A (en) Domain-adaptive pedestrian re-identification method based on interactive cascade lightweight transformations
John et al. Real-time hand posture and gesture-based touchless automotive user interface using deep learning
Yuan et al. Few-shot scene classification with multi-attention deepemd network in remote sensing
CN114998638A (en) Multi-view three-dimensional point cloud classification method based on dynamic and static convolution fusion neural network
Munoz. Inference machines: Parsing scenes via iterated predictions
Khellal et al. Pedestrian classification and detection in far infrared images
Qian et al. Classification of rice seed variety using point cloud data combined with deep learning
Van Hoai et al. Feeding Convolutional Neural Network by hand-crafted features based on Enhanced Neighbor-Center Different Image for color texture classification
CN114492634A (en) Fine-grained equipment image classification and identification method and system
Xin et al. Random part localization model for fine grained image classification
Özyurt et al. A new method for classification of images using convolutional neural network based on Dwt-Svd perceptual hash function
Sang et al. Image recognition based on multiscale pooling deep convolution neural networks
Trimech et al. Data augmentation using non-rigid cpd registration for 3d facial expression recognition
Khan et al. Texture gradient and deep features fusion-based image scene geometry recognition system using extreme learning machine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant