CN111652236A - Lightweight fine-grained image identification method for cross-layer feature interaction in weak supervision scene - Google Patents


Info

Publication number
CN111652236A
CN111652236A (application CN202010505152.2A)
Authority
CN
China
Prior art keywords
feature
convolution
layer
image
tensor
Prior art date
Legal status
Granted
Application number
CN202010505152.2A
Other languages
Chinese (zh)
Other versions
CN111652236B (en)
Inventor
李春国 (Li Chunguo)
刘杨 (Liu Yang)
杨哲 (Yang Zhe)
胡健 (Hu Jian)
杨绿溪 (Yang Luxi)
徐琴珍 (Xu Qinzhen)
Current Assignee
Southeast University
Original Assignee
Southeast University
Priority date
Filing date
Publication date
Application filed by Southeast University
Publication of CN111652236A
Application granted
Publication of CN111652236B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/32: Normalisation of the pattern dimensions
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The method replaces conventional convolution with multi-layer aggregated group convolution to construct a novel residual module, which is embedded directly into a deep residual network framework to obtain a lightweight backbone. Interactions between features are then modeled by computationally efficient, low-rank approximate polynomial kernel pooling, which compresses the dimension of the feature description vector and reduces the storage footprint and computational cost of the classification fully-connected layer; at the same time, this pooling scheme gives a linear classifier discrimination ability equivalent to that of a high-order polynomial kernel classifier, markedly improving recognition accuracy. Finally, a cross-layer feature interaction network framework fuses diverse features, strengthening feature learning and expression while reducing the risk of overfitting. The comprehensive performance of this lightweight, cross-layer feature interaction method for fine-grained image recognition in weakly supervised scenes is currently at the leading level in recognition accuracy, computational complexity, and technical feasibility.

Description

Lightweight fine-grained image identification method for cross-layer feature interaction in weak supervision scene
Technical Field
The invention belongs to the field of computer vision. It concerns a fine-grained image recognition method that uses only weak supervision from image-level labels and combines low-rank approximate polynomial kernel pooling with a cross-layer feature interaction network framework; in particular, it relates to a lightweight fine-grained image recognition method with cross-layer feature interaction in weakly supervised scenes.
Background
With the rapid development of Internet technology, human society has entered the information age, and the total amount of data stored in networks as text, images, speech, video and other media has grown exponentially. Image data is vivid and intuitive, unrestricted by region or language, and has gradually become a mainstream information carrier with broad application prospects and practical research significance. Meanwhile, advances in parallel computing theory and hardware have made large-scale image processing feasible, spurring research across computer vision, including image recognition, object detection and semantic segmentation. Image recognition is a fundamental research topic in computer vision: an acquired image is preprocessed, feature information is extracted, and a classifier built on those features determines the category of the target in the image. In conventional image recognition the target categories are usually coarse-grained, such as pedestrians, cats and dogs, or vehicles. Such cross-species targets show significant appearance differences and no subordinate relations, so they are relatively easy to recognize. In many real applications, however, the target to be identified belongs to a fine-grained category, i.e. to one of several subclasses under a coarse-grained category, such as different varieties of flowers or different models of automobiles. Compared with coarse-grained recognition, targets of different subclasses in a fine-grained dataset look highly similar, while targets of the same subclass can differ markedly in appearance owing to pose, viewpoint, occlusion and other factors.
Deep-learning-based image recognition, which relies on massive data and artificial neural networks that autonomously learn high-level semantic image features, can describe image information from multiple angles and at multiple levels, is highly robust, and has drawn wide attention in academia and industry. Many deep learning models have been built and applied to fine-grained image recognition, yielding preliminary research results. According to the strength of the supervision used during training, deep-learning-based fine-grained image recognition divides into strongly supervised and weakly supervised approaches. Strongly supervised algorithms achieve high-accuracy recognition by introducing additional supervision and complex auxiliary detection models, but the manually annotated supervision is expensive to obtain, which limits deployment in large-scale real scenes. Weakly supervised fine-grained recognition, by contrast, accurately determines the target category using only image-level labels during training; its practicality and scalability have made it the mainstream direction of current fine-grained recognition research. The weakly supervised Bilinear CNN extracts image features with two mutually independent backbone networks and captures pairwise correlations between feature channels through a matrix outer product, obtaining second-order statistics of the convolution features so that a linear classifier has the same discrimination ability as a second-order polynomial kernel classifier (see T.-Y. Lin, A. RoyChowdhury, S. Maji. Bilinear CNN Models for Fine-Grained Visual Recognition, 2015). Improved B-CNN applies square-root normalization to the bilinear feature description matrix to compress the dynamic range of the feature values and further stabilizes the model with measures such as L2 regularization (see T.-Y. Lin, S. Maji. Improved Bilinear Pooling with CNNs, 2017). Boost-CNN borrows the idea of ensemble learning to combine several weak Bilinear CNN classifiers by boosting, solving a least-squares objective to determine the weight coefficient of each base learner and thereby build a strong classifier (see M. Moghimi et al. Boosted Convolutional Neural Networks, 2016). CBP fits a second-order polynomial kernel with two approximation algorithms, Random Maclaurin (RM) and Tensor Sketch (TS), so that an 8192-dimensional TS feature has the same expressive power as the 262K-dimensional bilinear feature (see Y. Gao et al. Compact Bilinear Pooling, 2016). However, because information is lost during forward propagation through a convolutional neural network, Bilinear CNN and its variants perform bilinear pooling only on the top convolutional activation of the deep network; features from a single convolution layer do not adequately describe the semantics of all key regions of an image, and taking them directly as the reference features can discard discriminative information that matters for fine-grained recognition.
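For concreteness, the pairwise channel-correlation (outer-product) pooling that Bilinear CNN performs can be sketched in a few lines of PyTorch; this is an illustrative sketch, not code from the cited papers:

```python
import torch
import torch.nn.functional as F

def bilinear_pool(feat_a: torch.Tensor, feat_b: torch.Tensor) -> torch.Tensor:
    """Bilinear pooling of two convolutional feature maps of shape (B, C, H, W).

    Returns a (B, Ca * Cb) descriptor of pairwise channel correlations,
    averaged over all spatial positions.
    """
    b, ca, h, w = feat_a.shape
    cb = feat_b.shape[1]
    a = feat_a.reshape(b, ca, h * w)
    c = feat_b.reshape(b, cb, h * w)
    phi = torch.bmm(a, c.transpose(1, 2)) / (h * w)  # (B, Ca, Cb) outer-product pool
    phi = phi.reshape(b, ca * cb)
    # signed square root plus L2 normalization, as in Improved B-CNN
    phi = torch.sign(phi) * torch.sqrt(phi.abs() + 1e-12)
    return F.normalize(phi, dim=1)
```

With two 512-channel feature maps this produces the 262,144-dimensional (262K) descriptor mentioned above, which is what motivates the compact approximations discussed next.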
In addition, bilinear pooling captures pairwise correlations between feature channels through the matrix outer product, which markedly improves recognition accuracy but inflates the feature description vector to 262K dimensions, linearly increasing the parameter count and computation of the fully-connected layer. Although CBP can reduce this dimensionality to some extent by fitting a second-order polynomial kernel with the low-dimensional random projections of the RM and TS algorithms, its computation involves Fourier transforms, which greatly increases the running time.
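A minimal sketch of the Tensor Sketch (TS) approximation used by CBP, written for second order; the hash construction follows the standard count-sketch recipe, and all names and dimensions are illustrative:

```python
import torch

class TensorSketch(torch.nn.Module):
    """Approximates inner products of C*C bilinear features with d-dim sketches."""

    def __init__(self, c: int, d: int):
        super().__init__()
        self.d = d
        for k in (1, 2):  # two independent hash/sign pairs
            self.register_buffer(f"h{k}", torch.randint(0, d, (c,)))
            self.register_buffer(f"s{k}", torch.randint(0, 2, (c,)).float() * 2 - 1)

    def count_sketch(self, x, h, s):
        # x: (N, C) -> (N, d); scatter signed features into hashed buckets
        out = x.new_zeros(x.shape[0], self.d)
        out.scatter_add_(1, h.expand(x.shape[0], -1), x * s)
        return out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # circular convolution of the two count sketches via FFT
        p1 = torch.fft.fft(self.count_sketch(x, self.h1, self.s1))
        p2 = torch.fft.fft(self.count_sketch(x, self.h2, self.s2))
        return torch.fft.ifft(p1 * p2).real
```

Each of the HW local descriptors is sketched independently and the results are pooled; with d = 8192 this reproduces the dimensionality quoted above while avoiding the explicit $C^2$ outer product. The FFT calls are also the source of the runtime overhead noted above.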
In summary, for weakly supervised fine-grained image recognition using only image-level label information, existing methods struggle to achieve high accuracy with low model parameter counts and computation. A lightweight, cross-layer feature interaction method for fine-grained image recognition that balances recognition accuracy against computational complexity is therefore needed.
Disclosure of Invention
To solve the above problems, the invention provides a lightweight fine-grained image recognition method with cross-layer feature interaction in weakly supervised scenes. The technical problem to be solved is to build a fine-grained recognition model using only image-level labels that attains high recognition accuracy while reducing the model's storage space and computational cost, making it suitable for large-scale real scenes. To this end, the method comprises the following steps:
(1) in the preprocessing stage, the original image of arbitrary size is scaled to 600 × 600 pixels, a 448 × 448 pixel region is cropped about the image center, the cropped region is normalized with mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225], and the normalized image is fed into the lightweight basic feature extraction network ResNet-G;
(2) the input image is passed through three different convolution layers of the lightweight backbone ResNet-G, producing the feature tensors $X \in \mathbb{R}^{H_1 \times W_1 \times C_1}$, $Y \in \mathbb{R}^{H_2 \times W_2 \times C_2}$ and $Z \in \mathbb{R}^{H_3 \times W_3 \times C_3}$, where $H_i$, $W_i$ and $C_i$ ($i = 1, 2, 3$) denote the height, width and number of channels of the convolution features;
(3) X, Y and Z are passed in parallel through the polynomial convolution module, i.e. three mutually independent 1 × 1 convolutions with stride $S_i$, $C_i$ input channels and $D$ output channels, producing the projection feature tensors $\tilde{X}, \tilde{Y}, \tilde{Z} \in \mathbb{R}^{H \times W \times D}$, where $H$ and $W$ denote the height and width of the projected features, $D$ denotes the projection dimension, and each stride $S_i$ is computed from the output height $H_i$ of the corresponding convolution layer and the projected height $H$;
(4) the interaction information between the projected features of the convolution layers is modeled by low-rank approximate polynomial kernel pooling; taking the polynomial kernel order r = 2 gives the linear classifier discrimination ability equivalent to a second-order polynomial kernel classifier:

$$\mathcal{F}_{XY} = \tilde{X} \odot \tilde{Y}, \qquad \mathcal{F}_{YZ} = \tilde{Y} \odot \tilde{Z}, \qquad \mathcal{F}_{XZ} = \tilde{X} \odot \tilde{Z}$$

where $\odot$ denotes the tensor dot product and $\mathcal{F}_{XY}$, $\mathcal{F}_{YZ}$, $\mathcal{F}_{XZ}$ denote the cross-layer second-order polynomial feature tensors;
(5) global average pooling aggregates the feature information of all spatial positions in each channel of the second-order polynomial feature tensors, further compressing the feature dimensionality into polynomial feature vectors:

$$f_{XY} = \frac{1}{HW} \sum_{s \in S} \mathcal{F}_{XY}(s), \qquad f_{YZ} = \frac{1}{HW} \sum_{s \in S} \mathcal{F}_{YZ}(s), \qquad f_{XZ} = \frac{1}{HW} \sum_{s \in S} \mathcal{F}_{XZ}(s)$$

where $f_{XY}$, $f_{YZ}$ and $f_{XZ}$ denote the cross-layer second-order polynomial feature vectors of the corresponding feature tensors, and $S = \{1, 2, \ldots, HW\}$ denotes the set of all spatial positions of the feature map;
(6) all cross-layer polynomial feature vectors are merged by feature concatenation, yielding the fine-grained image feature description vector:

$$f = [f_{XY};\, f_{YZ};\, f_{XZ}] \in \mathbb{R}^{3D};$$
(7) the image feature description vector is normalized element-wise by signed square-root normalization:

$$\hat{f} = \operatorname{sign}(f)\sqrt{|f|};$$
(8) the image feature description vector is then normalized by $L_2$ regularization:

$$\bar{f} = \hat{f} \,/\, \lVert \hat{f} \rVert_2;$$
(9) the normalized feature description vector $\bar{f}$ is fed into the classification fully-connected layer:

$$\theta = P\bar{f}$$

where $\theta \in \mathbb{R}^k$ denotes the output vector of the classification fully-connected layer, $P \in \mathbb{R}^{k \times 3D}$ denotes its weight parameter matrix, and $k$ denotes the number of target categories;
(10) the probability that the input image belongs to each category is computed with a softmax function:

$$\eta_i = \frac{\exp(\theta_i)}{\sum_{j=1}^{k} \exp(\theta_j)}, \qquad i = 1, 2, \ldots, k$$

where $\eta_i$ denotes the probability that the input image belongs to the i-th category.
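Steps (2) through (10) together form a small classification head on top of the backbone. The following PyTorch sketch summarizes them; the channel counts, strides and class count are illustrative assumptions rather than values fixed by the invention, and the softmax of step (10) is left to the caller:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLayerPolyHead(nn.Module):
    """Cross-layer second-order polynomial kernel pooling head, steps (2)-(10).

    Assumes three backbone feature maps whose strides bring them to a common
    H x W after the 1 x 1 projection convolutions.
    """

    def __init__(self, channels=(512, 1024, 2048), strides=(4, 2, 1),
                 proj_dim=8192, num_classes=200):
        super().__init__()
        self.proj = nn.ModuleList(
            nn.Conv2d(c, proj_dim, kernel_size=1, stride=s)
            for c, s in zip(channels, strides))
        self.fc = nn.Linear(3 * proj_dim, num_classes)  # matrix P of step (9)

    def forward(self, x, y, z):
        px, py, pz = (p(t) for p, t in zip(self.proj, (x, y, z)))  # step (3)
        feats = []
        for a, b in ((px, py), (py, pz), (px, pz)):  # cross-layer pairs, step (4)
            feats.append((a * b).mean(dim=(2, 3)))   # dot product + GAP, step (5)
        f = torch.cat(feats, dim=1)                            # step (6)
        f = torch.sign(f) * torch.sqrt(f.abs() + 1e-12)        # step (7)
        f = F.normalize(f, dim=1)                              # step (8)
        return self.fc(f)                                      # step (9)
```

Given the three feature maps x, y, z of step (2), `head(x, y, z).softmax(dim=1)` yields the per-category probabilities $\eta_i$ of step (10).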
As a further improvement of the invention, the lightweight basic feature extraction network ResNet-G used in step (1) is obtained by replacing the original Bottleneck residual module of the deep residual network ResNet-50 with a novel residual module based on multi-layer aggregated group convolution. The novel residual module splits feature extraction into a convolution branch and an identity-mapping branch; feature information propagates through the two branches in parallel and is finally merged by element-wise addition. Within the convolution branch, the module partially decouples the joint learning of spatial and channel relations performed by conventional convolution, simplifying the convolution operation. With the input and output dimension set to 256, the intermediate dimension set to 64, the number of groups set to g, and the number of output channels of each group's sub-convolution layer set to m, the convolution branch proceeds as follows (a code sketch follows these steps):
(1) a convolution layer with kernel size 1 × 1, 256 input channels and 64 output channels reduces the feature dimension;
(2) the dimension-reduced feature tensor is split into g groups along the channel axis, each group of sub-features holding $\frac{64}{g}$ channels, and the sub-feature groups are numbered;
(3) group 1 is copied: one copy is kept for the subsequent feature concatenation, the other is superposed onto group 2, and a 3 × 3 convolution with $\frac{64}{g}$ input channels and m output channels extracts further feature information;
(4) the output features are copied again: one copy is kept for the concatenation, the other is superposed onto group 3 and passed through a convolution layer with kernel size 3 × 3, $\frac{64}{g}$ input channels and m output channels;
(5) this continues in turn until all g groups have passed through a 3 × 3 convolution; the retained copy of each group is then concatenated with the convolution features of the g-th group, yielding 64-dimensional feature information;
(6) a 1 × 1 convolution with 64 input channels and 256 output channels restores the 64-dimensional features to the original dimension of 256.
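One plausible PyTorch realization of this convolution branch is sketched below, reading the "superposition" between groups as element-wise addition (which ties m to 64/g, as in the g = 4, m = 16 example of Fig. 2); class and variable names are illustrative:

```python
import torch
import torch.nn as nn

class MultiLayerGroupResidual(nn.Module):
    """Sketch of the novel residual module with multi-layer aggregated groups."""

    def __init__(self, channels: int = 256, width: int = 64, groups: int = 4):
        super().__init__()
        assert width % groups == 0
        self.gw = width // groups                      # channels per group, 64/g
        self.reduce = nn.Conv2d(channels, width, kernel_size=1)   # step (1)
        self.convs = nn.ModuleList(                    # one 3x3 conv per group 2..g
            nn.Conv2d(self.gw, self.gw, kernel_size=3, padding=1)
            for _ in range(groups - 1))
        self.expand = nn.Conv2d(width, channels, kernel_size=1)   # step (6)

    def forward(self, x):
        y = self.reduce(x)
        groups = torch.split(y, self.gw, dim=1)        # step (2)
        outs, prev = [groups[0]], groups[0]
        for conv, grp in zip(self.convs, groups[1:]):  # steps (3)-(5)
            prev = conv(prev + grp)                    # superpose copy, then 3x3 conv
            outs.append(prev)
        out = self.expand(torch.cat(outs, dim=1))      # cascade back, expand to 256
        return out + x                                 # identity-mapping branch
```

Stacking such modules in place of the Bottleneck blocks of ResNet-50 yields the ResNet-G backbone described above.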
As a further improvement of the invention, the low-rank approximate polynomial kernel pooling in step (4) fits the homogeneous polynomial kernel of a support vector machine via tensor decomposition, so that a linear classifier attains discrimination ability equivalent to a high-order polynomial kernel classifier. Specifically, to obtain the performance of an r-order polynomial kernel classifier, r mutually independent convolution layers, each with kernel size 1 × 1, C input channels and $D_r$ output channels, are first combined into an r-order polynomial convolution module. A preprocessed fine-grained image is fed to the lightweight backbone ResNet-G to obtain the convolution feature tensor $X \in \mathbb{R}^{H \times W \times C}$, which the r-order polynomial convolution module maps linearly into the set of projection feature tensors $\{\tilde{X}^{(1)}, \tilde{X}^{(2)}, \ldots, \tilde{X}^{(r)}\}$ with $\tilde{X}^{(i)} \in \mathbb{R}^{H \times W \times D_r}$. Finally, the r projection tensors are combined by the tensor dot product to obtain the high-order statistics of the features:

$$\mathcal{F} = \tilde{X}^{(1)} \odot \tilde{X}^{(2)} \odot \cdots \odot \tilde{X}^{(r)}$$

where $\odot$ denotes the tensor dot product and $\mathcal{F}$ denotes the approximate r-order statistics of the convolution features X output by the backbone network.
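A sketch of this general r-order polynomial convolution module, with the element-wise product standing in for the tensor dot product; names are illustrative:

```python
import torch
import torch.nn as nn

class PolyKernelPooling(nn.Module):
    """Low-rank approximation of an r-order homogeneous polynomial kernel."""

    def __init__(self, in_channels: int, proj_dim: int, order: int = 2):
        super().__init__()
        # r mutually independent 1 x 1 projection convolutions
        self.proj = nn.ModuleList(
            nn.Conv2d(in_channels, proj_dim, kernel_size=1)
            for _ in range(order))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.proj[0](x)
        for p in self.proj[1:]:
            out = out * p(x)            # combine the r projections element-wise
        return out.mean(dim=(2, 3))     # global average pooling -> (B, proj_dim)
```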
The beneficial effects of the disclosed lightweight fine-grained image recognition method with cross-layer feature interaction in weakly supervised scenes are as follows. Multi-layer aggregated group convolution replaces conventional convolution, partially decoupling the joint learning of spatial and channel relations of the feature information to build a novel residual module, which is embedded directly into the deep residual network ResNet-50 framework to lighten the backbone, simplifying convolution operations and reducing the backbone's parameter count. Low-rank approximate polynomial kernel pooling reduces the dimensionality of the image feature description vector, further compressing the storage space and computational cost of the classification fully-connected layer; it also gives the linear classifier discrimination ability equivalent to a high-order polynomial kernel classifier, effectively improving the model's capacity to fit complex feature distributions. Finally, the cross-layer feature interaction network framework fuses the interaction information of the backbone's layers, combining feature diversity to strengthen feature expression and learning, improving the generalization of the whole model and reducing the risk of overfitting; the comprehensive performance in recognition accuracy, computational complexity and technical feasibility is currently at the leading level.
Drawings
FIG. 1 is a schematic overall framework of the present invention;
FIG. 2 is a schematic diagram of a novel residual module according to the present invention;
FIG. 3 is a graph of the impact of polynomial order and projection dimension on identification accuracy in accordance with the present invention;
FIG. 4 shows visualizations of convolution-layer output features under the low-rank approximate polynomial kernel pooling of the present invention.
Detailed Description
The invention is described in further detail below with reference to the following detailed description and accompanying drawings:
the invention provides a light-weight fine-grained image recognition method for cross-layer feature interaction in a weak supervision scene, and solves the technical problem that only an image-level label is used for constructing a fine-grained recognition model, so that the storage space and the calculation cost of the model are reduced while higher recognition accuracy is obtained, and the method is suitable for large-scale real scenes.
As shown in fig. 1, the lightweight fine-grained image recognition method based on cross-layer feature interaction in a weakly supervised scene comprises the following steps:
step 1: in the preprocessing stage, an original image with any size is uniformly scaled to 600 × 600 pixels, a 448 × 448 pixel region is cut out on the basis of taking the center of the image as an origin, the cut region is normalized according to the mean value [0.485,0.456,0.406] and the standard deviation [0.229,0.224,0.225], and then the normalized image is input into a lightweight basic feature extraction network ResNet-G.
Step 2: the input image is passed through three different convolution layers of the lightweight backbone ResNet-G, producing the feature tensors $X \in \mathbb{R}^{H_1 \times W_1 \times C_1}$, $Y \in \mathbb{R}^{H_2 \times W_2 \times C_2}$ and $Z \in \mathbb{R}^{H_3 \times W_3 \times C_3}$, where $H_i$, $W_i$ and $C_i$ ($i = 1, 2, 3$) denote the height, width and number of channels of the convolution features.
Step 3: X, Y and Z are passed in parallel through the polynomial convolution module, i.e. three mutually independent 1 × 1 convolutions with stride $S_i$, $C_i$ input channels and $D$ output channels, producing the projection feature tensors $\tilde{X}, \tilde{Y}, \tilde{Z} \in \mathbb{R}^{H \times W \times D}$, where $H$ and $W$ denote the height and width of the projected features, $D$ denotes the projection dimension, and each stride $S_i$ is computed from the output height $H_i$ of the corresponding convolution layer and the projected height $H$.
Step 4: the interaction information between the projected features of the convolution layers is modeled by low-rank approximate polynomial kernel pooling; taking the polynomial kernel order r = 2 gives the linear classifier discrimination ability equivalent to a second-order polynomial kernel classifier:

$$\mathcal{F}_{XY} = \tilde{X} \odot \tilde{Y}, \qquad \mathcal{F}_{YZ} = \tilde{Y} \odot \tilde{Z}, \qquad \mathcal{F}_{XZ} = \tilde{X} \odot \tilde{Z}$$

where $\odot$ denotes the tensor dot product and $\mathcal{F}_{XY}$, $\mathcal{F}_{YZ}$, $\mathcal{F}_{XZ}$ denote the cross-layer second-order polynomial feature tensors.
Step 5: global average pooling aggregates the feature information of all spatial positions in each channel of the second-order polynomial feature tensors, further compressing the feature dimensionality into polynomial feature vectors:

$$f_{XY} = \frac{1}{HW} \sum_{s \in S} \mathcal{F}_{XY}(s), \qquad f_{YZ} = \frac{1}{HW} \sum_{s \in S} \mathcal{F}_{YZ}(s), \qquad f_{XZ} = \frac{1}{HW} \sum_{s \in S} \mathcal{F}_{XZ}(s)$$

where $f_{XY}$, $f_{YZ}$ and $f_{XZ}$ denote the cross-layer second-order polynomial feature vectors of the corresponding feature tensors, and $S = \{1, 2, \ldots, HW\}$ denotes the set of all spatial positions of the feature map.
Step 6: all cross-layer polynomial feature vectors are merged by feature concatenation, yielding the fine-grained image feature description vector $f = [f_{XY};\, f_{YZ};\, f_{XZ}] \in \mathbb{R}^{3D}$.
Step 7: the image feature description vector is normalized element-wise by signed square-root normalization, $\hat{f} = \operatorname{sign}(f)\sqrt{|f|}$.
Step 8: the image feature description vector is then normalized by $L_2$ regularization, $\bar{f} = \hat{f} \,/\, \lVert \hat{f} \rVert_2$.
Step 9: the normalized feature description vector $\bar{f}$ is fed into the classification fully-connected layer, $\theta = P\bar{f}$, where $\theta \in \mathbb{R}^k$ denotes the output vector of the classification fully-connected layer, $P \in \mathbb{R}^{k \times 3D}$ denotes its weight parameter matrix, and $k$ denotes the number of target categories.
Step 10: the probability that the input image belongs to each category is computed with a softmax function:

$$\eta_i = \frac{\exp(\theta_i)}{\sum_{j=1}^{k} \exp(\theta_j)}, \qquad i = 1, 2, \ldots, k$$

where $\eta_i$ denotes the probability that the input image belongs to the i-th category.
Fig. 1 gives the overall framework of the invention. The original image is preprocessed and fed into the lightweight backbone ResNet-G to extract feature information; the low-rank approximate polynomial kernel pooling selects the output activations of three different convolution layers and linearly maps them into three projection feature tensors of identical dimensions. The projected features are then cross-combined across layers, their interaction is measured by the tensor dot product, and global average pooling further compresses the feature dimension, yielding three D-dimensional cross-layer polynomial feature vectors. Next, all polynomial feature vector information is fused by feature concatenation into the fine-grained image feature description vector, which is normalized by element-wise signed square-root normalization and $L_2$ regularization. Finally, the normalized cross-layer polynomial feature description vector is fed into the classification fully-connected layer, and the image class probabilities are computed with a softmax function.
Fig. 2 is a schematic diagram of the novel residual module based on multi-layer aggregated group convolution, with the input and output dimensions set to 256, the number of groups g = 4, and m = 16 output channels per sub-convolution layer. The module splits feature extraction into a convolution branch and an identity-mapping branch; feature information propagates through the two branches in parallel and is finally merged by element-wise addition. In the convolution branch, a convolution layer with kernel size 1 × 1, 256 input channels and 64 output channels first reduces the dimension, and the reduced features are split into 4 groups of 16 channels each at the channel level. Group 1 is then copied: one copy is kept for the subsequent feature concatenation, the other is superposed onto group 2, and a 3 × 3 convolution with 16 input channels and 16 output channels extracts further feature information. The output features are copied again: one copy is kept for the concatenation, the other is superposed onto group 3 and passed through a convolution layer with kernel size 3 × 3, 16 input channels and 16 output channels. This continues until all 4 groups have passed through a 3 × 3 convolution, after which the retained copies are concatenated with the convolution features of the 4th group to give 64-dimensional feature information. Finally, a 1 × 1 convolution with 64 input channels and 256 output channels restores the features to their original dimension.
FIG. 3 shows the influence of the polynomial order r and the projection dimension $D_r$ of the low-rank approximate polynomial kernel pooling on recognition accuracy. The comparison experiments use the CUB-200-2011 fine-grained image dataset, with a ResNet-G backbone (g = 4 groups, m = 18 output channels per sub-convolution layer) as the image feature extractor and the output activation of convolution layer res5_c to approximately model the r-order homogeneous polynomial kernel. For an r-order polynomial convolution module, recognition accuracy improves as the projection dimension $D_r$ grows from 512 to 32768. In particular, for r = 2 the model reaches about 83.0% accuracy at $D_r$ = 512 versus about 86.3% at $D_r$ = 32768, a gain of 3.3%. However, raising $D_r$ from 8192 to 32768 only lifts the r = 2 model from 86.0% to 86.3%, a gain of merely 0.3%, with limited improvement in model performance, while the polynomial feature vector at $D_r$ = 32768 is four times the dimension of that at $D_r$ = 8192. Subsequent experiments therefore linearly map the original convolution features with a polynomial convolution module of 8192 output channels, balancing recognition accuracy against computational complexity. Furthermore, at $D_r$ = 2048 the model with polynomial kernel order r = 2 already reaches 84.9% accuracy, 1.8% above a linear SVM classifier, showing that the low-rank approximate polynomial kernel pooling effectively models fine-grained feature interactions and enriches the image feature description vector with discriminative information. As the polynomial order r increases from 2 to 4, the recognition accuracy of the higher-order polynomial kernel classifier actually drops, because interactions between low-order features are more efficient and reliable. A polynomial kernel of relatively low order therefore suffices to capture the interaction information of fine-grained images from the output feature tensor.
FIG. 4 visualizes convolution-layer output features under the low-rank approximate polynomial kernel pooling of the invention on the CUB-200-2011, FGVC Aircraft and Stanford Cars fine-grained image datasets. Each feature response map is obtained by averaging the feature information over all channels; the feature information of projection layers proj5_a, proj5_b and proj5_c is generated by passing the output activations of convolution layers res5_a, res5_b and res5_c of the backbone ResNet-G through the polynomial convolution module. As the figure shows, for all three fine-grained datasets the method ignores background interference and automatically localizes the semantically strong, discriminative local key regions of the image (the white areas), such as the head, wings and torso of birds in CUB-200-2011, the cockpit, engines and tail stabilizer of airplanes in FGVC Aircraft, and the bumper, headlights and wheels of cars in Stanford Cars. For a single test picture, convolution layers res5_a, res5_b and res5_c provide rough spatial position information of the target and contain some noise; the projection layers proj5_a, proj5_b and proj5_c refine this further, with a certain bias, localizing each key region of the target and extracting its features. The low-rank approximate polynomial kernel pooling then models the interaction between the feature information of different key parts, mining and capturing the latent relations between these local regions, and integrates multiple streams of cross-layer interaction information to perceive the image from the local to the whole, a process consistent with human cognition. The method thus autonomously localizes and perceives each key part of the target in a fine-grained image, which explains how subtle differences between targets of different classes are captured effectively and accurately without explicit detection of the target location, yielding better recognition accuracy.
Table 1 lists ResNet-G experimental results under different hyper-parameter settings and compares them with the ResNet-50 and ResNeXt networks. The parameter g denotes the number of groups of the 3 × 3 convolution layers in the novel residual module, and m denotes the number of output channels of each sub-convolution layer; the recognition model connects the feature tensor extracted by each backbone to a fully-connected layer after global average pooling, then computes the target class probabilities with a softmax function. The data show that ResNet-G effectively compresses model storage by introducing the novel residual module. Notably, although group convolution severs connections between the spatial positions and channels of the feature tensor, it does not necessarily weaken the network's feature extraction ability: the ResNet-G model with hyper-parameters g = 4 and m = 24 reaches 84.0% recognition accuracy, 1.8% above ResNet-50. The novel residual module also uses shortcut connections within the group convolution; these fuse multi-scale, multi-level feature information on the one hand, and on the other let each group gather all channel information of the preceding group, reducing the information loss caused by decoupling the spatial and channel relations. With the number of groups set to 4 and 18 output channels per sub-convolution layer, ResNet-G reaches 83.1% accuracy, 0.9% above ResNet-50, while its model storage occupies only 68.8% of ResNet-50's and its computation drops by nearly 30%; subsequent experiments and analyses therefore use this hyper-parameter setting to build the image basic feature extractor. ResNeXt compensates for the information loss of group convolution by widening the input and output channels of the middle 3 × 3 convolution layer, which increases the parameters and computation of the whole network. Under the same hyper-parameters, g = 4 and m = 24, ResNeXt's overall classification accuracy is 83.4%, 0.6% below ResNet-G, while its model storage and computation are 90.16 MB and 16.9 GFLOPs, respectively 8.0% and 15.0% more than ResNet-G. Fusing multi-scale information through shortcut connections inside the group convolution thus markedly strengthens feature expression without increasing convolution-layer parameters, further improving recognition accuracy. Meanwhile, the first group of feature tensors in the novel residual module enters the feature concatenation directly, without any convolution, which reduces the model's parameters and computation. Moreover, in terms of architecture and extensibility, ResNet-G builds the whole network by stacking novel residual modules with an identical topology, a process involving only two kinds of hyper-parameters.
By contrast, the currently mainstream Inception-series lightweight networks contain a large number of manually chosen hyper-parameters that must be tuned to the data distribution, increasing the design burden. In summary, the ResNet-G backbone based on the novel residual module performs remarkably in architecture, feature learning and computational complexity.
TABLE 1 Performance comparison of the ResNet, ResNeXt and ResNet-G backbone networks
(table available only as an image in the original publication)
Table 2 compares the classification-layer complexity of recognition models using the low-rank approximate polynomial kernel pooling of the invention against two other pooling schemes, where H, W, C and k denote the height, width and channel count of the feature tensor and the number of target categories, respectively. The numbers in brackets are typical values for the three pooling methods on the CUB-200-2011 fine-grained recognition task. The bilinear pooling of Bilinear CNN (B-CNN) captures correlations between feature channels, which raises the feature description vector dimension to $C^2$; for the k = 200 classification task, the parameters of the fully-connected layer occupy 200 MB of storage. Boost-CNN improves classification by integrating 9 B-CNNs; with each base learner outputting a $C^2$-dimensional feature vector, Boost-CNN generates $9C^2$-dimensional features during training. For CBP, simulation results show that an 8192-dimensional TS feature has the same representational ability as the 262K-dimensional bilinear feature, compressing the classification-layer parameters by nearly 96.5%; however, its pooling process involves the fast Fourier transform (FFT), which actually slows down its computation.
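The classification-layer sizes quoted here follow from simple arithmetic; the short sketch below reproduces the typical values for C = 512 feature channels and k = 200 classes, assuming float32 storage:

```python
# Parameter counts of the classification fully-connected layer.
C, k = 512, 200

bilinear = C * C * k          # B-CNN: 262,144-dim bilinear descriptor
boosted = 9 * bilinear        # Boost-CNN: ensemble of 9 base learners
compact = 8192 * k            # CBP: 8,192-dim Tensor Sketch feature

for name, params in [("B-CNN", bilinear), ("Boost-CNN", boosted), ("CBP", compact)]:
    print(f"{name}: {params / 1e6:.1f}M params, {params * 4 / 2**20:.1f} MB")
# B-CNN: 52.4M params, 200.0 MB; CBP keeps only a few percent of that.
```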
TABLE 2 Computational complexity comparison of multiple pooling schemes
(table available only as an image in the original publication)
Table 3 compares the lightweight cross-layer feature interaction fine-grained recognition method of the invention with mainstream fine-grained recognition methods. The data show that its comprehensive performance in recognition accuracy, model complexity and technical feasibility is currently at the leading level. Two-Level is a fine-grained recognition model based on weak supervision information that trains object-level and part-level classifiers using image-level labels and the spatial positions of target key parts; its parameter count reaches 138.4M, 2.05 times that of the invention, while its recognition accuracy on CUB-200-2011 is only 75.7%, 12.2% below the invention. PN-CNN, a strongly supervised fine-grained recognition model, adds a pose-alignment operation on top of Part-based R-CNN and attains 85.4% accuracy on CUB-200-2011. PN-CNN extracts features of the whole target region, the head region and the torso region with three mutually independent AlexNet networks and fuses the feature groups by concatenation, so that the final feature description vector carries information on both the target region and the local key regions. The three backbones have different parameters and each contains a separate fully-connected layer, raising the overall parameter count to 173.0M. Following the same idea, Mask-CNN also perceives the global and local feature information of the target with several mutually independent sub-networks; unlike PN-CNN, it first applies global average pooling and max pooling to the convolution features to reduce the feature dimension, then predicts the target category by concatenating the feature information of all sub-networks. This operation markedly reduces the parameters and computation of the fully-connected layer: with a VGG16 backbone, Mask-CNN reaches 85.7% accuracy, 0.3% above PN-CNN, with only 60.5M parameters, a compression of nearly 65.0%. The recognition accuracy of the invention is 87.9%, 0.6% above the ResNet-50-based Mask-CNN, yet its parameters are only 77.8% of Mask-CNN's. Moreover, Mask-CNN requires the spatial positions of key parts in addition to image-level category labels during training, so it trails the invention in recognition accuracy, computational complexity and technical feasibility alike. RA-CNN is a recurrent self-attention fine-grained recognition model composed of a three-branch network, each sub-network containing a classification module and an attention proposal network (APN) module.
RA-CNN progressively enlarges local regions through the APN so that the model gradually focuses on the target's key parts during training, obtaining recognition accuracies of 85.3%, 88.2% and 92.5% on the CUB-200-2011, FGVC Aircraft and Stanford Cars datasets, respectively 2.6%, 3.7% and 1.6% below the invention. Because RA-CNN contains a three-branch network trained serially, its parameters reach 429.0M, 6.36 times those of the invention. MA-CNN is a weakly supervised fine-grained recognition model built on a single backbone. It uses a channel grouping module to generate attention regions autonomously, extracts the corresponding feature information, and feeds each attention region into a separate fully-connected layer. With 4 attention regions, MA-CNN's parameters reach 144M, 2.14 times those of the invention. MA-CNN also trains the channel grouping module and the classification module alternately, a more cumbersome scheme that easily falls into local optima, whereas the invention updates its model parameters in an end-to-end training manner.
TABLE 3 Performance comparison of the present invention with classical fine-grained image recognition models
(table available only as an image in the original publication)
The above description is only a preferred embodiment of the present invention and is not intended to limit the invention in any way; any modification or equivalent variation made according to the technical spirit of the invention falls within the scope of the invention as claimed.

Claims (3)

1. A lightweight fine-grained image recognition method with cross-layer feature interaction in a weakly supervised scene, characterized by comprising the following steps:
(1) in the preprocessing stage, the original image of arbitrary size is scaled to 600 × 600 pixels, a 448 × 448 pixel region is cropped about the image center, the cropped region is normalized with mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225], and the normalized image is fed into the lightweight basic feature extraction network ResNet-G;
(2) the input image is passed through three different convolution layers of the lightweight backbone ResNet-G, producing the feature tensors $X \in \mathbb{R}^{H_1 \times W_1 \times C_1}$, $Y \in \mathbb{R}^{H_2 \times W_2 \times C_2}$ and $Z \in \mathbb{R}^{H_3 \times W_3 \times C_3}$, where $H_i$, $W_i$ and $C_i$ ($i = 1, 2, 3$) denote the height, width and number of channels of the convolution features;
(3) X, Y and Z are passed in parallel through the polynomial convolution module, i.e. three mutually independent 1 × 1 convolutions with stride $S_i$, $C_i$ input channels and $D$ output channels, producing the projection feature tensors $\tilde{X}, \tilde{Y}, \tilde{Z} \in \mathbb{R}^{H \times W \times D}$, where $H$ and $W$ denote the height and width of the projected features, $D$ denotes the projection dimension, and each stride $S_i$ is computed from the output height $H_i$ of the corresponding convolution layer and the projected height $H$;
(4) the interaction information between the projected features of the convolution layers is modeled by low-rank approximate polynomial kernel pooling; taking the polynomial kernel order r = 2 gives the linear classifier discrimination ability equivalent to a second-order polynomial kernel classifier:

$$\mathcal{F}_{XY} = \tilde{X} \odot \tilde{Y}, \qquad \mathcal{F}_{YZ} = \tilde{Y} \odot \tilde{Z}, \qquad \mathcal{F}_{XZ} = \tilde{X} \odot \tilde{Z}$$

where $\odot$ denotes the tensor dot product and $\mathcal{F}_{XY}$, $\mathcal{F}_{YZ}$, $\mathcal{F}_{XZ}$ denote the cross-layer second-order polynomial feature tensors;
(5) global average pooling aggregates the feature information of all spatial positions in each channel of the second-order polynomial feature tensors, further compressing the feature dimensionality into polynomial feature vectors:

$$f_{XY} = \frac{1}{HW} \sum_{s \in S} \mathcal{F}_{XY}(s), \qquad f_{YZ} = \frac{1}{HW} \sum_{s \in S} \mathcal{F}_{YZ}(s), \qquad f_{XZ} = \frac{1}{HW} \sum_{s \in S} \mathcal{F}_{XZ}(s)$$

where $f_{XY}$, $f_{YZ}$ and $f_{XZ}$ denote the cross-layer second-order polynomial feature vectors of the corresponding feature tensors, and $S = \{1, 2, \ldots, HW\}$ denotes the set of all spatial positions of the feature map;
(6) all cross-layer polynomial feature vectors are merged by feature concatenation, yielding the fine-grained image feature description vector:

$$f = [f_{XY};\, f_{YZ};\, f_{XZ}] \in \mathbb{R}^{3D};$$
(7) the image feature description vector is normalized element-wise by signed square-root normalization:

$$\hat{f} = \operatorname{sign}(f)\sqrt{|f|};$$
(8) the image feature description vector is then normalized by $L_2$ regularization:

$$\bar{f} = \hat{f} \,/\, \lVert \hat{f} \rVert_2;$$
(9) the normalized feature description vector $\bar{f}$ is fed into the classification fully-connected layer:

$$\theta = P\bar{f}$$

where $\theta \in \mathbb{R}^k$ denotes the output vector of the classification fully-connected layer, $P \in \mathbb{R}^{k \times 3D}$ denotes its weight parameter matrix, and $k$ denotes the number of target categories;
(10) the probability that the input image belongs to each category is computed with a softmax function:

$$\eta_i = \frac{\exp(\theta_i)}{\sum_{j=1}^{k} \exp(\theta_j)}, \qquad i = 1, 2, \ldots, k$$

where $\eta_i$ denotes the probability that the input image belongs to the i-th category.
2. The lightweight fine-grained image recognition method with cross-layer feature interaction in a weakly supervised scene according to claim 1, characterized in that the lightweight basic feature extraction network ResNet-G used in step (1) is obtained by replacing the original Bottleneck residual module of the deep residual network ResNet-50 with a novel residual module based on multi-layer aggregated group convolution. The novel residual module splits feature extraction into a convolution branch and an identity-mapping branch; feature information propagates through the two branches in parallel and is finally merged by element-wise addition. Within the convolution branch, the module partially decouples the joint learning of spatial and channel relations performed by conventional convolution, simplifying the convolution operation. With the input and output dimension set to 256, the intermediate dimension set to 64, the number of groups set to g, and the number of output channels of each group's sub-convolution layer set to m, the convolution branch proceeds as follows:
(1) a convolution layer with kernel size 1 × 1, 256 input channels and 64 output channels reduces the feature dimension;
(2) the dimension-reduced feature tensor is split into g groups along the channel axis, each group of sub-features holding $\frac{64}{g}$ channels, and the sub-feature groups are numbered;
(3) group 1 is copied: one copy is kept for the subsequent feature concatenation, the other is superposed onto group 2, and a 3 × 3 convolution with $\frac{64}{g}$ input channels and m output channels extracts further feature information;
(4) the output features are copied again: one copy is kept for the concatenation, the other is superposed onto group 3 and passed through a convolution layer with kernel size 3 × 3, $\frac{64}{g}$ input channels and m output channels;
(5) this continues in turn until all g groups have passed through a 3 × 3 convolution; the retained copy of each group is then concatenated with the convolution features of the g-th group, yielding 64-dimensional feature information;
(6) a 1 × 1 convolution with 64 input channels and 256 output channels restores the 64-dimensional features to the original dimension of 256.
3. The lightweight fine-grained image recognition method with cross-layer feature interaction in a weakly supervised scene according to claim 1, characterized in that the low-rank approximate polynomial kernel pooling in step (4) fits the homogeneous polynomial kernel of a support vector machine via tensor decomposition, so that a linear classifier attains discrimination ability equivalent to a high-order polynomial kernel classifier. Specifically, to obtain the performance of an r-order polynomial kernel classifier, r mutually independent convolution layers, each with kernel size 1 × 1, C input channels and $D_r$ output channels, are first combined into an r-order polynomial convolution module; a preprocessed fine-grained image is fed to the lightweight backbone ResNet-G to obtain the convolution feature tensor $X \in \mathbb{R}^{H \times W \times C}$, which the r-order polynomial convolution module maps linearly into the set of projection feature tensors $\{\tilde{X}^{(1)}, \tilde{X}^{(2)}, \ldots, \tilde{X}^{(r)}\}$ with $\tilde{X}^{(i)} \in \mathbb{R}^{H \times W \times D_r}$; finally, the r projection tensors are combined by the tensor dot product to obtain the high-order statistics of the features:

$$\mathcal{F} = \tilde{X}^{(1)} \odot \tilde{X}^{(2)} \odot \cdots \odot \tilde{X}^{(r)}$$

where $\odot$ denotes the tensor dot product and $\mathcal{F}$ denotes the approximate r-order statistics of the convolution features X output by the backbone network.
CN202010505152.2A 2020-04-21 2020-06-05 Lightweight fine-grained image identification method for cross-layer feature interaction in weak supervision scene Active CN111652236B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010317205 2020-04-21
CN2020103172058 2020-04-21

Publications (2)

Publication Number Publication Date
CN111652236A 2020-09-11
CN111652236B CN111652236B (en) 2022-04-29

Family

ID=72347337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010505152.2A Active CN111652236B (en) 2020-04-21 2020-06-05 Lightweight fine-grained image identification method for cross-layer feature interaction in weak supervision scene

Country Status (1)

Country Link
CN (1) CN111652236B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409384A (en) * 2018-09-30 2019-03-01 内蒙古科技大学 Image recognition method, device, medium and equipment based on fine-grained images
CN109359684A (en) * 2018-10-17 2019-02-19 苏州大学 Fine-grained model recognition method based on weakly supervised localization and subclass similarity measurement
CN110147841A (en) * 2019-05-22 2019-08-20 桂林电子科技大学 Fine-grained classification method based on weakly supervised and unsupervised part detection and segmentation
CN110378356A (en) * 2019-07-16 2019-10-25 北京中科研究院 Fine-grained image recognition method based on multi-objective Lagrangian regularization
CN110689091A (en) * 2019-10-18 2020-01-14 中国科学技术大学 Weak supervision fine-grained object classification method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
李思瑶 et al., "Fine-grained image classification with multi-scale feature fusion" (多尺度特征融合的细粒度图像分类), Laser & Optoelectronics Progress (《激光与光电子学进展》) *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101547B (en) * 2020-09-14 2024-04-16 中国科学院上海微系统与信息技术研究所 Pruning method and device for network model, electronic equipment and storage medium
CN112101547A (en) * 2020-09-14 2020-12-18 中国科学院上海微系统与信息技术研究所 Pruning method and device for network model, electronic equipment and storage medium
CN112183602A (en) * 2020-09-22 2021-01-05 天津大学 Multi-layer feature fusion fine-grained image classification method with parallel convolution blocks
WO2022095584A1 (en) * 2020-11-06 2022-05-12 神思电子技术股份有限公司 Image recognition method based on stream convolution
CN112465118A (en) * 2020-11-26 2021-03-09 大连理工大学 Low-rank generation type countermeasure network construction method for medical image generation
CN112465118B (en) * 2020-11-26 2022-09-16 大连理工大学 Low-rank generation type countermeasure network construction method for medical image generation
CN112381061A (en) * 2020-12-04 2021-02-19 中国科学院大学 Facial expression recognition method and system
CN112381061B (en) * 2020-12-04 2022-07-12 中国科学院大学 Facial expression recognition method and system
CN112686242A (en) * 2020-12-29 2021-04-20 昆明理工大学 Fine-grained image classification method based on multilayer focusing attention network
CN112507982A (en) * 2021-02-02 2021-03-16 成都东方天呈智能科技有限公司 Cross-model conversion system and method for face feature codes
CN113222998B (en) * 2021-04-13 2022-05-31 天津大学 Semi-supervised image semantic segmentation method and device based on self-supervised low-rank network
CN113222998A (en) * 2021-04-13 2021-08-06 天津大学 Semi-supervised image semantic segmentation method and device based on self-supervised low-rank network
CN113240659B (en) * 2021-05-26 2022-02-25 广州天鹏计算机科技有限公司 Heart nuclear magnetic resonance image lesion structure extraction method based on deep learning
CN113240659A (en) * 2021-05-26 2021-08-10 广州天鹏计算机科技有限公司 Image feature extraction method based on deep learning
CN113327284A (en) * 2021-05-27 2021-08-31 北京百度网讯科技有限公司 Image recognition method and device, electronic equipment and storage medium
CN113327284B (en) * 2021-05-27 2022-08-26 北京百度网讯科技有限公司 Image recognition method and device, electronic equipment and storage medium
CN113441411A (en) * 2021-07-31 2021-09-28 北京五指术健康科技有限公司 Rubbish letter sorting equipment based on augmented reality
CN113343991B (en) * 2021-08-02 2023-06-09 四川新网银行股份有限公司 Weak supervision learning method with enhanced characteristics
CN113343991A (en) * 2021-08-02 2021-09-03 四川新网银行股份有限公司 Feature-enhanced weak supervised learning method
CN114745465A (en) * 2022-03-24 2022-07-12 马斌斌 Interactive noise self-prior sensing analysis system for smart phone
CN114898775A (en) * 2022-04-24 2022-08-12 中国科学院声学研究所南海研究站 Voice emotion recognition method and system based on cross-layer cross fusion
CN116503671A (en) * 2023-06-25 2023-07-28 电子科技大学 Image classification method based on residual network compression of effective rank tensor approximation
CN116503671B (en) * 2023-06-25 2023-08-29 电子科技大学 Image classification method based on residual network compression of effective rank tensor approximation

Also Published As

Publication number Publication date
CN111652236B (en) 2022-04-29

Similar Documents

Publication Publication Date Title
CN111652236B (en) Lightweight fine-grained image identification method for cross-layer feature interaction in weak supervision scene
Yu et al. Hierarchical bilinear pooling for fine-grained visual recognition
Wang et al. Deep learning algorithms with applications to video analytics for a smart city: A survey
CN106599797B (en) A kind of infrared face recognition method based on local parallel neural network
Ji et al. 3D convolutional neural networks for human action recognition
CN112307995B (en) Semi-supervised pedestrian re-identification method based on feature decoupling learning
US10776628B2 (en) Video action localization from proposal-attention
US11223782B2 (en) Video processing using a spectral decomposition layer
US11270425B2 (en) Coordinate estimation on n-spheres with spherical regression
CN111460980A (en) Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion
CN113159067A (en) Fine-grained image identification method and device based on multi-grained local feature soft association aggregation
CN114596589A (en) Domain-adaptive pedestrian re-identification method based on interactive cascade lightweight transformations
John et al. Real-time hand posture and gesture-based touchless automotive user interface using deep learning
Yuan et al. Few-shot scene classification with multi-attention deepemd network in remote sensing
CN114998638A (en) Multi-view three-dimensional point cloud classification method based on dynamic and static convolution fusion neural network
Munoz. Inference machines: Parsing scenes via iterated predictions
Khellal et al. Pedestrian classification and detection in far infrared images
Qian et al. Classification of rice seed variety using point cloud data combined with deep learning
Van Hoai et al. Feeding Convolutional Neural Network by hand-crafted features based on Enhanced Neighbor-Center Different Image for color texture classification
CN114492634A (en) Fine-grained equipment image classification and identification method and system
Xin et al. Random part localization model for fine grained image classification
Özyurt et al. A new method for classification of images using convolutional neural network based on Dwt-Svd perceptual hash function
Sang et al. Image recognition based on multiscale pooling deep convolution neural networks
Trimech et al. Data augmentation using non-rigid cpd registration for 3d facial expression recognition
Khan et al. Texture gradient and deep features fusion-based image scene geometry recognition system using extreme learning machine

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant