CN111652236B: Lightweight fine-grained image identification method for cross-layer feature interaction in a weak supervision scene

Info

Publication number: CN111652236B
Application number: CN202010505152.2A
Authority: CN (China)
Prior art keywords: feature, convolution, layer, image, tensor
Legal status: Active (assumed; not a legal conclusion)
Other languages: Chinese (zh)
Other versions: CN111652236A (application publication)
Inventors: 李春国, 刘杨, 杨哲, 胡健, 杨绿溪, 徐琴珍
Current and original assignee: Southeast University
Application filed by Southeast University; application granted, with publication of CN111652236A (application) and CN111652236B (grant)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/32: Normalisation of the pattern dimensions
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F 18/25: Fusion techniques
    • G06F 18/253: Fusion techniques of extracted features
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks


Abstract

In the method, multi-layer aggregation grouped convolution replaces conventional convolution to construct a novel residual module, which is embedded directly into a deep residual network framework to obtain a lightweight base network. Interaction among features is then modeled by computationally efficient, low-rank approximate polynomial kernel pooling, which compresses the dimension of the feature description vector and reduces the storage occupation and computation cost of the classification fully-connected layer; at the same time, this pooling scheme gives the linear classifier discrimination capability equivalent to that of a high-order polynomial kernel classifier, which markedly improves recognition accuracy. Finally, a cross-layer feature interaction network framework combines feature diversity, strengthening feature learning and expression capability while reducing the risk of overfitting. The comprehensive performance of this lightweight fine-grained image recognition method with cross-layer feature interaction in a weak supervision scene, in terms of recognition accuracy, computational complexity and technical feasibility, is at the current leading level.

Description

Lightweight fine-grained image identification method for cross-layer feature interaction in weak supervision scene
Technical Field
The invention belongs to the field of computer vision and relates to a method for identifying fine-grained images using only image-level label weak supervision information, combining low-rank approximate polynomial kernel pooling with a cross-layer feature interaction network framework; in particular, it relates to a lightweight fine-grained image identification method with cross-layer feature interaction in a weak supervision scene.
Background
With the rapid development of internet technology, human society has entered the information age, and the total amount of data stored in networks as text, images, speech, video and other media has grown exponentially. Image data is vivid and intuitive, is not limited by region or language, has gradually become a mainstream information carrier, and offers broad application prospects and practical research significance. Meanwhile, advances in parallel computing theory and hardware have made large-scale image processing feasible, fueling research across the computer vision field, including image recognition, object detection and semantic segmentation. Image identification is a fundamental research topic in computer vision; its main task is to preprocess an acquired image, extract feature information, and build a classifier on that information to judge the target category in the image. In conventional image recognition, the object classes to be recognized are usually coarse-grained, such as pedestrians, cats and dogs, or vehicles. Such cross-species targets show significant appearance differences and no subordination between classes, so they are comparatively easy to identify. In many real applications, however, the target to be identified belongs to a fine-grained category, i.e. to different subclasses under some coarse-grained class, such as different varieties of flowers or various models of automobiles. Compared with coarse-grained image recognition tasks, targets of different subclasses in a fine-grained dataset look highly similar, while targets of the same subclass can differ markedly in appearance owing to pose, viewing angle, occlusion and other factors.
Deep learning image identification technology, which autonomously learns high-level semantic features of images with artificial neural networks trained on massive data, can describe image information from multiple angles and at multiple levels, is highly robust, and has drawn wide attention in academia and industry. A large number of scholars have built deep learning models and applied them to fine-grained image recognition tasks, obtaining preliminary research results. According to the strength of the supervision information a model depends on during training, fine-grained image recognition based on deep learning can be further divided into strongly supervised and weakly supervised fine-grained image recognition. Strongly supervised fine-grained image recognition algorithms achieve high-precision recognition by introducing additional supervision information and relying on complex detection models; however, manually labeled supervision information is expensive to obtain, which limits the application of this technology in large-scale real scenes. Weakly supervised fine-grained image recognition, by contrast, accurately judges the target category using only image-level labels in the model training stage; it is highly practical and extensible and has become the mainstream trend of fine-grained image recognition research at the current stage. The weakly supervised Bilinear CNN extracts image features with two mutually independent base networks and captures pairwise correlation among feature channels through a matrix outer product, obtaining second-order statistics of the convolution features so that a linear classifier has the same discrimination capability as a second-order polynomial kernel classifier (see T. Lin, A. RoyChowdhury, S. Maji. Bilinear CNN Models for Fine-Grained Visual Recognition, 2015). Improved B-CNN performs square-root normalization on the bilinear feature description matrix to compress the dynamic range of the feature values and combines it with L2 regularization and similar measures to further improve model stability (see T. Lin. Improved Bilinear Pooling with CNNs, 2017). Boost-CNN, borrowing the idea of ensemble learning, combines multiple Bilinear CNNs with weak classification capability in a boosting manner and solves a least-squares objective to determine the weight coefficient of each base learner, constructing a strong classifier (see M. Moghimi. Boosted Convolutional Neural Networks, 2016). CBP fits a second-order polynomial kernel using two approximation algorithms, Random Maclaurin (RM) and Tensor Sketch (TS), so that an 8192-dimensional TS feature has the same expressive power as a 262K-dimensional bilinear feature (see Y. Gao. Compact Bilinear Pooling, 2016). However, information is lost during the forward propagation of a convolutional neural network: Bilinear CNN and its variant algorithms perform bilinear pooling on the top-layer convolution activations of a deep network, yet features from a single convolution layer are not sufficient to describe the semantics of all key regions of an image, and taking them directly as the reference features loses discriminative information that is significant for fine-grained image identification.
In addition, bilinear pooling uses the matrix outer product operation to capture pairwise correlation among feature channels, which drives a marked improvement in recognition accuracy; however, the operation raises the dimension of the feature description vector to 262K, and the parameter count and computation of the fully-connected layer grow linearly with it. Although CBP can reduce the feature description vector dimension to a certain extent by fitting a second-order polynomial kernel function with the low-dimensional random projection algorithms RM and TS, its computation involves Fourier transforms, which greatly increases the running time.
In summary, for weakly supervised fine-grained image recognition tasks that use only image-level label information, existing methods struggle to achieve high-precision recognition at low model parameter counts and computation. A lightweight fine-grained image recognition method with cross-layer feature interaction that balances recognition accuracy against computational complexity is therefore needed.
Disclosure of Invention
In order to solve the above problems, the invention provides a lightweight fine-grained image recognition method with cross-layer feature interaction in a weak supervision scene. The technical problem to be solved is to build a fine-grained recognition model using only image-level labels, reducing the model's storage space and computation cost while obtaining high recognition accuracy, so that the model suits large-scale real scenes. To this end, the method comprises the following steps:
(1) in the preprocessing stage, an original image of arbitrary size is uniformly scaled to 600×600 pixels, a 448×448-pixel region is cropped about the image center, the cropped region is normalized with the mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225], and the normalized image is then input to the lightweight base feature extraction network ResNet-G;
(2) the feature tensors output by three different convolution layers of the lightweight base network ResNet-G for the input image are, respectively,

$$X \in \mathbb{R}^{H_1\times W_1\times C_1}, \quad Y \in \mathbb{R}^{H_2\times W_2\times C_2}, \quad Z \in \mathbb{R}^{H_3\times W_3\times C_3}$$

where $H_i$, $W_i$ and $C_i$ ($i = 1, 2, 3$) respectively denote the height, width and number of channels of the convolution features;
(3) $X$, $Y$ and $Z$ are passed in parallel through three polynomial convolution modules, 1×1 convolutions with stride $S_i$, $C_i$ input channels and $D$ output channels, which perform mutually independent linear mappings and generate the projection feature tensors

$$\bar{X}, \bar{Y}, \bar{Z} \in \mathbb{R}^{H\times W\times D}$$

where $H$ and $W$ respectively denote the height and width of the projected features and $D$ denotes the projection dimension; the convolution stride $S_i$ is computed from the height $H_i$ of each convolution layer's output feature tensor and the height $H$ of the projected feature tensor;
(4) the interaction information among the projection features of the convolution layers is modeled by low-rank approximate polynomial kernel pooling, the polynomial kernel order being taken as $r = 2$ so that the linear classifier has discrimination capability equivalent to a second-order polynomial kernel classifier:

$$\mathcal{P}^{XY} = \bar{X} \circ \bar{Y}, \qquad \mathcal{P}^{XZ} = \bar{X} \circ \bar{Z}, \qquad \mathcal{P}^{YZ} = \bar{Y} \circ \bar{Z}$$

where $\circ$ denotes the tensor dot-product operation and $\mathcal{P}^{XY}$, $\mathcal{P}^{XZ}$ and $\mathcal{P}^{YZ} \in \mathbb{R}^{H\times W\times D}$ denote the cross-layer second-order polynomial feature tensors;
(5) the feature information of all spatial positions in each channel of the second-order polynomial feature tensors is aggregated by a global average pooling operation to obtain polynomial feature vectors, further compressing the feature vector dimensionality:

$$p^{XY} = \frac{1}{HW}\sum_{s\in S}\mathcal{P}^{XY}_{s}, \qquad p^{XZ} = \frac{1}{HW}\sum_{s\in S}\mathcal{P}^{XZ}_{s}, \qquad p^{YZ} = \frac{1}{HW}\sum_{s\in S}\mathcal{P}^{YZ}_{s}$$

where $p^{XY}$, $p^{XZ}$ and $p^{YZ} \in \mathbb{R}^{D}$ respectively denote the cross-layer second-order polynomial feature vectors of the corresponding feature tensors, and $S = \{1, 2, \dots, HW\}$ denotes the set of all spatial positions of the feature map;
(6) all cross-layer polynomial feature vectors are converged through feature concatenation, outputting the fine-grained image feature description vector

$$\phi = \left[\,p^{XY};\ p^{XZ};\ p^{YZ}\,\right] \in \mathbb{R}^{3D};$$
(7) the image feature description vector is normalized by element-wise signed square-root normalization:

$$\tilde{\phi} = \operatorname{sign}(\phi)\sqrt{\lvert\phi\rvert};$$
(8) the image feature description vector is standardized using $L_2$ regularization:

$$\hat{\phi} = \tilde{\phi}\,/\,\lVert\tilde{\phi}\rVert_2;$$
(9) the standardized feature description vector $\hat{\phi}$ is input to the classification fully-connected layer:

$$\theta = P\hat{\phi}$$

where $\theta \in \mathbb{R}^{k}$ denotes the output vector of the classification fully-connected layer, $P \in \mathbb{R}^{k\times 3D}$ denotes the weight parameter matrix of the classification fully-connected layer, and $k$ denotes the number of target categories;
(10) the probability that the input image belongs to each category is computed with the softmax function:

$$\eta_i = \frac{\exp(\theta_i)}{\sum_{j=1}^{k}\exp(\theta_j)}, \qquad i = 1, 2, \dots, k$$

where $\eta_i$ denotes the probability that the input image belongs to the $i$-th category.
As a further improvement of the invention, the lightweight base feature extraction network ResNet-G used in step (1) is formed by replacing the original Bottleneck residual module of the deep residual network ResNet-50 framework with a novel residual module based on multi-layer aggregation grouped convolution. The novel residual module splits feature extraction into two branches, a convolution-operation branch and an identity mapping; feature information propagates through the two branches in parallel and is finally merged and output by element-wise addition. Within the convolution-operation branch, the module partially decouples the synchronized learning of the spatial and channel relations of the feature tensor performed by conventional convolution, thereby simplifying the convolution operation. With the input and output data dimension set to 256, the intermediate layer dimension set to 64, the number of groups set to g, and the number of output channels of each group's sub-convolution layer set to m, the convolution-operation branch of the novel residual module proceeds as follows (a code sketch of the branch follows the steps below):
(1) a convolution layer with kernel size 1×1, 256 input channels and 64 output channels reduces the feature dimension;
(2) the dimension-reduced feature tensor is divided into g groups at the channel level, the number of channels corresponding to each group of sub-features being 64/g, and all sub-feature groups are numbered;
(3) the 1st group of features generates corresponding copy information, one pass-through path being used for the subsequent feature concatenation operation while the other is superposed onto the 2nd group of features, and feature information is further extracted by a 3×3 convolution with 64/g input channels and m output channels;
(4) the output features again generate corresponding copy information, one pass-through path being used for the feature concatenation while the other is superposed onto the 3rd group of features and passed through a convolution layer with kernel size 3×3, 64/g input channels and m output channels;
(5) and so on in sequence until all g groups of features have passed through a 3×3 convolution operation, after which each group's pass-through features are concatenated with the g-th group's convolution features to obtain 64-dimensional feature information;
(6) a 1 x 1 convolution with an input channel number of 64 and an output channel number of 256 is used to restore the 64-dimensional features to the original dimensions 256.
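A minimal sketch of the convolution-operation branch described above is given below in PyTorch. It assumes the element-wise superposition implies m = 64/g, as in the Fig. 2 setting; the class and variable names are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class NovelResidualModule(nn.Module):
    """Sketch of the multi-layer aggregation grouped-convolution residual
    module: 1x1 reduce -> g-group aggregation with 3x3 convolutions ->
    concatenation -> 1x1 restore, plus an identity-mapping branch."""
    def __init__(self, channels=256, mid=64, g=4, m=16):
        super().__init__()
        assert m == mid // g, "element-wise superposition assumes m == 64/g"
        self.g = g
        self.reduce = nn.Conv2d(channels, mid, kernel_size=1, bias=False)
        # one 3x3 sub-convolution for each group after the first
        self.convs = nn.ModuleList(
            nn.Conv2d(mid // g, m, kernel_size=3, padding=1, bias=False)
            for _ in range(g - 1))
        self.restore = nn.Conv2d(mid // g + (g - 1) * m, channels,
                                 kernel_size=1, bias=False)

    def forward(self, x):
        y = self.reduce(x)                        # step (1): 1x1 dimension reduction
        groups = torch.chunk(y, self.g, dim=1)    # step (2): split into g groups
        outs, prev = [groups[0]], groups[0]       # group 1 goes straight to the concat
        for conv, grp in zip(self.convs, groups[1:]):
            prev = conv(prev + grp)               # steps (3)-(5): superpose, then 3x3 conv
            outs.append(prev)
        y = self.restore(torch.cat(outs, dim=1))  # steps (5)-(6): concatenate, restore
        return y + x                              # identity-mapping branch
```

Under the Fig. 2 setting, `NovelResidualModule(256, 64, g=4, m=16)` maps a (B, 256, H, W) tensor to a tensor of the same shape.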
As a further improvement of the method, in step (4) the low-rank approximate polynomial kernel pooling fits the homogeneous polynomial kernel function of a support vector machine by means of the tensor decomposition idea, so that the linear classifier has discrimination capability equivalent to a high-order polynomial kernel classifier. Specifically, to obtain the performance of an $r$-order polynomial kernel classifier, the low-rank approximate polynomial kernel pooling first combines $r$ mutually independent convolution layers, each with kernel size 1×1, $C$ input channels and $D_r$ output channels, into an $r$-order polynomial convolution module. The preprocessed fine-grained image is then input to the lightweight base network ResNet-G, producing the convolution feature tensor $X \in \mathbb{R}^{H\times W\times C}$, which is linearly mapped by the $r$-order polynomial convolution module to generate the projection feature tensor set $\{\bar{X}^{1}, \bar{X}^{2}, \dots, \bar{X}^{r}\}$, where $\bar{X}^{s} \in \mathbb{R}^{H\times W\times D_r}$. Finally, the $r$ projection feature tensors are combined by the tensor dot-product operation to obtain the high-order statistics of the features

$$\mathcal{X}^{r} = \bar{X}^{1} \circ \bar{X}^{2} \circ \cdots \circ \bar{X}^{r}$$

where $\circ$ denotes the tensor dot-product operation and $\mathcal{X}^{r}$ denotes the approximate $r$-order statistics of the convolution features $X$ output by the base network.
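For illustration, the sketch below realizes the r-order module as r independent 1×1 convolutions whose outputs are fused by an element-wise product, matching the tensor dot-product described above; the class name and default values are assumptions.

```python
import torch.nn as nn

class PolynomialKernelPooling(nn.Module):
    """Sketch of low-rank approximate r-order polynomial kernel pooling:
    r independent 1x1 projections of the same convolution feature tensor,
    fused by an element-wise product, then globally average-pooled."""
    def __init__(self, in_channels, proj_dim, order=2):
        super().__init__()
        self.projs = nn.ModuleList(
            nn.Conv2d(in_channels, proj_dim, kernel_size=1, bias=False)
            for _ in range(order))

    def forward(self, x):              # x: (B, C, H, W)
        z = self.projs[0](x)
        for proj in self.projs[1:]:
            z = z * proj(x)            # element-wise product over the r projections
        return z.mean(dim=(2, 3))      # global average pooling -> (B, D_r)
```

For example, `PolynomialKernelPooling(2048, 8192, order=2)` applied to a res5_c activation yields an 8192-dimensional descriptor, mirroring the $D_r = 8192$ setting discussed in the experiments.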
The lightweight fine-grained image identification method with cross-layer feature interaction in a weak supervision scene disclosed by the invention has the following beneficial effects. Multi-layer aggregation grouped convolution replaces conventional convolution and partially decouples the synchronized learning of the spatial and channel relations of the feature information; the resulting novel residual module is embedded directly into the deep residual network ResNet-50 framework to lighten the base network, simplifying the convolution operation and reducing the base network's parameter count. Meanwhile, low-rank approximate polynomial kernel pooling reduces the dimension of the image feature description vector, further compressing the storage space and computation cost of the classification fully-connected layer. In addition, the low-rank approximate polynomial kernel pooling gives the linear classifier discrimination capability equivalent to a high-order polynomial kernel classifier, effectively improving the model's ability to fit complex feature distributions. Finally, the cross-layer feature interaction network framework fuses the interaction information of the base network's layers and combines feature diversity to strengthen feature expression and learning capability, improving the generalization of the whole model and reducing the risk of overfitting; the comprehensive performance in recognition accuracy, computational complexity and technical feasibility is currently in the leading position.
Drawings
FIG. 1 is a schematic overall framework of the present invention;
FIG. 2 is a schematic diagram of a novel residual module according to the present invention;
FIG. 3 is a graph of the impact of polynomial order and projection dimension on identification accuracy in accordance with the present invention;
FIG. 4 shows visualizations of the convolution-layer output features used by the low-rank approximate polynomial kernel pooling of the present invention.
Detailed Description
The invention is described in further detail below with reference to the following detailed description and accompanying drawings:
the invention provides a light-weight fine-grained image recognition method for cross-layer feature interaction in a weak supervision scene, and solves the technical problem that only an image-level label is used for constructing a fine-grained recognition model, so that the storage space and the calculation cost of the model are reduced while higher recognition accuracy is obtained, and the method is suitable for large-scale real scenes.
As shown in fig. 1, the lightweight fine-grained image recognition method based on cross-layer feature interaction in a weak supervision scene includes the following steps:
Step 1: in the preprocessing stage, an original image of arbitrary size is uniformly scaled to 600×600 pixels, a 448×448-pixel region is cropped about the image center, the cropped region is normalized with the mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225], and the normalized image is then input to the lightweight base feature extraction network ResNet-G.
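In code, this step corresponds to a standard transform chain; the sketch below uses torchvision, an implementation choice assumed here rather than prescribed by the patent.

```python
import torchvision.transforms as T

# Step-1 preprocessing: scale to 600x600, center-crop 448x448, then
# normalize with the per-channel mean and standard deviation given above.
preprocess = T.Compose([
    T.Resize((600, 600)),
    T.CenterCrop(448),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225]),
])
```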
Step 2: the feature tensors output by three different convolution layers of the lightweight base network ResNet-G for the input image are, respectively,

$$X \in \mathbb{R}^{H_1\times W_1\times C_1}, \quad Y \in \mathbb{R}^{H_2\times W_2\times C_2}, \quad Z \in \mathbb{R}^{H_3\times W_3\times C_3}$$

where $H_i$, $W_i$ and $C_i$ ($i = 1, 2, 3$) respectively denote the height, width and number of channels of the convolution features.
Step 3: $X$, $Y$ and $Z$ are passed in parallel through three polynomial convolution modules, 1×1 convolutions with stride $S_i$, $C_i$ input channels and $D$ output channels, which perform mutually independent linear mappings and generate the projection feature tensors

$$\bar{X}, \bar{Y}, \bar{Z} \in \mathbb{R}^{H\times W\times D}$$

where $H$ and $W$ respectively denote the height and width of the projected features and $D$ denotes the projection dimension; the convolution stride $S_i$ is computed from the height $H_i$ of each convolution layer's output feature tensor and the height $H$ of the projected feature tensor.
Step 4: the interaction information among the projection features of the convolution layers is modeled by low-rank approximate polynomial kernel pooling, the polynomial kernel order being taken as $r = 2$ so that the linear classifier has discrimination capability equivalent to a second-order polynomial kernel classifier:

$$\mathcal{P}^{XY} = \bar{X} \circ \bar{Y}, \qquad \mathcal{P}^{XZ} = \bar{X} \circ \bar{Z}, \qquad \mathcal{P}^{YZ} = \bar{Y} \circ \bar{Z}$$

where $\circ$ denotes the tensor dot-product operation and $\mathcal{P}^{XY}$, $\mathcal{P}^{XZ}$ and $\mathcal{P}^{YZ} \in \mathbb{R}^{H\times W\times D}$ denote the cross-layer second-order polynomial feature tensors.
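The projections and pairwise products of steps 3 and 4 can be sketched as follows; the strides and channel counts passed to the constructor stand in for the layer-dependent $S_i$ and $C_i$ and are illustrative assumptions.

```python
import torch.nn as nn

class CrossLayerInteraction(nn.Module):
    """Sketch of steps 3-4: three 1x1 projections onto a common
    H x W x D grid, then pairwise element-wise (tensor dot) products."""
    def __init__(self, c1, c2, c3, d=8192, s1=4, s2=2, s3=1):
        super().__init__()
        self.px = nn.Conv2d(c1, d, kernel_size=1, stride=s1, bias=False)
        self.py = nn.Conv2d(c2, d, kernel_size=1, stride=s2, bias=False)
        self.pz = nn.Conv2d(c3, d, kernel_size=1, stride=s3, bias=False)

    def forward(self, x, y, z):
        xb, yb, zb = self.px(x), self.py(y), self.pz(z)  # all (B, D, H, W)
        return xb * yb, xb * zb, yb * zb                 # cross-layer 2nd-order tensors
```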
Step 5: the feature information of all spatial positions in each channel of the second-order polynomial feature tensors is aggregated by a global average pooling operation to obtain polynomial feature vectors, further compressing the feature vector dimensionality:

$$p^{XY} = \frac{1}{HW}\sum_{s\in S}\mathcal{P}^{XY}_{s}, \qquad p^{XZ} = \frac{1}{HW}\sum_{s\in S}\mathcal{P}^{XZ}_{s}, \qquad p^{YZ} = \frac{1}{HW}\sum_{s\in S}\mathcal{P}^{YZ}_{s}$$

where $p^{XY}$, $p^{XZ}$ and $p^{YZ} \in \mathbb{R}^{D}$ respectively denote the cross-layer second-order polynomial feature vectors of the corresponding feature tensors, and $S = \{1, 2, \dots, HW\}$ denotes the set of all spatial positions of the feature map.
Step 6: all cross-layer polynomial feature vectors are converged through feature concatenation to output the fine-grained image feature description vector

$$\phi = \left[\,p^{XY};\ p^{XZ};\ p^{YZ}\,\right] \in \mathbb{R}^{3D}.$$
Step 7: the image feature description vector is normalized by element-wise signed square-root normalization:

$$\tilde{\phi} = \operatorname{sign}(\phi)\sqrt{\lvert\phi\rvert}.$$
Step 8: the image feature description vector is standardized using $L_2$ regularization:

$$\hat{\phi} = \tilde{\phi}\,/\,\lVert\tilde{\phi}\rVert_2.$$
Step 9: the standardized feature description vector $\hat{\phi}$ is input to the classification fully-connected layer:

$$\theta = P\hat{\phi}$$

where $\theta \in \mathbb{R}^{k}$ denotes the output vector of the classification fully-connected layer, $P \in \mathbb{R}^{k\times 3D}$ denotes the weight parameter matrix of the classification fully-connected layer, and $k$ denotes the number of target categories.
Step 10: the probability that the input image belongs to each category is computed with the softmax function:

$$\eta_i = \frac{\exp(\theta_i)}{\sum_{j=1}^{k}\exp(\theta_j)}, \qquad i = 1, 2, \dots, k$$

where $\eta_i$ denotes the probability that the input image belongs to the $i$-th category.
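Steps 5 through 10 reduce to a short head function; the sketch below restates the formulas above directly (the small epsilon inside the square root is a numerical-stability assumption, not part of the patent).

```python
import torch
import torch.nn.functional as F

def classification_head(pxy, pxz, pyz, weight):
    """pxy/pxz/pyz: (B, D, H, W) cross-layer second-order feature tensors;
    weight: (k, 3D) parameter matrix of the classification FC layer."""
    feats = [t.mean(dim=(2, 3)) for t in (pxy, pxz, pyz)]  # step 5: global average pooling
    phi = torch.cat(feats, dim=1)                          # step 6: feature concatenation
    phi = torch.sign(phi) * torch.sqrt(phi.abs() + 1e-12)  # step 7: signed square root
    phi = F.normalize(phi, p=2, dim=1)                     # step 8: L2 regularization
    theta = phi @ weight.t()                               # step 9: FC layer -> (B, k)
    return F.softmax(theta, dim=1)                         # step 10: class probabilities
```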
Fig. 1 gives the general framework of the invention. First, the original image is preprocessed and input to the lightweight base network ResNet-G to extract feature information; the low-rank approximate polynomial kernel pooling selects the output activations of three different convolution layers for linear mapping, generating three projection feature tensors of the same dimension. Second, the projection features are cross-combined across layers, the interaction among features is measured with the tensor dot-product operation, and the feature dimension is further compressed by a global average pooling operation, yielding three D-dimensional cross-layer polynomial feature vectors. Then, all polynomial feature vector information is fused by a feature concatenation operation to generate the fine-grained image feature description vector, which is normalized by element-wise signed square-root normalization and $L_2$ regularization. Finally, the standardized cross-layer polynomial feature description vector is input to the classification fully-connected layer, and the image class probabilities are computed with the softmax function.
Fig. 2 is a schematic structural diagram of the novel residual module based on multi-layer aggregation grouped convolution, with the input and output data dimension set to 256, the number of groups g = 4, and the number of output channels of each sub-convolution layer m = 16. The module splits feature extraction into a convolution-operation branch and an identity-mapping branch; feature information propagates through the two branches in parallel and is finally merged and output by element-wise addition. In the convolution-operation branch, a convolution layer with kernel size 1×1, 256 input channels and 64 output channels first reduces the dimension, and the reduced features are split into 4 groups at the channel level, with 16 channels per group of sub-features. Next, the 1st group of features generates a copy: one pass-through path is kept for the subsequent feature concatenation operation, the other is superposed onto the 2nd group of features, and feature information is further extracted by a 3×3 convolution with 16 input channels and 16 output channels. The output features then generate another copy: one pass-through path goes to the feature concatenation, while the other is superposed onto the 3rd group of features and passed through a convolution layer with kernel size 3×3, 16 input channels and 16 output channels. This continues until all 4 groups of features have passed through a 3×3 convolution, after which each group's pass-through features are concatenated with the 4th group's convolution features to obtain 64-dimensional feature information. Finally, a 1×1 convolution with 64 input channels and 256 output channels restores the features to the original dimension.
FIG. 3 shows the influence of the polynomial order r and the projection dimension $D_r$ of the low-rank approximate polynomial kernel pooling of the present invention on recognition accuracy. The comparison experiment is based on the CUB-200-2011 fine-grained image dataset, uses a base network ResNet-G with group number g = 4 and m = 18 output channels per sub-convolution layer as the image feature extractor, and approximately models the r-order homogeneous polynomial kernel function with the output activation of convolution layer res5_c. For an r-order polynomial convolution module, the model's recognition accuracy improves as the projection dimension $D_r$ grows from 512 to 32768. Specifically, at r = 2, the classification accuracy of the model with $D_r = 512$ is about 83.0%, while that with $D_r = 32768$ is about 86.3%, an increase of 3.3%. However, when $D_r$ increases from 8192 to 32768, the accuracy of the r = 2 model rises only from 86.0% to 86.3%, a gain of 0.3%, so the performance improvement is limited, while the polynomial feature vector at $D_r = 32768$ is four times the dimension of that at $D_r = 8192$. Subsequent experiments therefore linearly map the original convolution features with a polynomial convolution module whose output channel number is 8192, striking a balance between recognition accuracy and computational complexity. Furthermore, at $D_r = 2048$ the recognition model with polynomial kernel order r = 2 reaches 84.9% accuracy, 1.8% higher than a linear SVM classifier, showing that the low-rank approximate polynomial kernel pooling can effectively model fine-grained feature interaction so that the image feature description vector contains more discriminative information. As the polynomial order r increases from 2 to 4, the recognition effect of the higher-order polynomial kernel classifier actually declines, because interaction between low-order features is more efficient and reliable. A polynomial kernel function of relatively low order therefore suffices to capture the interaction information of a fine-grained image from the output feature tensor.
FIG. 4 visualizes selected convolution-layer output features of the low-rank approximate polynomial kernel pooling of the present invention on the CUB-200-2011, FGVC Aircraft and Stanford Cars fine-grained image datasets. Each feature response map is obtained by averaging the feature information across all channels; the feature information of projection layers proj5_a, proj5_b and proj5_c is generated by passing the output activations of convolution layers res5_a, res5_b and res5_c of the base network ResNet-G through the polynomial convolution module. As the figure shows, on all three fine-grained datasets the proposed method automatically locates, while ignoring background interference, the local key regions (white parts) with strong semantics and discriminability in the image, such as the head, wings and trunk of birds in CUB-200-2011, the cockpit, engines and tail stabilizer of aircraft in FGVC Aircraft, and the bumper, headlights and wheels of cars in Stanford Cars. Analyzing a single test picture, convolution layers res5_a, res5_b and res5_c provide rough spatial position information of the target and contain some noise; projection layers proj5_a, proj5_b and proj5_c refine this further, with a certain bias, locating each key region of the target and extracting its features. The low-rank approximate polynomial kernel pooling then models the interaction among the feature information of different key parts, mines and captures the latent relations among these local regions, and integrates multiple streams of cross-layer interaction information to achieve image perception from the local to the whole, a process consistent with human cognition. The method can thus autonomously locate and perceive each key part of the target in a fine-grained image, which explains how slight differences between targets of different classes are captured effectively and accurately without explicitly detecting the target position, and hence how better recognition accuracy is obtained.
Table 1 lists the experimental results of ResNet-G under different hyper-parameter settings and compares them with the ResNet-50 and ResNeXt networks. The parameter g denotes the number of groups of the 3×3 convolution layers in the novel residual module, and m denotes the number of output channels of each sub-convolution layer; the recognition model connects the feature tensor extracted by each base network, after global average pooling, directly to the fully-connected layer and then computes the target class probabilities with a softmax function. The data show that ResNet-G effectively compresses the model's storage space by introducing the novel residual module. Notably, although grouped convolution breaks the connection between the spatial positions and the channels of the feature tensor, it does not necessarily weaken the network's feature extraction capability. In particular, the ResNet-G model with hyper-parameters g = 4 and m = 24 reaches 84.0% recognition accuracy, an improvement of 1.8% even over ResNet-50. The novel residual module also uses a shortcut-connection structure inside the grouped convolution: on the one hand it fuses multi-scale, multi-level feature information, and on the other hand each group's convolution, by collecting all channel information of the previous group, reduces the information loss caused by decoupling the spatial-position and channel relations of the features. With the number of groups set to 4 and m = 18 output channels per sub-convolution layer, ResNet-G reaches 83.1% recognition accuracy, 0.9% higher than ResNet-50, while its model storage occupies only 68.8% of ResNet-50's and its computation drops by nearly 30%; subsequent experiments and analysis therefore use this hyper-parameter setting to construct the image base feature extractor. ResNeXt compensates for the information loss of grouped convolution by increasing the input and output channel numbers of the middle 3×3 convolution layer, so the parameter count and computation of the overall network increase. Under the same hyper-parameter setting, g = 4 and m = 24, the overall classification accuracy of ResNeXt is 83.4%, 0.6% below ResNet-G, while the corresponding model storage and computation are 90.16M and 16.9 GFLOPs, respectively, 8.0% and 15.0% more than ResNet-G. Fusing multi-scale information through shortcut connections inside the grouped convolution thus markedly strengthens feature expression without increasing the parameter count of the convolution layers, further improving the model's recognition accuracy. Meanwhile, the first group of feature tensors in the novel residual module goes directly to the feature concatenation operation with the subsequent groups without any convolution, which reduces the model's parameters and computation. In addition, from the standpoint of network structure and extensibility, ResNet-G can build the whole network by stacking novel residual modules with the same topology, a process that involves setting only two kinds of hyper-parameters.
By contrast, the currently mainstream Inception-series lightweight networks contain a large number of manually set hyper-parameters that must be adjusted and modified according to the data distribution, increasing the design burden. In conclusion, the ResNet-G base network built on the novel residual module performs remarkably in architecture, feature learning and computational complexity.
TABLE 1 Performance comparison of the ResNet, ResNeXt and ResNet-G base networks
Table 2 compares the classification-layer complexity of the recognition model using the low-rank approximate polynomial kernel pooling of the present invention with that of two other pooling schemes, where H, W, C and k denote the height, width and channel number of the features and the number of object classes, respectively. The numbers in brackets are typical values of the three pooling methods on the CUB-200-2011 fine-grained image recognition task. The bilinear pooling of Bilinear CNN (B-CNN) captures the correlation between feature channels at the cost of raising the feature description vector dimension to C²; for the k = 200 classification task, the parameters of the fully-connected layer occupy 200 MB of storage space. Boost-CNN improves the model's classification effect by integrating 9 B-CNNs; if each base learner outputs a C²-dimensional feature vector, Boost-CNN generates 9·C²-dimensional data during training, and the corresponding fully-connected layer parameters occupy over a gigabyte of storage. With projection dimension D = 8192, the model of the present method reaches 86.0% recognition accuracy, 2.0% above B-CNN, while the parameter count of its fully-connected layer is only 3% of B-CNN's and its computation decreases markedly. CBP adopts Tensor Sketch (TS) as a low-dimensional approximation of the matrix outer product operation in bilinear pooling; simulation results show that an 8192-dimensional TS feature has the same characterization capability as a 262K-dimensional bilinear feature, compressing the classification-layer parameters by nearly 96.5%. Notably, since the CBP pooling process involves the fast Fourier transform (FFT), its running speed is comparatively slow: for a 448×448-pixel input image, CBP takes 5.03 ms while bilinear pooling takes only 0.77 ms. The low-rank approximate polynomial kernel pooling therefore achieves model compression and acceleration well without reducing feature expression capability.
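The 200 MB figure for B-CNN's classification layer and the roughly 3% ratio quoted above can be checked with a short back-of-the-envelope calculation (float32 weights assumed):

```python
# Classification-layer parameter storage, float32, k = 200 classes.
C, D, k = 512, 8192, 200
bcnn_mb = C**2 * k * 4 / 2**20     # bilinear descriptor of C^2 = 262144 dims
ours_mb = D * k * 4 / 2**20        # low-rank polynomial descriptor, D = 8192
print(f"B-CNN FC layer: {bcnn_mb:.0f} MB")                             # -> 200 MB
print(f"low-rank FC layer: {ours_mb:.2f} MB ({ours_mb / bcnn_mb:.1%})")  # -> ~3.1%
```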
TABLE 2 Computational complexity comparison of multiple pooling schemes
Table 3 compares the lightweight fine-grained image recognition method with cross-layer feature interaction in a weak supervision scene of the present invention with mainstream fine-grained recognition methods. The data show that the method's comprehensive performance in recognition accuracy, model complexity and technical feasibility is at the current leading level. Two-Level is a fine-grained image recognition model based on weak supervision information that trains object-level and part-level classifiers using image-level labels and the spatial position information of key target parts; its parameter count reaches 138.4M, 2.05 times that of the present invention, while its recognition accuracy on the CUB-200-2011 dataset is only 75.7%, 12.2% below the present invention. PN-CNN, a strongly supervised fine-grained recognition model, adds a pose alignment operation on the basis of Part-based R-CNN and obtains 85.4% recognition accuracy on CUB-200-2011. PN-CNN extracts features of the whole target region, the head region and the trunk region with three mutually independent AlexNet networks and fuses the feature groups by concatenation, so that the final feature description vector contains information on both the target region and the local key regions. The parameters of the three base networks differ, and each network contains a separate fully-connected layer, so the overall model parameter count grows to 173.0M. Following the same idea, Mask-CNN also adopts several mutually independent base sub-networks to perceive the global and local feature information of the target; unlike PN-CNN, Mask-CNN first applies global average pooling and max pooling after obtaining the convolution features, reducing the feature dimension, and then predicts the target category by concatenating the feature information of all sub-networks, an operation that markedly cuts the parameters and computation of the fully-connected layer. Compared with PN-CNN, Mask-CNN on the VGG16 base network reaches 85.7% recognition accuracy, an improvement of 0.3%, with only 60.5M model parameters, a compression of nearly 65.0%. The recognition accuracy of the present method is 87.9%, 0.6% above the ResNet-50-based Mask-CNN, while its model parameters amount to only 77.8% of Mask-CNN's. Moreover, Mask-CNN requires the spatial position information of key parts in addition to image-level category labels during training, so it is inferior to the present method in recognition accuracy, computational complexity and technical feasibility. RA-CNN is a recurrent self-attention fine-grained recognition model composed of a triple network, in which each sub-network contains a classification module and an attention proposal network (APN) module.
RA-CNN continuously enlarges local regions through the APN so that the model gradually focuses on key target parts during training, obtaining recognition accuracies of 85.3%, 88.2% and 92.5% on the CUB-200-2011, FGVC Aircraft and Stanford Cars datasets, respectively, 2.6%, 3.7% and 1.6% below the present method. Because RA-CNN contains a triple network and is trained serially, its model parameters reach 429.0M, 6.36 times those of the present invention. MA-CNN is a weakly supervised fine-grained recognition model built on a single base network. MA-CNN uses a channel grouping module to autonomously generate attention regions, extracts the corresponding feature information, and feeds each attention region into a separate fully-connected layer; with 4 attention regions its parameter count reaches 144M, 2.14 times that of the invention. Meanwhile, MA-CNN trains the channel grouping module and the classification module alternately, a more complicated training scheme that easily falls into a locally optimal solution, whereas the present method updates the model parameters in an end-to-end training manner.
TABLE 3 Performance comparison of the present invention with classic fine-grained image recognition models
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention in any way, but any modifications or equivalent variations made according to the technical spirit of the present invention are within the scope of the present invention as claimed.

Claims (3)

1. A method for recognizing a lightweight fine-grained image with cross-layer feature interaction in a weak supervision scene, characterized by comprising the following steps:
(1) in the preprocessing stage, an original image of arbitrary size is uniformly scaled to 600×600 pixels, a 448×448-pixel region is cropped about the image center, the cropped region is normalized with the mean [0.485, 0.456, 0.406] and standard deviation [0.229, 0.224, 0.225], and the normalized image is then input to the lightweight base feature extraction network ResNet-G;
(2) the feature tensors output by three different convolution layers of the lightweight base network ResNet-G for the input image are, respectively,

$$X \in \mathbb{R}^{H_1\times W_1\times C_1}, \quad Y \in \mathbb{R}^{H_2\times W_2\times C_2}, \quad Z \in \mathbb{R}^{H_3\times W_3\times C_3}$$

wherein $H_i$, $W_i$ and $C_i$, $i = 1, 2, 3$, respectively represent the height, width and channel number of the convolution features;
(3) $X$, $Y$ and $Z$ are passed in parallel through three polynomial convolution modules, 1×1 convolutions with stride $S_i$, $C_i$ input channels and $D$ output channels, which perform mutually independent linear mappings and generate the projection feature tensors

$$\bar{X}, \bar{Y}, \bar{Z} \in \mathbb{R}^{H\times W\times D}$$

wherein $H$ and $W$ respectively represent the height and width of the projected features, $D$ represents the projection dimension, and the convolution stride $S_i$ is calculated from the height $H_i$ of each convolution layer's output feature tensor and the height $H$ of the projected feature tensor;
(4) the interaction information among the projection features of the convolution layers is modeled by low-rank approximate polynomial kernel pooling, the polynomial kernel order being taken as $r = 2$ so that the linear classifier has discrimination capability equivalent to a second-order polynomial kernel classifier:

$$\mathcal{P}^{XY} = \bar{X} \circ \bar{Y}, \qquad \mathcal{P}^{XZ} = \bar{X} \circ \bar{Z}, \qquad \mathcal{P}^{YZ} = \bar{Y} \circ \bar{Z}$$

wherein $\circ$ represents the tensor dot-product operation and $\mathcal{P}^{XY}$, $\mathcal{P}^{XZ}$ and $\mathcal{P}^{YZ} \in \mathbb{R}^{H\times W\times D}$ represent the cross-layer second-order polynomial feature tensors;
(5) the feature information of all spatial positions in each channel of the second-order polynomial feature tensors is aggregated by a global average pooling operation to obtain polynomial feature vectors, further compressing the feature vector dimensionality:

$$p^{XY} = \frac{1}{HW}\sum_{s\in S}\mathcal{P}^{XY}_{s}, \qquad p^{XZ} = \frac{1}{HW}\sum_{s\in S}\mathcal{P}^{XZ}_{s}, \qquad p^{YZ} = \frac{1}{HW}\sum_{s\in S}\mathcal{P}^{YZ}_{s}$$

wherein $p^{XY}$, $p^{XZ}$ and $p^{YZ} \in \mathbb{R}^{D}$ respectively represent the cross-layer second-order polynomial feature vectors of the corresponding feature tensors, and $S = \{1, 2, \dots, HW\}$ represents the set of all spatial positions of the feature map;
(6) all cross-layer polynomial feature vectors are converged through feature concatenation, outputting the fine-grained image feature description vector

$$\phi = \left[\,p^{XY};\ p^{XZ};\ p^{YZ}\,\right] \in \mathbb{R}^{3D};$$
(7) the image feature description vector is normalized by element-wise signed square-root normalization:

$$\tilde{\phi} = \operatorname{sign}(\phi)\sqrt{\lvert\phi\rvert};$$
(8) the image feature description vector is standardized using $L_2$ regularization:

$$\hat{\phi} = \tilde{\phi}\,/\,\lVert\tilde{\phi}\rVert_2;$$
(9) the standardized feature description vector $\hat{\phi}$ is input to the classification fully-connected layer:

$$\theta = P\hat{\phi}$$

wherein $\theta \in \mathbb{R}^{k}$ represents the output vector of the classification fully-connected layer, $P \in \mathbb{R}^{k\times 3D}$ represents the weight parameter matrix of the classification fully-connected layer, and $k$ represents the number of target categories;
(10) the probability that the input image belongs to each category is calculated with the softmax function:

$$\eta_i = \frac{\exp(\theta_i)}{\sum_{j=1}^{k}\exp(\theta_j)}, \qquad i = 1, 2, \dots, k$$

wherein $\eta_i$ represents the probability that the input image belongs to the $i$-th category.
2. The method for recognizing a lightweight fine-grained image with cross-layer feature interaction in a weak supervision scene according to claim 1, characterized in that the lightweight base feature extraction network ResNet-G used in step (1) is a network structure formed by replacing the original Bottleneck residual module of the deep residual network ResNet-50 framework with a novel residual module based on multi-layer aggregation grouped convolution; the novel residual module divides feature extraction into a convolution-operation branch and an identity-mapping branch, feature information propagates in the two branches in parallel and is finally merged and output by element-wise addition, and in the convolution-operation branch the novel residual module partially decouples the synchronized learning of the spatial and channel relations of the feature tensor performed by conventional convolution, thereby simplifying the convolution operation; with the input and output data dimension set to 256, the intermediate layer dimension set to 64, the number of groups set to g, and the number of output channels of each group's sub-convolution layer set to m, the convolution-operation branch of the novel residual module comprises the following specific steps:
(1) a convolution layer with kernel size 1×1, 256 input channels and 64 output channels reduces the feature dimension;
(2) the dimension-reduced feature tensor is divided into $g$ groups at the channel level, the number of channels corresponding to each group of sub-features being $64/g$, and all sub-feature groups are numbered;
(3) the 1st group of features generates corresponding copy information, one pass-through path being used for the subsequent feature concatenation operation while the other is superposed onto the 2nd group of features, and feature information is further extracted by a 3×3 convolution with $64/g$ input channels and $m$ output channels;
(4) the output features again generate corresponding copy information, one pass-through path being used for the feature concatenation while the other is superposed onto the 3rd group of features and passed through a convolution layer with kernel size 3×3, $64/g$ input channels and $m$ output channels;
(5) and so on in sequence until all $g$ groups of features have passed through a 3×3 convolution operation, after which each group's pass-through features are concatenated with the $g$-th group's convolution features to obtain 64-dimensional feature information;
(6) a 1 x 1 convolution with an input channel number of 64 and an output channel number of 256 is used to restore the 64-dimensional features to the original dimensions 256.
3. The cross-layer feature interaction lightweight fine-grained image recognition method under a weak supervision scene according to claim 1, characterized in that the low-rank approximate polynomial kernel pooling in step (4) fits the homogeneous polynomial kernel function of a support vector machine through the idea of tensor decomposition, so that a linear classifier attains discrimination capability equivalent to that of a high-order polynomial kernel classifier; specifically, for the low-rank approximate polynomial kernel pooling to reach the performance of an r-order polynomial kernel classifier, first, r independent convolution layers, each with kernel size 1 × 1, C input channels, and $D_r$ output channels, are combined into an r-order polynomial convolution module; then, the preprocessed fine-grained image is fed through the lightweight basic network ResNet-G to obtain the convolution feature tensor $X \in \mathbb{R}^{H \times W \times C}$, which the r-order polynomial convolution module linearly maps into the set of projected feature tensors $\{Y_s\}_{s=1}^{r}$, where $Y_s \in \mathbb{R}^{H \times W \times D_r}$; finally, the r projected feature tensors are combined by the tensor dot-product operation to obtain the high-order statistical information of the features, $Z = Y_1 \circ Y_2 \circ \cdots \circ Y_r$, where $\circ$ denotes the tensor dot-product operation and $Z$ represents the approximate r-order statistics of the convolution features $X$ output by the basic network.
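To make the pooling concrete, here is a minimal PyTorch sketch under the same reading of the claim: r independent 1 × 1 convolutions project the backbone feature tensor, and their element-wise (Hadamard) product plays the role of the tensor dot product. The class name and the final spatial average pooling that turns Z into a vector are assumptions added for illustration, not taken from the claim text.

```python
import torch
import torch.nn as nn


class LowRankPolyKernelPooling(nn.Module):
    """Sketch of low-rank approximate polynomial kernel pooling: r independent
    1x1 convolutions (C -> D_r channels) combined by element-wise product."""

    def __init__(self, in_channels: int, rank_dim: int, order: int = 2):
        super().__init__()
        # the r-order polynomial convolution module: r independent 1x1 conv layers
        self.projections = nn.ModuleList(
            [nn.Conv2d(in_channels, rank_dim, kernel_size=1, bias=False) for _ in range(order)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) convolution feature tensor from the ResNet-G backbone
        z = self.projections[0](x)
        for proj in self.projections[1:]:
            z = z * proj(x)              # combine the projected tensors element-wise
        # assumed aggregation step: spatial average pooling -> (B, D_r)
        return z.mean(dim=(2, 3))


# Example: order = 2 gives a low-rank stand-in for bilinear (second-order) pooling,
# consistent with the claim that a linear classifier on Z matches an r-order
# polynomial kernel classifier.
pool = LowRankPolyKernelPooling(in_channels=2048, rank_dim=512, order=2)
print(pool(torch.randn(2, 2048, 14, 14)).shape)  # torch.Size([2, 512])
```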
CN202010505152.2A 2020-04-21 2020-06-05 Lightweight fine-grained image identification method for cross-layer feature interaction in weak supervision scene Active CN111652236B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2020103172058 2020-04-21
CN202010317205 2020-04-21

Publications (2)

Publication Number Publication Date
CN111652236A (en) 2020-09-11
CN111652236B (en) 2022-04-29

Family

ID=72347337

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010505152.2A Active CN111652236B (en) 2020-04-21 2020-06-05 Lightweight fine-grained image identification method for cross-layer feature interaction in weak supervision scene

Country Status (1)

Country Link
CN (1) CN111652236B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101547B (en) * 2020-09-14 2024-04-16 中国科学院上海微系统与信息技术研究所 Pruning method and device for network model, electronic equipment and storage medium
CN112183602B (en) * 2020-09-22 2022-08-26 天津大学 Multi-layer feature fusion fine-grained image classification method with parallel rolling blocks
CN112288028A (en) * 2020-11-06 2021-01-29 神思电子技术股份有限公司 Image identification method based on stream convolution
CN112465118B (en) * 2020-11-26 2022-09-16 大连理工大学 Low-rank generation type countermeasure network construction method for medical image generation
CN112381061B (en) * 2020-12-04 2022-07-12 中国科学院大学 Facial expression recognition method and system
CN112686242B (en) * 2020-12-29 2023-04-18 昆明理工大学 Fine-grained image classification method based on multilayer focusing attention network
CN112507982B (en) * 2021-02-02 2021-05-07 成都东方天呈智能科技有限公司 Cross-model conversion system and method for face feature codes
CN113222998B (en) * 2021-04-13 2022-05-31 天津大学 Semi-supervised image semantic segmentation method and device based on self-supervised low-rank network
CN113240659B (en) * 2021-05-26 2022-02-25 广州天鹏计算机科技有限公司 Heart nuclear magnetic resonance image lesion structure extraction method based on deep learning
CN113327284B (en) * 2021-05-27 2022-08-26 北京百度网讯科技有限公司 Image recognition method and device, electronic equipment and storage medium
CN113441411A (en) * 2021-07-31 2021-09-28 北京五指术健康科技有限公司 Garbage sorting device based on augmented reality
CN113343991B (en) * 2021-08-02 2023-06-09 四川新网银行股份有限公司 Weak supervision learning method with enhanced characteristics
CN114745465A (en) * 2022-03-24 2022-07-12 马斌斌 Interactive noise self-prior sensing analysis system for smart phone
CN116503671B (en) * 2023-06-25 2023-08-29 电子科技大学 Image classification method based on residual network compression of effective rank tensor approximation

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109409384A (en) * 2018-09-30 2019-03-01 内蒙古科技大学 Image-recognizing method, device, medium and equipment based on fine granularity image
CN109359684A (en) * 2018-10-17 2019-02-19 苏州大学 Fine granularity model recognizing method based on Weakly supervised positioning and subclass similarity measurement
CN110147841A (en) * 2019-05-22 2019-08-20 桂林电子科技大学 The fine grit classification method for being detected and being divided based on Weakly supervised and unsupervised component
CN110378356A (en) * 2019-07-16 2019-10-25 北京中科研究院 Fine granularity image-recognizing method based on multiple target Lagrange canonical
CN110689091A (en) * 2019-10-18 2020-01-14 中国科学技术大学 Weak supervision fine-grained object classification method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Fine-grained image classification based on multi-scale feature fusion; Li Siyao et al.; Laser &amp; Optoelectronics Progress; 2019-12-17; full text *

Also Published As

Publication number Publication date
CN111652236A (en) 2020-09-11

Similar Documents

Publication Publication Date Title
CN111652236B (en) Lightweight fine-grained image identification method for cross-layer feature interaction in weak supervision scene
Yu et al. Hierarchical bilinear pooling for fine-grained visual recognition
Lin et al. Transfer learning based traffic sign recognition using inception-v3 model
Wang et al. Deep learning algorithms with applications to video analytics for a smart city: A survey
CN106599797B (en) A kind of infrared face recognition method based on local parallel neural network
Ji et al. 3D convolutional neural networks for human action recognition
Su et al. Learning a dense multi-view representation for detection, viewpoint classification and synthesis of object categories
US10776628B2 (en) Video action localization from proposal-attention
Aurangzeb et al. Human behavior analysis based on multi-types features fusion and Von Nauman entropy based features reduction
CN111460980B (en) Multi-scale detection method for small-target pedestrian based on multi-semantic feature fusion
US11223782B2 (en) Video processing using a spectral decomposition layer
US20220156528A1 (en) Distance-based boundary aware semantic segmentation
CN113159067A (en) Fine-grained image identification method and device based on multi-grained local feature soft association aggregation
Singh et al. Image Understanding-a Brief Review of Scene Classification and Recognition.
Leibe et al. Learning semantic object parts for object categorization
Munoz Inference Machines Parsing Scenes via Iterated Predictions
CN114596589A (en) Domain-adaptive pedestrian re-identification method based on interactive cascade lightweight transformations
CN114998638A (en) Multi-view three-dimensional point cloud classification method based on dynamic and static convolution fusion neural network
Faruque et al. Vehicle classification in video using deep learning
Wang et al. Vehicle type classification via adaptive feature clustering for traffic surveillance video
Yang et al. Real-time pedestrian detection for autonomous driving
Özyurt et al. A new method for classification of images using convolutional neural network based on Dwt-Svd perceptual hash function
Sang et al. Image recognition based on multiscale pooling deep convolution neural networks
Wang et al. Robust person head detection based on multi-scale representation fusion of deep convolution neural network
Guo et al. Facial expression recognition: a review

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant