CN112418351B - Zero sample learning image classification method based on global and local context sensing - Google Patents

Zero sample learning image classification method based on global and local context sensing

Info

Publication number
CN112418351B
CN112418351B (application CN202011460544.8A / CN202011460544A)
Authority
CN
China
Prior art keywords
global
feature
local
feature map
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011460544.8A
Other languages
Chinese (zh)
Other versions
CN112418351A (en)
Inventor
王国威
陶文源
管乃洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN202011460544.8A priority Critical patent/CN112418351B/en
Publication of CN112418351A publication Critical patent/CN112418351A/en
Application granted granted Critical
Publication of CN112418351B publication Critical patent/CN112418351B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a zero sample learning image classification method based on global and local context sensing, which comprises the following steps: extracting features from the image with a deep neural network to obtain multi-layer feature maps; applying global attention to a feature map of any layer to obtain a feature map containing global information; applying local attention to the feature map of the same layer to obtain a feature vector representing local information; passing the last-layer global feature map through a fully connected layer to obtain a global feature vector; adding the multiple groups of local feature vectors element by element to obtain the complete local feature vector; concatenating the complete local feature vector with the global feature vector, projecting the result into the semantic space and the latent feature space, and performing parameter optimization with a softmax loss and a triplet loss respectively; and repeating the above steps for a number of training epochs to obtain a zero sample learning model with strong representation capability, and classifying images with the trained zero sample learning model.

Description

Zero sample learning image classification method based on global and local context sensing
Technical Field
The invention relates to the field of image classification, in particular to a zero sample learning image classification method based on global and local context sensing.
Background
Deep learning has developed rapidly, and its applications have been deployed in many fields (computer vision, natural language processing, etc.), because deep learning can exploit massive amounts of data for model training and thereby achieve strong recognition capability. However, the training samples may not cover all classes. In particular, real-world data inherently follows a long-tail distribution: only a few common classes provide a large number of samples, while for most uncommon classes only a very limited number of samples can be collected. Reflected in deep learning, this means that a model can reach satisfactory recognition accuracy for common classes, for which training samples are abundant, while for uncommon classes its recognition ability is fundamentally weaker; for classes with no training samples at all, the recognition ability is zero. In practice, however, a model must not only obtain strong recognition ability from the collected data but also be able to recognize entirely new categories for which no training samples exist. New categories such as new species and new models of electronic devices appear every day, so the ability to recognize unseen categories is a key turning point in the development of deep learning systems, and this task of recognizing unseen categories can be addressed by zero sample learning.
Zero sample learning is a deep learning technique that mimics the recognition ability of the human brain. Lampert notes that humans can recognize roughly 30,000 basic classes, as well as fine-grained subclasses of these classes. Beyond recognizing categories they have seen and using that knowledge to identify fine-grained subcategories, humans can also recognize entirely new categories or concepts; for example, a person who has never seen a zebra can accurately identify one on first sight from the description "looks like a horse with black and white stripes".
In the zero sample learning image classification task, the model can only use images from known (seen) classes during training, yet it must identify the classes of images from unknown (unseen) classes. This is possible because high-level semantic descriptions of object characteristics, such as attributes, are used, and the assumption that seen and unseen classes share the same set of attributes links the two. In general, zero sample learning proceeds as follows: in the training phase, the model learns a visual-semantic mapping; in the inference phase, an image of an unseen class is first converted into a semantic vector using the learned mapping, this vector is then compared with the ground-truth attribute vectors of the unseen classes, and the closest class is selected as the prediction.
Existing zero sample learning algorithms can be divided into two categories according to whether new training data is generated during the training phase: generative (model-based) algorithms and compatibility-based algorithms. The first type generates images or features from the semantic descriptions of unseen classes and trains on them together with the existing seen-class images in the conventional supervised manner. However, such methods have several shortcomings: the generated unseen-class images cannot restore fine details, the generated unseen-class features lack interpretability, and these methods ignore the importance of information-rich visual regions in the image. The second category directly uses semantic knowledge to learn a visual-semantic mapping by aligning the visual space with the semantic space. Most compatibility-based models focus on how to mine the discriminative local information of the object itself and how to better align the two spaces, but they ignore the positive contribution of global information to the zero sample learning task.
Disclosure of Invention
The invention provides a zero sample learning image classification method based on global and local context sensing, which considers global features and local features simultaneously, enhances the expressive power of the learned mapping, and thereby improves the performance of the zero sample learning model, as described in detail below:
a zero sample learning image classification method based on global and local context sensing comprises the following steps:
performing feature extraction on the image by using a deep neural network to obtain a multilayer feature map;
applying global attention to a feature map of any layer to obtain a feature map containing global information; applying local attention to the feature map of the same layer to obtain a feature vector representing local information;
obtaining a global feature vector from the last-layer global feature map through a fully connected layer; adding the multiple groups of local feature vectors element by element to obtain the complete local feature vector;
concatenating the complete local feature vector with the global feature vector, projecting the result into the semantic space and the latent feature space, and performing parameter optimization with a softmax loss and a triplet loss respectively;
and repeating the above steps for a number of training epochs to obtain a zero sample learning model with strong representation capability, and classifying images through the trained zero sample learning model.
The step of applying global attention to a feature map of any layer to obtain the feature map containing global information specifically includes:
obtaining the spatial self-attention weight matrix $A$, and using it to weight the re-dimensioned value feature $\tilde{V}$ to obtain the weighted feature $\tilde{X}_{w}$; adding the re-dimensioned feature $\tilde{X}$ to the weighted feature through a residual link to obtain $\tilde{X}_{out}$; re-dimensioning the obtained $\tilde{X}_{out}$ back to the size of the original feature map to obtain $X'$; and inputting $X'$ as a new feature map into the next layer of the neural network, applying the same operation to the multi-layer feature maps so that the global context information is transmitted to the last layer.
Further, the spatial self-attention module weight matrix is specifically:
$$A = \mathrm{softmax}_{col}\big(\tilde{Q}^{T}\tilde{K}\big)$$

wherein $\mathbb{R}$ denotes the dimension information of a variable, $\mathrm{softmax}_{col}$ computes the softmax score column by column, $\tilde{Q} \in \mathbb{R}^{C' \times L}$ is the re-dimensioned query feature, $\tilde{K} \in \mathbb{R}^{C' \times L}$ is the re-dimensioned key feature, $T$ denotes transposition, $L = H \times W$ is the product of the length and the width of the feature map, and $\tilde{X} \in \mathbb{R}^{C \times L}$ is the re-dimensioned feature map.
Wherein the weighted values
$\tilde{X}_{w}$ is obtained as follows:

$$\tilde{X}_{w} = \alpha \tilde{V} A$$

$$\tilde{X}_{out} = \tilde{X}_{w} + \tilde{X}$$

wherein $\alpha$ is a balance factor, $C$ is the number of channels of the feature map, and $\tilde{V} \in \mathbb{R}^{C \times L}$ is the re-dimensioned value feature map.
Further, the step of applying local attention to the feature map of the same layer to obtain the feature vector representing local information specifically includes:
computing a transformation with the spatial transformer and performing matrix multiplication with the original feature map to obtain a plurality of corresponding regions $R_{s}$, and extracting features from each region $R_{s}$ with an Inception block:
processing the extracted features $IR$ with global max pooling and global average pooling; processing the $IR_{l}$ obtained from the plurality of regions by element-wise addition to obtain the feature that finally represents the local regions; and respectively learning the visual-semantic mapping and the visual-latent mapping, followed by concatenation.
The technical scheme provided by the invention has the following beneficial effects:
1. by training directly on the original image samples, the method makes the model better adapted to the zero sample learning classification task;
2. the invention uses a global attention module to extract global context information from the original feature map and generate a feature map containing global information, so that the extracted global features have strong expressive power and the model's global understanding of the object is enhanced;
3. the method uses a local attention module to extract local context information from the original feature map to obtain local feature vectors, applies the same steps to multiple feature maps, and finally sums the local feature vectors to obtain the complete local feature vector, enhancing the model's local understanding of the object;
4. the method obtains a complete feature representation by feature concatenation, taking both global and local information into account, which greatly improves the representation capability and accuracy of the model;
5. the method projects the image features into the semantic space and the latent space simultaneously, and uses a softmax loss and a triplet loss respectively to optimize and update the parameters.
Drawings
FIG. 1 is a flow chart of the zero sample learning image classification method based on global and local context sensing;
FIG. 2 is a schematic diagram of a global attention module;
FIG. 3 is a schematic diagram of the spatial transformer;
FIG. 4 is a schematic diagram of the Inception network.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Example 1
A zero sample learning image classification method based on global and local context sensing, referring to FIG. 1, comprises the following steps:
101: carrying out feature extraction on the image by using a deep neural network to obtain a multilayer feature map;
102: calculating any layer of feature map by using global attention to obtain a feature map containing global information;
103: calculating the feature map of the same layer with local attention to obtain a feature vector representing local information;
104: repeating the operations of steps 102 and 103 for multiple layers to obtain a plurality of global feature maps and local feature vectors;
105: obtaining a global feature vector from the last layer of global feature map through a full connection layer; performing element-by-element addition on the multiple groups of local feature vectors to obtain complete local feature vectors;
106: concatenating the complete local feature vector with the global feature vector, projecting the result into the semantic (attribute) space and the latent feature space simultaneously, and performing parameter optimization with a softmax loss and a triplet loss respectively;
107: repeating the above steps for a number of training epochs to finally obtain a zero sample learning model with strong representation capability, and classifying images through the trained zero sample learning model.
In summary, in the embodiment of the present invention, the feature maps extracted from the image by the deep neural network are processed with global attention to obtain new feature maps containing global information, and local features are obtained by computing local attention on each feature map; multiple groups of feature maps are processed in this way, the resulting features are fused, and the fused features are projected into the semantic (attribute) space and the latent feature space simultaneously. In this way the learned features are enhanced, the expressive power of the learned mapping is improved, and the classification accuracy of the model is increased.
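For illustration only, the following PyTorch-style sketch shows how the forward pass of steps 101-106 could be wired together. The class and module names (GLCANet, GlobalAttention, LocalAttention, the backbone split into stages) and the use of lazy linear layers are assumptions made for this sketch, not the reference implementation of the patent.

```python
import torch
import torch.nn as nn

class GLCANet(nn.Module):
    """Sketch of the global-and-local context pipeline (steps 101-106)."""
    def __init__(self, backbone_stages, global_attns, local_attns, attr_dim, latent_dim):
        super().__init__()
        self.stages = nn.ModuleList(backbone_stages)      # multi-layer feature maps (step 101)
        self.global_attns = nn.ModuleList(global_attns)   # one global-attention module per stage (step 102)
        self.local_attns = nn.ModuleList(local_attns)     # one local-attention module per stage (step 103)
        self.fc_global = nn.LazyLinear(attr_dim)          # global feature vector from the last feature map (step 105)
        self.to_semantic = nn.LazyLinear(attr_dim)        # projection into the semantic (attribute) space (step 106)
        self.to_latent = nn.LazyLinear(latent_dim)        # projection into the latent feature space (step 106)

    def forward(self, x):
        local_vecs = []
        for stage, g_attn, l_attn in zip(self.stages, self.global_attns, self.local_attns):
            x = stage(x)                    # feature map of this stage
            local_vecs.append(l_attn(x))    # local feature vector of this stage (step 103)
            x = g_attn(x)                   # feature map enriched with global context, fed to the next stage (step 102)
        z_local = torch.stack(local_vecs).sum(dim=0)      # element-wise sum -> complete local feature vector (step 105)
        z_global = self.fc_global(x.flatten(1))           # global feature vector from the last-layer feature map
        z = torch.cat([z_local, z_global], dim=1)         # concatenation of local and global features (step 106)
        return self.to_semantic(z), self.to_latent(z)     # inputs to the softmax and triplet losses
```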
Example 2
The scheme of example 1 is further described below with reference to specific calculation formulas and examples, which are described in detail below:
first, the basic setup is introduced:
The training set $D_{s} = \{(x_{i}^{s}, y_{i}^{s})\}_{i=1}^{N_{s}}$ contains $N_{s}$ samples, where $x_{i}^{s}$ is the $i$-th image of a known (seen) class and $y_{i}^{s} \in Y_{s}$ is its corresponding class label. The test set $D_{u} = \{(x_{j}^{u}, y_{j}^{u})\}_{j=1}^{N_{u}}$ contains $N_{u}$ samples, where $x_{j}^{u}$ is the $j$-th image of an unknown (unseen) class and $y_{j}^{u} \in Y_{u}$ is its corresponding class label. The semantic features of the seen and unseen classes are denoted $A_{s}$ and $A_{u}$ respectively. The seen and unseen classes are disjoint: $Y_{s} \cap Y_{u} = \emptyset$ and $Y_{s} \cup Y_{u} = Y$. The projection of a visual feature into the semantic space is written $\phi(x) = \theta(x)^{T} W$, where $\theta(x)$ is the visual feature extracted by the deep neural network, $W$ is a transformation matrix, and $T$ denotes transposition. $\sigma(x)$ denotes the projection of the visual feature into the latent space.
In zero sample learning, only seen-class images and their semantic features (attributes) are available in the training phase, and the model must acquire the ability to predict unseen classes by learning a visual-semantic mapping or a visual-latent feature mapping.
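As a minimal illustration of this setup (all names below are assumptions for the sketch, not the patent's code), the two projections $\phi(x) = \theta(x)^{T} W$ and $\sigma(x)$ can be realised as linear heads over the backbone features, with disjoint seen and unseen label sets:

```python
import torch
import torch.nn as nn

class Projections(nn.Module):
    """phi(x) = theta(x)^T W into the semantic space, sigma(x) into the latent space."""
    def __init__(self, visual_dim: int, attr_dim: int, latent_dim: int):
        super().__init__()
        self.W = nn.Linear(visual_dim, attr_dim, bias=False)        # transformation matrix W
        self.W_lat = nn.Linear(visual_dim, latent_dim, bias=False)  # latent-space projection

    def forward(self, theta_x):                        # theta_x: visual features from the deep neural network
        return self.W(theta_x), self.W_lat(theta_x)    # phi(x), sigma(x)

# seen (training) and unseen (test) label sets are disjoint: Y_s ∩ Y_u = ∅
seen_classes, unseen_classes = {0, 1, 2}, {3, 4}
assert seen_classes.isdisjoint(unseen_classes)
```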
1. Global context information extraction
Convolutional layers are a core component of deep neural networks, but they are limited by the size of their convolution kernels, so the features extracted by a deep neural network inevitably contain mostly local information. For computer vision tasks such as image classification, image segmentation and object detection, however, extracting more global features is key to improving the representation capability of the model. Introducing global information into some layers can alleviate the limitation imposed by the convolution kernel size and improve the performance of the deep neural network. The key is therefore to be able to extract global information from the image.
The global self-attention module was first used in natural language processing tasks and has subsequently been widely applied in computer vision. Specifically, global self-attention can be computed as follows:
For an input feature map $X \in \mathbb{R}^{C \times H \times W}$, a set of convolutions with kernel size 1 × 1 first generates the query feature $Q$, the key feature $K$, the value feature $V$ and a re-dimensioned feature $\tilde{X} \in \mathbb{R}^{C \times L}$, where $Q, K \in \mathbb{R}^{C' \times H \times W}$, $C'$ denotes the reduced number of feature-map channels, $L = H \times W$, $\mathbb{R}$ denotes the dimension information of a variable, $C$ is the number of channels of the feature map, $H$ its length and $W$ its width.

$Q$ and $K$ are then re-dimensioned to obtain $\tilde{Q}, \tilde{K} \in \mathbb{R}^{C' \times L}$, and $V$ is re-dimensioned to obtain $\tilde{V} \in \mathbb{R}^{C \times L}$.
Then the spatial self-attention module weight matrix at this time can be expressed as:
$$A = \mathrm{softmax}_{col}\big(\tilde{Q}^{T}\tilde{K}\big) \qquad (1)$$
The obtained weight matrix is then used to weight the value feature $\tilde{V}$:

$$\tilde{X}_{w} = \alpha \tilde{V} A \qquad (2)$$
Wherein α is a balance factor.
To prevent loss of the original information, a residual link is adopted and the re-dimensioned feature $\tilde{X}$ is added to the weighted feature to obtain:

$$\tilde{X}_{out} = \tilde{X}_{w} + \tilde{X} \qquad (3)$$
Finally, the obtained $\tilde{X}_{out}$ is re-dimensioned back to the same size as the original feature map to give $X' \in \mathbb{R}^{C \times H \times W}$, and $X'$ is input as a new feature map into the next layer of the neural network. By applying the same operation at multiple layers of feature maps, the global context information can be transferred to the last layer.
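A minimal sketch of this global self-attention module, following formulas (1)-(3), is given below; the 1 × 1 convolutions and the balance factor come from the description above, while the exact initialisation and channel-reduction ratio are assumptions for illustration.

```python
import torch
import torch.nn as nn

class GlobalAttention(nn.Module):
    """Spatial self-attention of formulas (1)-(3): weight, residual link, reshape."""
    def __init__(self, channels: int, reduced: int):
        super().__init__()
        self.query = nn.Conv2d(channels, reduced, kernel_size=1)   # Q with C' channels
        self.key = nn.Conv2d(channels, reduced, kernel_size=1)     # K with C' channels
        self.value = nn.Conv2d(channels, channels, kernel_size=1)  # V with C channels
        self.alpha = nn.Parameter(torch.zeros(1))                  # balance factor alpha

    def forward(self, x):
        b, c, h, w = x.shape
        L = h * w
        q = self.query(x).view(b, -1, L)                     # re-dimensioned query (B, C', L)
        k = self.key(x).view(b, -1, L)                       # re-dimensioned key   (B, C', L)
        v = self.value(x).view(b, c, L)                      # re-dimensioned value (B, C,  L)
        attn = torch.softmax(q.transpose(1, 2) @ k, dim=1)   # formula (1): column-wise softmax of Q^T K
        weighted = self.alpha * (v @ attn)                    # formula (2): weight the value feature
        out = weighted + x.view(b, c, L)                      # formula (3): residual link with the re-dimensioned input
        return out.view(b, c, h, w)                           # reshape back to the original feature-map size
```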
2. Local context information extraction
The local attention module also takes a layer's feature map $X \in \mathbb{R}^{C \times H \times W}$ as input and outputs a local feature vector $Z \in \mathbb{R}^{k \times 1}$, where $k$ matches the dimension of the attribute feature. The module consists of three sub-modules: a spatial transformer, an Inception block, and global max/average pooling. The spatial transformer can be represented as a function $ST(\cdot)$ whose role is to help the network explicitly learn spatial invariance and translation invariance and to extend its range to affine and more general transformations. This means that the spatial transformer can learn a transformation that rectifies an object which has undergone an affine transformation:

$$\theta_{l} = \begin{bmatrix} r_{h} & 0 & t_{x} \\ 0 & r_{w} & t_{y} \end{bmatrix} \qquad (4)$$

where $(t_{x}, t_{y})$ are the two-dimensional spatial (translation) coordinates, $(r_{h}, r_{w})$ are the scale transformation factors, and $l$ corresponds to the feature map of the $l$-th layer. A plurality of corresponding regions is obtained by computing the spatial transformer output and matrix-multiplying it with the original feature map:
$$R_{s} = ST_{l}(X) \qquad (5)$$
For each extracted region $R_{s}$, features are extracted with the Inception block:
$$IR = \mathrm{Inception}(R_{s}) \qquad (6)$$
The extracted features $IR$ are then processed with global average pooling (GAP) and global max pooling (GMP) respectively:
$$IR_{l} = GAP(IR) + GMP(IR) \qquad (7)$$
The features obtained at this point encode the important information of the local region. The $IR_{l}$ obtained from the multiple regions are processed by element-wise addition to obtain the feature that finally represents the local regions:

$$Z = \sum_{l} IR_{l} \qquad (8)$$
The model needs to learn two mappings, a visual-semantic mapping and a visual-latent mapping, corresponding to the two mapping matrices $W_{a}$ and $W_{b}$ respectively. For computational convenience, $Z$ is concatenated with itself so that its dimension becomes $2k$.
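A sketch of this local-attention branch, following formulas (4)-(7), is shown below. The localisation head that predicts $(r_{h}, r_{w}, t_{x}, t_{y})$ and the simplified Inception-style block are assumptions for illustration, and the warped region is obtained with an affine grid rather than an explicit matrix multiplication.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalAttention(nn.Module):
    """Spatial transformer + Inception-style block + GAP/GMP, formulas (4)-(7)."""
    def __init__(self, channels: int, k: int):
        super().__init__()
        # localisation network predicting the parameters of formula (4)
        self.loc = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, 4))
        # stand-in "Inception" block: parallel 1x1 / 3x3 / 5x5 convolutions, concatenated
        self.branch1 = nn.Conv2d(channels, k // 2, kernel_size=1)
        self.branch3 = nn.Conv2d(channels, k // 4, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(channels, k - k // 2 - k // 4, kernel_size=5, padding=2)

    def forward(self, x):
        b = x.size(0)
        params = self.loc(x)
        r_h, r_w = params[:, 0].sigmoid(), params[:, 1].sigmoid()   # scale factors in (0, 1)
        t_x, t_y = params[:, 2].tanh(), params[:, 3].tanh()         # translations in (-1, 1)
        theta = torch.zeros(b, 2, 3, device=x.device)               # formula (4): scale + translation affine matrix
        theta[:, 0, 0], theta[:, 0, 2] = r_h, t_x
        theta[:, 1, 1], theta[:, 1, 2] = r_w, t_y
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        region = F.grid_sample(x, grid, align_corners=False)        # formula (5): region R_s cropped from X
        ir = torch.cat([self.branch1(region), self.branch3(region), self.branch5(region)], dim=1)  # formula (6)
        gap = F.adaptive_avg_pool2d(ir, 1).flatten(1)
        gmp = F.adaptive_max_pool2d(ir, 1).flatten(1)
        return gap + gmp                                             # formula (7): per-layer local vector IR_l
```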
3. Visual-semantic mapping and visual-latent mapping
The deep neural network is divided into multiple stages of feature maps according to their receptive field sizes. The global attention module extracts global context information from each feature map to obtain a new feature map that replaces the original one as the input of the next stage of the network, so the feature vector obtained at the last layer contains the global context information. This last-layer feature vector is then projected into the semantic space and the latent space through fully connected layers, producing two mappings: the visual-semantic mapping and the visual-latent mapping. The visual-semantic mapping is optimized with a softmax loss function, and the visual-latent mapping with a triplet loss function. The advantage is that the interpretability of the attributes is preserved while the discriminability of the latent attributes is also exploited.
For the visual-semantic mapping, let $a_{y}$ denote the semantic feature of category $y$; the compatibility score can then be expressed as:

$$s(x, y) = \theta(x)^{T} W_{a}\, a_{y} \qquad (9)$$
where $\theta(x)$ represents the visual feature and $W_{a}$ the visual-semantic mapping matrix to be learned. Treating the compatibility scores $s$ as the logits of a softmax, the softmax loss can be expressed as:

$$L_{att} = -\frac{1}{N_{s}} \sum_{i=1}^{N_{s}} \log p\big(y_{i} \mid x_{i}\big) \qquad (10)$$

where

$$p\big(y_{i} \mid x_{i}\big) = \frac{\exp\big(s(x_{i}, y_{i})\big)}{\sum_{y \in Y_{s}} \exp\big(s(x_{i}, y)\big)} \qquad (11)$$
for visual-implicit mapping, triple loss is adopted to minimize intra-class distance and maximize inter-class distance, so as to obtain implicit features with discriminativity:
Figure BDA0002831416280000076
where $x_{i}, x_{j}, x_{k}$ denote the anchor, positive and negative samples respectively, and $mrg$ is the margin, set to 1.0. Combining the visual-semantic mapping loss and the visual-latent mapping loss, the overall loss function can be expressed as:
$$L = L_{att} + \alpha L_{lat} \qquad (13)$$
where α is a balance factor and is set to 1.0.
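The two objectives of formulas (9)-(13) can be sketched as follows; the in-batch triplet construction shown here (the first same-class and different-class samples found in the batch) is an assumption, since the mining strategy is not specified above.

```python
import torch
import torch.nn.functional as F

def compatibility_scores(theta_x, W_a, seen_attrs):
    """s(x, y) = theta(x)^T W_a a_y for every seen class y (formula (9))."""
    return theta_x @ W_a @ seen_attrs.t()                 # (batch, num_seen_classes)

def total_loss(theta_x, sigma_x, labels, W_a, seen_attrs, alpha=1.0, margin=1.0):
    scores = compatibility_scores(theta_x, W_a, seen_attrs)
    l_att = F.cross_entropy(scores, labels)               # softmax loss, formulas (10)-(11)
    # naive in-batch triplets: first positive / negative found for each anchor
    anchors, positives, negatives = [], [], []
    for i in range(len(labels)):
        same = (labels == labels[i]).nonzero().flatten()
        diff = (labels != labels[i]).nonzero().flatten()
        pos = same[same != i]
        if len(pos) == 0 or len(diff) == 0:
            continue
        anchors.append(sigma_x[i])
        positives.append(sigma_x[pos[0]])
        negatives.append(sigma_x[diff[0]])
    l_lat = torch.zeros((), device=theta_x.device)
    if anchors:                                            # triplet loss, formula (12)
        l_lat = F.triplet_margin_loss(torch.stack(anchors), torch.stack(positives),
                                      torch.stack(negatives), margin=margin)
    return l_att + alpha * l_lat                           # overall loss, formula (13)
```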
4. Zero sample learning prediction
Since the visual-semantic mapping and the visual-latent feature mapping are learned simultaneously in the training phase, the testing phase treats them correspondingly. For the visual-semantic mapping, given a test image $x$ whose projection in the semantic space is $\phi(x)$, the goal is to assign it a class label:

$$\hat{y} = \arg\max_{u \in Y_{u}} \phi(x)^{T} a_{u} \qquad (14)$$
For the visual-latent feature mapping, given a test image $x$ whose projection in the latent space is $\sigma(x)$, the prototype of each seen class $s$ is taken as the mean of the latent features of its training samples:

$$c_{s} = \frac{1}{\left|\{i : y_{i} = s\}\right|} \sum_{i :\, y_{i} = s} \sigma\big(x_{i}\big) \qquad (15)$$
For an unseen class $u$, its relationship to all the seen classes in the semantic space is first computed, i.e. the coefficients $\beta_{us}$ that express its semantic feature in terms of the seen-class semantic features:

$$a_{u} = \sum_{s \in Y_{s}} \beta_{us}\, a_{s} \qquad (16)$$
It is assumed that the unseen class $u$ shares the same relationship in the latent space as in the semantic space:

$$c_{u} = \sum_{s \in Y_{s}} \beta_{us}\, c_{s} \qquad (17)$$
The prediction of the entire hybrid model can then be expressed as:

$$\hat{y} = \arg\max_{u \in Y_{u}} \Big[ s\big(\phi(x), a_{u}\big) + s\big(\sigma(x), c_{u}\big) \Big] \qquad (18)$$
where s (·, ·) is a compatibility function.
in the embodiment of the present invention, except for the specific description of the model of each device, the model of other devices is not limited, as long as the device can perform the above functions.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and that the serial numbers of the above-described embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (3)

1. A zero sample learning image classification method based on global and local context sensing is characterized by comprising the following steps:
1) Carrying out feature extraction on the image by using a deep neural network to obtain a multilayer feature map;
2) Calculating any layer of feature map by using global attention to obtain a feature map containing global information;
3) Calculating the feature map of the same layer by using local attention to obtain a feature vector representing local information;
4) Repeating the operations of the step 2) and the step 3) for multiple layers to obtain a plurality of global feature maps and local feature vectors;
5) Obtaining a global feature vector from the last layer of global feature map through a full connection layer; performing element-by-element addition on the multiple groups of local feature vectors to obtain complete local feature vectors;
concatenating the complete local feature vector and the global feature vector, projecting the result to a semantic space and a latent feature space, and performing parameter optimization by respectively adopting a softmax loss and a triplet loss;
repeating the above steps for a plurality of training epochs to obtain a zero sample learning model with strong representation capability, and classifying the images through the trained zero sample learning model;
the calculating any layer of feature map by using global attention to obtain the feature map containing global information specifically comprises:
obtaining the spatial self-attention weight matrix $A$, and weighting the re-dimensioned value feature $\tilde{V}$ with the obtained weight matrix to obtain the weighted feature $\tilde{X}_{w}$; adding the re-dimensioned feature $\tilde{X}$ to the weighted feature through a residual link to obtain $\tilde{X}_{out}$; re-dimensioning the obtained $\tilde{X}_{out}$ back to the size of the original feature map to obtain $X'$; inputting $X'$ as a new feature map into the next layer of the neural network, adopting the same operation in the plurality of layers of feature maps, and transmitting the global context information to the last layer;
the method for calculating the feature map of the same layer by using local attention to obtain the feature vector representing the local information specifically comprises the following steps:
computing a transformation with the spatial transformer and performing matrix multiplication with the original feature map to obtain a plurality of corresponding regions $R_{s}$, and extracting features from each region $R_{s}$ with an Inception block;
processing the extracted features $IR$ with global max pooling and global average pooling; processing the $IR_{l}$ obtained from the plurality of regions by element-wise addition to obtain the feature that finally represents the local regions; and respectively learning a visual-semantic mapping and a visual-latent mapping, followed by concatenation.
2. The method according to claim 1, wherein the spatial self-attention module weight matrix is specifically:
$$A = \mathrm{softmax}_{col}\big(\tilde{Q}^{T}\tilde{K}\big)$$

wherein $\mathbb{R}$ denotes the dimension information of a variable, $\mathrm{softmax}_{col}$ computes the softmax score column by column over the matrix, $\tilde{Q} \in \mathbb{R}^{C' \times L}$ is the re-dimensioned query feature, $\tilde{K} \in \mathbb{R}^{C' \times L}$ is the re-dimensioned key feature, $T$ denotes transposition, and $L = H \times W$ is the product of the length and the width of the feature map.
3. The zero sample learning image classification method based on global and local context sensing according to claim 2, wherein the weighted feature $\tilde{X}_{w}$ is obtained as follows:

$$\tilde{X}_{w} = \alpha \tilde{V} A$$

$$\tilde{X}_{out} = \tilde{X}_{w} + \tilde{X}$$

wherein $\alpha$ is a balance factor, $C$ is the number of channels of the feature map, and $\tilde{V} \in \mathbb{R}^{C \times L}$ is the re-dimensioned value feature map.
CN202011460544.8A 2020-12-11 2020-12-11 Zero sample learning image classification method based on global and local context sensing Active CN112418351B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011460544.8A CN112418351B (en) 2020-12-11 2020-12-11 Zero sample learning image classification method based on global and local context sensing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011460544.8A CN112418351B (en) 2020-12-11 2020-12-11 Zero sample learning image classification method based on global and local context sensing

Publications (2)

Publication Number Publication Date
CN112418351A CN112418351A (en) 2021-02-26
CN112418351B true CN112418351B (en) 2023-04-07

Family

ID=74775587

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011460544.8A Active CN112418351B (en) 2020-12-11 2020-12-11 Zero sample learning image classification method based on global and local context sensing

Country Status (1)

Country Link
CN (1) CN112418351B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113298091A (en) * 2021-05-25 2021-08-24 SenseTime Group Ltd. Image processing method and device, electronic equipment and storage medium
CN113435531B (en) * 2021-07-07 2022-06-21 National University of Defense Technology Zero sample image classification method and system, electronic equipment and storage medium
CN113486981B (en) * 2021-07-30 2023-02-07 Xidian University RGB image classification method based on multi-scale feature attention fusion network
CN113673599B (en) * 2021-08-20 2024-04-12 Dalian Maritime University Hyperspectral image classification method based on correction prototype learning
CN116842329A (en) * 2023-07-10 2023-10-03 Hubei University Motor imagery task classification method and system based on electroencephalogram signals and deep learning


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10366166B2 (en) * 2017-09-07 2019-07-30 Baidu Usa Llc Deep compositional frameworks for human-like language acquisition in virtual environments

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109447115A (en) * 2018-09-25 2019-03-08 Tianjin University Fine-grained zero sample classification method based on multilayer semantic supervised attention model
CN109582960A (en) * 2018-11-27 2019-04-05 Shanghai Jiao Tong University Zero-shot example learning method based on structured association semantic embedding
CN110443273A (en) * 2019-06-25 2019-11-12 Wuhan University Adversarial zero sample learning method for cross-class recognition of natural images
CN111222471A (en) * 2020-01-09 2020-06-02 University of Science and Technology of China Zero sample training and related classification method based on self-supervised domain perception network
CN111598155A (en) * 2020-05-13 2020-08-28 Beijing University of Technology Fine-grained image weakly supervised target localization method based on deep learning
CN111881262A (en) * 2020-08-06 2020-11-03 Chongqing University of Posts and Telecommunications Text emotion analysis method based on multi-channel neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"Semantic-Guided Multi-Attention Localization for Zero-Shot Learning";Yizhe Zhu;《arXiv》;20191202;1-11页 *
"零样本学习中的细粒度图像分类研究";魏杰;《中国优秀硕士学位论文全文数据库信息科技辑》;20200215;正文18-42页 *

Also Published As

Publication number Publication date
CN112418351A (en) 2021-02-26

Similar Documents

Publication Publication Date Title
CN112418351B (en) Zero sample learning image classification method based on global and local context sensing
CN111476294B (en) Zero sample image identification method and system based on generation countermeasure network
CN108764063B (en) Remote sensing image time-sensitive target identification system and method based on characteristic pyramid
CN110059741B (en) Image recognition method based on semantic capsule fusion network
CN114067107B (en) Multi-scale fine-grained image recognition method and system based on multi-grained attention
CN111126482B (en) Remote sensing image automatic classification method based on multi-classifier cascade model
CN105825511A (en) Image background definition detection method based on deep learning
CN110633708A (en) Deep network significance detection method based on global model and local optimization
CN109583483A (en) A kind of object detection method and system based on convolutional neural networks
CN110991532B (en) Scene graph generation method based on relational visual attention mechanism
CN114332578A (en) Image anomaly detection model training method, image anomaly detection method and device
CN112861970B (en) Fine-grained image classification method based on feature fusion
CN111461213A (en) Training method of target detection model and target rapid detection method
CN115937774A (en) Security inspection contraband detection method based on feature fusion and semantic interaction
CN113159067A (en) Fine-grained image identification method and device based on multi-grained local feature soft association aggregation
CN114926693A (en) SAR image small sample identification method and device based on weighted distance
Lechgar et al. Detection of cities vehicle fleet using YOLO V2 and aerial images
CN114780767A (en) Large-scale image retrieval method and system based on deep convolutional neural network
CN113642602A (en) Multi-label image classification method based on global and local label relation
CN117173147A (en) Surface treatment equipment and method for steel strip processing
CN113688864B (en) Human-object interaction relation classification method based on split attention
CN114283289A (en) Image classification method based on multi-model fusion
CN111753915A (en) Image processing device, method, equipment and medium
CN110689071A (en) Target detection system and method based on structured high-order features
Yang et al. YOLOX with CBAM for insulator detection in transmission lines

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant