CN116433969A - Zero sample image recognition method, system and storable medium - Google Patents

Zero sample image recognition method, system and storable medium

Info

Publication number
CN116433969A
Authority
CN
China
Prior art keywords
class
visual
invisible
visible
classes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310303332.6A
Other languages
Chinese (zh)
Inventor
赵鹏
薛惠慧
姚晟
李麟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202310303332.6A priority Critical patent/CN116433969A/en
Publication of CN116433969A publication Critical patent/CN116433969A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a zero sample image recognition method, a system and a storage medium. The zero sample image recognition method comprises the following steps: acquiring a data set; designing an attention mechanism to extract discriminative visual features from the visible-class images; averaging the features of all images belonging to the same visible class to obtain the visual prototype of that visible class; obtaining visual prototypes of the invisible classes by transferring the semantic attribute relationships between the visible classes and the invisible classes; constructing a visual prototype graph from the relationships between the class visual prototypes and initializing the node representations; designing an encoder to propagate and aggregate node information to obtain a new latent space; training the model with the visible-class images and their labels; and predicting the invisible-class images with the trained model. The invention obtains the visual prototype representations of all classes through the attention mechanism and the semantic relationships between classes, obtains a discriminative latent space through propagation on the visual prototype graph, and carries out classification in the latent space, which improves the classification accuracy.

Description

Zero sample image recognition method, system and storable medium
Technical Field
The present invention relates to the field of image recognition technologies, and in particular to a zero sample image recognition method, a zero sample image recognition system, and a storage medium.
Background
With the development of deep neural networks, image classification has made tremendous progress in recent years. However, most successful models are based on supervised learning, which depends heavily on large numbers of labeled images for training. In many practical applications, collecting large-scale labeled data sets is expensive and time-consuming, and this problem becomes more serious for fine-grained data sets. Zero sample learning, which can recognize images from invisible classes, has therefore received increasing attention. Zero sample learning aims to recognize the invisible classes by applying knowledge learned from the visible classes. The classes whose labeled samples are given in the training phase are called visible classes; there are also unlabeled samples, and the classes containing these unlabeled samples are called invisible classes. The set of visible classes and the set of invisible classes are disjoint.
The invention patent application with publication number CN113505701A discloses a zero sample image recognition method based on a variational autoencoder combined with a knowledge graph. First, image features extracted by a convolutional neural network are encoded into low-dimensional feature vectors through a VAE and placed in a latent feature space; then, the class semantic vectors are sent to a knowledge-graph-based deep neural network module, the nodes in the graph are aggregated through a graph variational autoencoder, and the new low-dimensional semantic vectors generated after the encoding update are placed in the latent feature space; finally, the latent vectors generated by each modality are decoded with the decoder of the other modality under the condition of the same class, and the original data are reconstructed.
The method in the patent application with publication number CN113505701A constructs its graph from knowledge acquired from a knowledge graph. However, the relationships between categories contained in the knowledge graph are not accurate enough, so the relationship between the visible classes and the invisible classes cannot be modeled well, which weakens the ability to transfer knowledge; moreover, for some fine-grained data sets, such category relationships are difficult to acquire at all. Meanwhile, the visual features of an image contain abundant semantic information, but they also contain much background information and noise. When classifying images, only certain discriminative visual regions are useful for classification, especially for fine-grained images, where the differences between images of different classes are small. The method of the patent application with publication number CN113505701A extracts image features using only a pre-trained network, so its mining of discriminative visual features is insufficient.
Disclosure of Invention
In order to solve the above problems, the invention aims to provide a zero sample image recognition method, a zero sample image recognition system and a storage medium, in which the visual prototype representations of all classes are obtained through an attention mechanism and the semantic relationships between classes, a latent space is then obtained through propagation on a visual prototype graph, and classification is carried out in the latent space, which improves the classification accuracy.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows:
a zero sample image recognition method comprising the steps of:
S1, acquiring a data set comprising visible classes and invisible classes, wherein a visible class is a class whose images appear in the training set, with images, class labels and class semantic attributes available, and an invisible class is a class whose images do not appear in the training set, with only its class semantic attributes available; the invisible-class images are used in the prediction and recognition stage;
S2, designing an attention mechanism to extract discriminative visual features from the visible-class images;
S3, averaging the features of all images belonging to the same visible class to obtain the visual prototype of that visible class;
S4, obtaining visual prototypes of the invisible classes by transferring the semantic attribute relationships between the visible classes and the invisible classes;
S5, constructing a visual prototype graph from the relationships between the class visual prototypes, and initializing the node representations;
S6, designing an encoder to propagate and aggregate node information so as to obtain a new latent space;
S7, training the model with the visible-class images and their labels;
S8, predicting the invisible-class images with the trained model.
Further, in step S2, an attention mechanism is used to obtain a discriminative feature v for each image x so as to remove irrelevant information;
specifically, an image x is first fed into a backbone network to obtain a feature map Z ∈ R^{W×H×C}, and K region blocks are obtained from the feature map Z through a spatial attention mechanism, with the aim of finding the most discriminative feature regions in the image;
the specific operations are as follows:
first, K mask blocks M_k ∈ R^{W×H} are learned through a convolution operation:
M_k = σ(Conv(Z)), k = 1, 2, ..., K
where Conv(·) denotes a 1×1 convolution operation and σ(·) denotes an activation function;
then, the K mask blocks are reshaped to the same size as the feature map Z and multiplied element-wise with Z to obtain K region blocks of the image:
R_k(Z) = M_k ⊙ Z, k = 1, 2, ..., K
where ⊙ denotes element-wise multiplication. Considering that the obtained K region blocks may contain background information, or that redundant information may exist between different region blocks, a threshold is applied to the K region blocks; the maximum value m_max over the K mask blocks is obtained as
m_max = max_{k=1,...,K} max(M_k)
next, a hyper-parameter α is designed and a threshold τ = α·m_max is set, where α is a number between 0 and 1; when the maximum value of the k-th mask block M_k is smaller than the threshold τ, the k-th region block R_k(Z) is set to 0; global max pooling is then applied to the K thresholded region blocks to obtain K region features r_k ∈ R^C;
finally, the K region features are concatenated and passed through a fully connected layer f_1 with input-output dimensions KC→C to obtain the discriminative visual feature v ∈ R^C corresponding to the image.
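As a concrete illustration of the spatial attention of step S2, the following PyTorch sketch extracts K region blocks, thresholds them with τ = α·m_max, and concatenates the pooled region features. The sigmoid activation, the default values of K and α, and all module and variable names are illustrative assumptions, not parameters fixed by the invention.

```python
import torch
import torch.nn as nn

class RegionAttention(nn.Module):
    """Minimal sketch of the region-attention feature extractor described in step S2."""

    def __init__(self, channels: int, num_regions: int = 4, alpha: float = 0.5):
        super().__init__()
        self.conv = nn.Conv2d(channels, num_regions, kernel_size=1)  # learns the K masks M_k
        self.fc = nn.Linear(num_regions * channels, channels)        # f_1: KC -> C
        self.alpha = alpha

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: backbone feature map of shape (B, C, W, H)
        masks = torch.sigmoid(self.conv(z))                       # M_k, shape (B, K, W, H)
        regions = masks.unsqueeze(2) * z.unsqueeze(1)              # R_k(Z) = M_k ⊙ Z, shape (B, K, C, W, H)
        m_max = masks.amax(dim=(1, 2, 3))                          # maximum over all K masks, shape (B,)
        keep = masks.amax(dim=(2, 3)) >= self.alpha * m_max[:, None]   # masks whose peak reaches tau = alpha * m_max
        regions = regions * keep[:, :, None, None, None].float()  # zero out region blocks below the threshold
        r = regions.amax(dim=(3, 4))                               # global max pooling -> r_k in R^C, shape (B, K, C)
        return self.fc(r.flatten(1))                               # discriminative visual feature v in R^C
```

In practice the input z would be the feature map from the last convolutional block of a pre-trained backbone; the choice of backbone is likewise an assumption here and is not specified by this section.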
Further, in step S3, after the discriminative visual feature v of each image is obtained, the features of all images belonging to the same visible class are averaged to obtain the visual prototype P_seen of that visible class:
P_seen^i = (1/n) Σ_{j=1}^{n} v_j
where n denotes the number of samples belonging to the i-th visible class and v_j denotes the discriminative visual feature of the j-th such sample.
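A minimal sketch of step S3, computing each visible-class prototype as the mean of the discriminative features of its samples; the function and argument names are assumptions.

```python
import torch

def class_prototypes(features: torch.Tensor, labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    # features: (N, C) discriminative features of all visible-class training images
    # labels:   (N,) class indices in [0, num_classes)
    prototypes = torch.zeros(num_classes, features.size(1))
    for c in range(num_classes):
        prototypes[c] = features[labels == c].mean(dim=0)  # P_seen^c = mean of the class-c features
    return prototypes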
Further, in step S4, a relationship matrix S ∈ R^{C_s×C_u} between the visible classes and the invisible classes is obtained by computing the cosine similarity between the attribute vectors of the visible classes and the invisible classes:
S_ij = cos(z_i, z_j), i ∈ C_s and j ∈ C_u
where z_i denotes the attribute vector of the i-th visible class, z_j denotes the attribute vector of the j-th invisible class, C_s denotes the number of visible classes and C_u denotes the number of invisible classes;
the visual prototypes of the invisible classes are obtained by transferring the semantic relationships between the visible classes and the invisible classes, i.e., the visual prototype matrix P_unseen of the invisible classes is obtained from the semantic relationship matrix between the visible classes and the invisible classes:
P_unseen = S^T P_seen
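The transfer in step S4 reduces to one cosine-similarity matrix followed by one matrix product, as in the following sketch; the tensor names are assumptions.

```python
import torch
import torch.nn.functional as F

def unseen_prototypes(attr_seen: torch.Tensor, attr_unseen: torch.Tensor,
                      p_seen: torch.Tensor) -> torch.Tensor:
    # attr_seen: (Cs, A) attribute vectors of visible classes
    # attr_unseen: (Cu, A) attribute vectors of invisible classes
    # p_seen: (Cs, C) visible-class visual prototypes
    s = F.normalize(attr_seen, dim=1) @ F.normalize(attr_unseen, dim=1).T  # S_ij = cos(z_i, z_j), shape (Cs, Cu)
    return s.T @ p_seen                                                    # P_unseen = S^T P_seen, shape (Cu, C)
```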
Further, in step S5, the visual prototypes of all classes can be obtained through the above operations, and a visual prototype graph G is constructed from the relationships between the class visual prototypes, with each node representing a class, including the visible classes and the invisible classes;
specifically, the edges in the visual prototype graph G are weighted by the cosine similarity between the class visual prototypes:
B_ij = cos(P_i, P_j)
the word vector representation a_i ∈ R^300 of each class attribute is obtained with the GloVe model, the word vector representations of all attributes are stacked into a matrix T ∈ R^{|A|×300} (where |A| denotes the number of class attributes), and T is multiplied by the class semantic vector to obtain the initial representation of each node:
E_i = z_i T
where z_i denotes the semantic vector of the i-th class and E_i denotes the initial vector of the i-th node.
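A short sketch of the graph construction and node initialization of step S5; the GloVe lookup is assumed to have been done beforehand (`attr_word_vecs` is a prebuilt (|A|, 300) matrix) and the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def build_prototype_graph(prototypes: torch.Tensor,        # (Cs+Cu, C) class visual prototypes
                          class_attrs: torch.Tensor,       # (Cs+Cu, |A|) class semantic vectors z_i
                          attr_word_vecs: torch.Tensor):   # (|A|, 300) GloVe vectors a_i of the attributes
    p = F.normalize(prototypes, dim=1)
    adjacency = p @ p.T                        # B_ij = cos(P_i, P_j)
    node_init = class_attrs @ attr_word_vecs   # E_i = z_i T, shape (Cs+Cu, 300)
    return adjacency, node_init
```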
Further, in step S6, the adjacency matrix B and the matrix E obtained by stacking the initial representations of all nodes are input into the encoder for propagation and aggregation of node information;
specifically, the encoder uses a graph convolutional neural network to aggregate and propagate information on the constructed visual prototype graph G to obtain a new latent space U, and the node update is expressed as:
H^(i) = σ(D^{-1/2} B D^{-1/2} H^{(i-1)} W^{(i-1)}), i = 1, 2
where D denotes the degree matrix of the adjacency matrix B, with D_ii = Σ_j B_ij, W^(i) denotes a parameter matrix, H^(0) = E denotes the initial input matrix of all nodes, and H^(2) = U is the output matrix of the second graph convolution layer, i.e., the matrix of the embedded vectors of all nodes updated on the visual prototype graph:
U = [u_1; u_2; ...; u_{C_s+C_u}] ∈ R^{(C_s+C_u)×d}
where u_i ∈ R^d denotes the embedded representation of node i;
the obtained latent representation matrix U is input into a decoder, which reconstructs the adjacency matrix through the inner products of the embedded vectors:
B̂ = σ(U U^T)
a reconstruction loss is constructed to minimize the difference between the reconstructed adjacency matrix B̂ represented by the node vectors and the original adjacency matrix B, so that the embedded vectors of the nodes conform to the structure of the graph; the reconstruction loss L_rec is designed as:
L_rec = ||B − B̂||_F^2
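The encoder-decoder of step S6 can be sketched as a two-layer graph convolution followed by an inner-product decoder. The layer sizes, the sigmoid on the reconstructed adjacency and the squared-error form of the reconstruction loss are assumptions consistent with, but not dictated by, the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeGraphAutoEncoder(nn.Module):
    """Sketch of the graph-convolutional encoder and inner-product decoder of step S6."""

    def __init__(self, in_dim: int = 300, hid_dim: int = 512, out_dim: int = 256):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim, bias=False)   # W^(0)
        self.w2 = nn.Linear(hid_dim, out_dim, bias=False)  # W^(1)

    @staticmethod
    def normalize_adj(b: torch.Tensor) -> torch.Tensor:
        d_inv_sqrt = b.sum(dim=1).clamp(min=1e-12).pow(-0.5)   # D^{-1/2}
        return d_inv_sqrt[:, None] * b * d_inv_sqrt[None, :]   # D^{-1/2} B D^{-1/2}

    def forward(self, b: torch.Tensor, e: torch.Tensor):
        a = self.normalize_adj(b)
        h = torch.relu(self.w1(a @ e))      # H^(1)
        u = torch.relu(self.w2(a @ h))      # H^(2) = U, latent class representations
        b_rec = torch.sigmoid(u @ u.T)      # decoder: adjacency reconstructed from inner products
        loss_rec = F.mse_loss(b_rec, b)     # reconstruction loss (squared-error form assumed here)
        return u, loss_rec
```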
further, in step S7, after the potential space U is obtained, the visual features and the labels extracted from the visual images by the attention mechanism are embedded into the potential space, and classified in the potential space;
specifically, the visual characteristic v E R of the visual image C Through one layer of full connection f 2 The input-output dimension is C-d, embedded into the latent space, and the similarity of the mapped visual features and the latent representation of each class is calculated:
Q ij =f 2 (v i )⊙U j
meanwhile, the cross entropy loss function is utilized to construct classification loss:
Figure BDA0004145852140000065
wherein when the ith sample belongs to the kth class, y ik =1, otherwise, y ik =0;N s Representing the number of visible class samples;
adding the reconstruction loss and the classification loss to obtain an integral loss function, and optimizing model parameters through gradient back propagation;
the total loss function of the final model is:
L=L cls +γL rec
where γ is a hyper-parameter.
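A sketch of the training objective of step S7; the softmax form inside the cross entropy and the default value of γ are assumptions, as are the function and argument names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def training_losses(v: torch.Tensor,        # (B, C) discriminative features of visible-class images
                    labels: torch.Tensor,   # (B,) visible-class labels
                    u: torch.Tensor,        # (num_classes, d) latent class representations U
                    f2: nn.Linear,          # fully connected layer mapping C -> d
                    loss_rec: torch.Tensor, # reconstruction loss from the graph autoencoder
                    gamma: float = 0.1) -> torch.Tensor:
    q = f2(v) @ u.T                          # Q_ij: similarity of each image to every class
    loss_cls = F.cross_entropy(q, labels)    # cross-entropy classification loss L_cls
    return loss_cls + gamma * loss_rec       # L = L_cls + gamma * L_rec
```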
Further, in step S8, the latent representations of all classes are obtained with the trained model; given a test image x, i.e., an invisible-class image, its visual feature v is obtained through the trained attention mechanism, the feature is then mapped into the latent space, the similarity with the class latent representations is computed, and the label prediction process is finally expressed as:
c* = argmax_{j ∈ C_u} f_2(v) ⊙ U_j
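Prediction in step S8 is then an argmax over the similarities to the invisible-class latent representations, for example:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def predict_unseen(v: torch.Tensor,         # (B, C) features of test images
                   u_unseen: torch.Tensor,  # (Cu, d) latent representations of the invisible classes
                   f2: nn.Linear) -> torch.Tensor:
    scores = f2(v) @ u_unseen.T             # similarity to each invisible class
    return scores.argmax(dim=1)             # predicted invisible-class index
```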
in order to achieve the above object, the present invention further provides a zero sample image recognition system, including:
a data set acquisition and definition module: used for acquiring a data set comprising visible classes and invisible classes, wherein a visible class is a class whose images appear in the training set, with images, class labels and class semantic attributes available, and an invisible class is a class whose images do not appear in the training set, with only its class semantic attributes available; the invisible-class images are used in the prediction and recognition stage;
a discriminative visual feature extraction module: used for extracting discriminative visual features from the visible-class images through an attention mechanism;
a visible-class visual prototype extraction module: used for averaging the features of all images belonging to the same visible class to obtain the visual prototype of that visible class;
an invisible-class visual prototype extraction module: used for obtaining visual prototypes of the invisible classes by transferring the semantic attribute relationships between the visible classes and the invisible classes;
a node initialization module of the visual prototype graph: used for constructing a visual prototype graph from the relationships between the class visual prototypes and initializing the node representations;
a latent space acquisition module: used for propagating and aggregating node information with a designed encoder to obtain a new latent space;
a training module: used for training the model with the visible-class images and their labels;
an invisible-class image classification module: used for predicting the invisible-class images with the trained model.
In order to achieve the above object, the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the zero sample image recognition method as described above.
The beneficial effects are as follows: the invention obtains the visual prototype representations of all classes through the attention mechanism and the semantic relationships between classes, obtains a discriminative latent space through propagation on the visual prototype graph, and carries out classification in the latent space, which improves the classification accuracy.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention. In the drawings:
FIG. 1 is a flowchart of a zero sample image recognition method according to an embodiment of the present invention;
FIG. 2 is a frame diagram of a training phase of a zero sample image recognition method according to an embodiment of the present invention;
FIG. 3 is a frame diagram of a predictive recognition stage of a zero sample image recognition method according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a zero sample image recognition system according to an embodiment of the present invention.
Detailed Description
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.
The invention will be described in detail below with reference to the drawings in connection with embodiments.
The goal of zero sample learning is to classify images that have not been seen during the training phase. Knowledge is transferred from the visible classes to the invisible classes by establishing relationships between different classes through auxiliary semantic information. When semantic information is used to transfer knowledge from the visible classes to the invisible classes in zero sample learning, one goal is to establish an association between the visual domain and the semantic domain. Typically, this association is determined by learning an embedding space in which semantic vectors and visual features interact. There are three kinds of mapping methods for learning such an embedding space: embedding based on the semantic space, embedding based on the visual space, and embedding based on a common space. The invention of CN113505701A belongs to the zero sample learning methods based on common-space embedding.
The invention of CN113505701A suffers from the following technical drawbacks. First, because visual features and semantic representations are distributed in different spaces and their dimensions differ greatly, information loss can occur when embedding into either space. Second, measuring only through a compatibility function cannot achieve good interaction between visual features and semantic representations. In addition, the visual features of images contain rich semantic information but also much class-irrelevant information, and some visually similar images may be misclassified because of such class-irrelevant information. Finally, manually defined class attributes, although accurate, may ignore some important information because of their limited dimensionality, thereby reducing the ability to transfer knowledge.
Example 1
Based on the above theoretical research analysis of the prior art, see fig. 1-3: the embodiment provides a zero sample image recognition method, which comprises the following steps:
S1, acquiring a data set comprising visible classes and invisible classes, wherein a visible class is a class whose images appear in the training set, with images, class labels and class semantic attributes available, and an invisible class is a class whose images do not appear in the training set, with only its class semantic attributes available; the invisible-class images are used in the prediction and recognition stage;
It should be noted that the public data sets used by the model of this embodiment may include: the fine-grained bird data set CUB-200-2011 Birds (CUB), the animal data set Animals with Attributes 2 (AWA2), the scene data set SUN Attribute (SUN) and the aPascal and aYahoo (aPY) data set;
the public data sets are divided: all categories of each data set are divided into disjoint visible classes and invisible classes, and the corresponding images and class semantic attributes are obtained respectively; the visible-class images and the class semantic attributes are used in the model training stage, and the invisible-class images are used for testing in the prediction and recognition stage;
the CUB data set contains 200 categories, including 150 visible classes and 50 invisible classes, with 11,788 images in total, and each category has 312-dimensional semantic attributes; the AWA2 data set has 50 categories, including 40 visible classes and 10 invisible classes, with 37,322 images in total, and each category has 85-dimensional semantic attributes; the SUN data set contains 717 categories, including 645 visible classes and 72 invisible classes, with 14,340 images in total, and each category has 102-dimensional semantic attributes; the aPY data set has 32 categories, including 20 visible classes and 12 invisible classes, with 15,339 images in total, and each category has 64-dimensional semantic attributes.
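For reference, the splits described above can be written as a small configuration structure; the numbers are copied from the description, while the key names and structure are assumptions.

```python
# Benchmark splits used in this embodiment (numbers from the description above).
DATASET_SPLITS = {
    "CUB":  {"seen": 150, "unseen": 50, "images": 11788, "attr_dim": 312},
    "AWA2": {"seen": 40,  "unseen": 10, "images": 37322, "attr_dim": 85},
    "SUN":  {"seen": 645, "unseen": 72, "images": 14340, "attr_dim": 102},
    "aPY":  {"seen": 20,  "unseen": 12, "images": 15339, "attr_dim": 64},
}
```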
S2, designing an attention mechanism to extract discriminative visual features from the visible-class images;
S3, averaging the features of all images belonging to the same visible class to obtain the visual prototype of that visible class;
S4, obtaining visual prototypes of the invisible classes by transferring the semantic attribute relationships between the visible classes and the invisible classes;
S5, constructing a visual prototype graph from the relationships between the class visual prototypes, and initializing the node representations;
S6, designing an encoder to propagate and aggregate node information so as to obtain a new latent space;
S7, training the model with the visible-class images and their labels;
S8, predicting the invisible-class images with the trained model.
In this embodiment, the visual prototype representations of all classes are obtained through the attention mechanism and the semantic relationships between classes, a discriminative latent space is then obtained through propagation on the visual prototype graph, and classification is carried out in the latent space, which improves the classification accuracy.
The zero sample image recognition method can meet the recognition requirements for various invisible-class images, reduces the manpower and material resources consumed by image annotation under supervised learning, improves the performance of the task of recognizing invisible-class images, and accelerates the research and application of zero sample classification in practical scenarios.
Unlike the method of the invention patent application with publication number CN113505701A, which only uses a pre-trained network to extract image features, this embodiment obtains the discriminative visual features of the visible-class images through an attention mechanism so as to remove some irrelevant information such as background, which makes the visual features more discriminative. Meanwhile, unlike the graph constructed from a knowledge graph in the patent application with publication number CN113505701A, this embodiment constructs a visual prototype graph from the relationships between the visual prototypes of all classes (including the visible classes and the invisible classes), so that the constructed graph structure is easier to obtain, the relationships between classes are more accurate, and the knowledge transfer capability is improved. In addition, unlike the initialization of nodes with a single semantic representation in the patent application with publication number CN113505701A, this embodiment fuses multiple semantic representations and improves the semantic representation of each class.
In a specific example, in step S2, an attention mechanism is used to obtain a discriminative feature v for each image x so as to remove irrelevant information;
specifically, an image x is first fed into a backbone network to obtain a feature map Z ∈ R^{W×H×C}, and K region blocks are obtained from the feature map Z through a spatial attention mechanism, with the aim of finding the most discriminative feature regions in the image;
the specific operations are as follows:
first, K mask blocks M_k ∈ R^{W×H} are learned through a convolution operation:
M_k = σ(Conv(Z)), k = 1, 2, ..., K
where Conv(·) denotes a 1×1 convolution operation and σ(·) denotes an activation function;
then, the K mask blocks are reshaped to the same size as the feature map Z and multiplied element-wise with Z to obtain K region blocks of the image:
R_k(Z) = M_k ⊙ Z, k = 1, 2, ..., K
where ⊙ denotes element-wise multiplication. Considering that the obtained K region blocks may contain background information, or that redundant information may exist between different region blocks, a threshold is applied to the K region blocks; the maximum value m_max over the K mask blocks is obtained as
m_max = max_{k=1,...,K} max(M_k)
next, a hyper-parameter α is designed and a threshold τ = α·m_max is set, where α is a number between 0 and 1; when the maximum value of the k-th mask block M_k is smaller than the threshold τ, the k-th region block R_k(Z) is set to 0; global max pooling is then applied to the K thresholded region blocks to obtain K region features r_k ∈ R^C;
finally, the K region features are concatenated and passed through a fully connected layer f_1 with input-output dimensions KC→C to obtain the discriminative visual feature v ∈ R^C corresponding to the image.
This embodiment proposes to acquire the visual prototype of each visible class from the discriminative visual features of the visible-class images, so that the acquired visible-class visual prototypes are more discriminative.
In a specific example, in step S3, after the discriminative visual feature v of each image is obtained, the features of all images belonging to the same visible class are averaged to obtain the visual prototype P_seen of that visible class:
P_seen^i = (1/n) Σ_{j=1}^{n} v_j
where n denotes the number of samples belonging to the i-th visible class and v_j denotes the discriminative visual feature of the j-th such sample.
In a specific example, in step S4, a relationship matrix S ∈ R^{C_s×C_u} between the visible classes and the invisible classes is obtained by computing the cosine similarity between the attribute vectors of the visible classes and the invisible classes:
S_ij = cos(z_i, z_j), i ∈ C_s and j ∈ C_u
where z_i denotes the attribute vector of the i-th visible class, z_j denotes the attribute vector of the j-th invisible class, C_s denotes the number of visible classes and C_u denotes the number of invisible classes;
since the approach works in the inductive setting (i.e., only labeled visible-class data are used during training), the visual prototypes of the invisible classes cannot be obtained through the above steps; therefore, the visual prototypes of the invisible classes are obtained by transferring the semantic relationships between the visible classes and the invisible classes, i.e., the visual prototype matrix P_unseen of the invisible classes is obtained from the semantic relationship matrix between the visible classes and the invisible classes:
P_unseen = S^T P_seen
In a specific example, in step S5, the visual prototypes of all classes can be obtained through the above operations, and a visual prototype graph G is constructed from the relationships between the class visual prototypes, with each node representing a class, including the visible classes and the invisible classes;
specifically, the edges in the visual prototype graph G are weighted by the cosine similarity between the class visual prototypes:
B_ij = cos(P_i, P_j)
the word vector representation a_i ∈ R^300 of each class attribute is obtained with the GloVe model, the word vector representations of all attributes are stacked into a matrix T ∈ R^{|A|×300} (where |A| denotes the number of class attributes), and T is multiplied by the class semantic vector to obtain the initial representation of each node:
E_i = z_i T
where z_i denotes the semantic vector of the i-th class and E_i denotes the initial vector of the i-th node.
It should be noted that, in order to better initialize the node representations, the class attribute representation and the word vector representation of each attribute are fused, so as to fully mine and improve the semantic representation of each class and facilitate the subsequent information propagation;
in this embodiment, the visual prototypes of the invisible classes are obtained from the semantic relationships between the visible classes and the invisible classes, and a visual prototype graph is constructed from the relationships between the visual prototypes of all classes, so that the class relationships are modeled more accurately.
In a specific example, in step S6, the adjacency matrix B and the matrix E obtained by stacking the initial representations of all nodes are input into the encoder for propagation and aggregation of node information;
specifically, the encoder uses a graph convolutional neural network to aggregate and propagate information on the constructed visual prototype graph G to obtain a new latent space U, and the node update is expressed as:
H^(i) = σ(D^{-1/2} B D^{-1/2} H^{(i-1)} W^{(i-1)}), i = 1, 2
where D denotes the degree matrix of the adjacency matrix B, with D_ii = Σ_j B_ij, W^(i) denotes a parameter matrix, H^(0) = E denotes the initial input matrix of all nodes, and H^(2) = U is the output matrix of the second graph convolution layer, i.e., the matrix of the embedded vectors of all nodes updated on the visual prototype graph:
U = [u_1; u_2; ...; u_{C_s+C_u}] ∈ R^{(C_s+C_u)×d}
where u_i ∈ R^d denotes the embedded representation of node i;
the obtained latent representation matrix U is input into a decoder, which reconstructs the adjacency matrix through the inner products of the embedded vectors:
B̂ = σ(U U^T)
a reconstruction loss is constructed to minimize the difference between the reconstructed adjacency matrix B̂ represented by the node vectors and the original adjacency matrix B, so that the embedded vectors of the nodes conform to the structure of the graph; the reconstruction loss L_rec is designed as:
L_rec = ||B − B̂||_F^2
it can be appreciated that the embodiment better promotes the information interaction between semantic representation and visual representation by using the encoder realized by the graph convolution neural network to aggregate and propagate the information of the semantic-like representation on the visual prototype graph to obtain a potential space; meanwhile, the decoder reconstructs an adjacency matrix of the visual prototype graph by utilizing the potential structural representation, so that the embedded vector representation of the node accords with the structure of the graph, and the discriminant of the potential space is improved.
In a specific example, in step S7, after the latent space U is obtained, the visual features extracted from the visible-class images by the attention mechanism and the corresponding labels are embedded into the latent space, and classification is carried out in the latent space;
specifically, the visual feature v ∈ R^C of a visible-class image is embedded into the latent space through a fully connected layer f_2 with input-output dimensions C→d, and the similarity between the mapped visual feature and the latent representation of each class is computed:
Q_ij = f_2(v_i) ⊙ U_j
meanwhile, a classification loss is constructed with the cross-entropy loss function:
L_cls = −(1/N_s) Σ_i Σ_k y_ik log( exp(Q_ik) / Σ_j exp(Q_ij) )
where y_ik = 1 when the i-th sample belongs to the k-th class, otherwise y_ik = 0, and N_s denotes the number of visible-class samples;
the reconstruction loss and the classification loss are added to obtain the overall loss function, and the model parameters are optimized through gradient back-propagation;
the total loss function of the final model is:
L = L_cls + γ L_rec
where γ is a hyper-parameter.
In a specific example, in step S8, the latent representations of all classes are obtained with the trained model; given a test image x, i.e., an invisible-class image, its visual feature v is obtained through the trained attention mechanism, the feature is then mapped into the latent space, the similarity with the class latent representations is computed, and the label prediction process is finally expressed as:
c* = argmax_{j ∈ C_u} f_2(v) ⊙ U_j
in summary, compared with the existing zero sample learning method based on public space embedding, the method removes some kinds of irrelevant information by mining discriminative visual features, so that the obtained class visual prototype is more discriminative; the relation between classes can be accurately modeled without any external information in the composition process; a potential space is obtained through the structure of the self-encoder, the nodes fused with various semantic representations are subjected to information propagation and aggregation on the visual prototype graph by using the graph convolution neural network, information interaction among different modes is promoted, and the knowledge migration capability is improved.
In the zero sample image recognition experiment, the experimental results are shown in Table 1. In Table 1, the best value in each column is shown in bold, and "-" indicates that no experiment was performed on that data set.
Table 1 (presented as an image in the original publication): zero sample recognition accuracy of the proposed method and the comparison methods on the benchmark data sets described above.
GAFE and ACMR are assisted by the structure of an autoencoder. GAFE uses an encoder to learn the mapping from visual features to the semantic space, and its decoder reconstructs the original features with the learned mapping. ACMR uses two parallel variational autoencoders to extract visual latent representations and semantic latent representations respectively, and proposes an information enhancement module to strengthen the discriminative ability of the latent variables. The proposed method learns latent representations with an encoder based on a graph convolutional neural network, aggregates and propagates the class semantic representations on the visual prototype graph, and learns structured semantic embeddings obtained from different spaces, using the non-redundant and complementary information between multiple modalities to obtain a discriminative latent space. At the same time, the decoder reconstructs the adjacency matrix of the graph from the learned latent embedded representations, so that the updated latent embedded representations conform to the structure of the original graph. It can be seen from the table that the proposed method achieves a larger improvement than these methods; for example, compared with the ACMR method, the accuracy improves by 18.5%, 3.8% and 19.8% on the CUB, AWA2 and SUN data sets, respectively.
APNet, HGKT and KG-VAE establish inter-class relationships using graph structures. APNet measures the similarity between node feature representations (attribute vectors) to generate edges and uses an attention mechanism for graph propagation. HGKT first models the relationships between visible classes according to the representative nodes of the classes under a k-nearest-neighbor scheme, and after graph propagation connects each invisible class with its k nearest visible classes in the visual feature space to obtain the embedded representations of the invisible classes. KG-VAE sends the class semantic vectors into a knowledge-graph-based deep neural network module, and the nodes in the graph are aggregated and updated through a graph variational autoencoder to generate new semantic vectors. The method provided by the invention models the relationships between the visual prototypes of the classes, uses semantic representations fusing the attribute word vectors and the class attributes as node features, and after graph propagation can effectively fuse multiple kinds of modal information so as to promote information interaction between different spaces. Meanwhile, reconstructing the adjacency matrix from the latent features is used as a constraint, so that the embedded vectors of the nodes conform to the structure of the graph. The experimental results show that the proposed method helps improve the classification accuracy: compared with the KG-VAE method in the patent application with publication number CN113505701A, the accuracy improves by 13.0%, 8.3% and 1.5% on the CUB, AWA2 and SUN data sets, respectively.
Finally, compared with some feature-generation-based methods, such as f-CLSWGAN and LisGAN, the proposed method also achieves a larger performance improvement.
Example 2
To achieve the above object, see fig. 4: the embodiment also provides a zero sample image recognition system, which comprises the following modules:
a data set acquisition and definition module: used for acquiring a data set comprising visible classes and invisible classes, wherein a visible class is a class whose images appear in the training set, with images, class labels and class semantic attributes available, and an invisible class is a class whose images do not appear in the training set, with only its class semantic attributes available; the invisible-class images are used in the prediction and recognition stage;
a discriminative visual feature extraction module: used for extracting discriminative visual features from the visible-class images through an attention mechanism;
a visible-class visual prototype extraction module: used for averaging the features of all images belonging to the same visible class to obtain the visual prototype of that visible class;
an invisible-class visual prototype extraction module: used for obtaining visual prototypes of the invisible classes by transferring the semantic attribute relationships between the visible classes and the invisible classes;
a node initialization module of the visual prototype graph: used for constructing a visual prototype graph from the relationships between the class visual prototypes and initializing the node representations;
a latent space acquisition module: used for propagating and aggregating node information with a designed encoder to obtain a new latent space;
a training module: used for training the model with the visible-class images and their labels;
an invisible-class image classification module: used for predicting the invisible-class images with the trained model.
The zero sample image recognition system of the present embodiment has the same advantages as the above-mentioned zero sample image recognition method compared with the prior art, and will not be described in detail herein.
Example 3
In order to achieve the above object, the present embodiment also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to execute the steps of the zero sample image recognition method as described above.
Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program for instructing relevant hardware, where the program may be stored in a non-volatile computer readable storage medium, and where the program, when executed, may include processes in the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (10)

1. A zero sample image recognition method, comprising the steps of:
S1, acquiring a data set comprising visible classes and invisible classes, wherein a visible class is a class whose images appear in the training set, with images, class labels and class semantic attributes available, and an invisible class is a class whose images do not appear in the training set, with only its class semantic attributes available; the invisible-class images are used in the prediction and recognition stage;
S2, designing an attention mechanism to extract discriminative visual features from the visible-class images;
S3, averaging the features of all images belonging to the same visible class to obtain the visual prototype of that visible class;
S4, obtaining visual prototypes of the invisible classes by transferring the semantic attribute relationships between the visible classes and the invisible classes;
S5, constructing a visual prototype graph from the relationships between the class visual prototypes, and initializing the node representations;
S6, designing an encoder to propagate and aggregate node information so as to obtain a new latent space;
S7, training the model with the visible-class images and their labels;
S8, predicting the invisible-class images with the trained model.
2. The zero sample image recognition method according to claim 1, wherein in step S2, an attention mechanism is used to obtain a discriminative feature v for each image x so as to remove irrelevant information;
specifically, an image x is first fed into a backbone network to obtain a feature map Z ∈ R^{W×H×C}, and K region blocks are obtained from the feature map Z through a spatial attention mechanism, with the aim of finding the most discriminative feature regions in the image;
the specific operations are as follows:
first, K mask blocks M_k ∈ R^{W×H} are learned through a convolution operation:
M_k = σ(Conv(Z)), k = 1, 2, ..., K
where Conv(·) denotes a 1×1 convolution operation and σ(·) denotes an activation function;
then, the K mask blocks are reshaped to the same size as the feature map Z and multiplied element-wise with Z to obtain K region blocks of the image:
R_k(Z) = M_k ⊙ Z, k = 1, 2, ..., K
where ⊙ denotes element-wise multiplication; considering that the obtained K region blocks may contain background information, or that redundant information may exist between different region blocks, a threshold is applied to the K region blocks, and the maximum value m_max over the K mask blocks is obtained:
m_max = max_{k=1,...,K} max(M_k)
next, a hyper-parameter α is designed and a threshold τ = α·m_max is set, where α is a number between 0 and 1; when the maximum value of the k-th mask block M_k is smaller than the threshold τ, the k-th region block R_k(Z) is set to 0, and global max pooling is then applied to the K thresholded region blocks to obtain K region features r_k ∈ R^C;
finally, the K region features are concatenated and passed through a fully connected layer f_1 with input-output dimensions KC→C to obtain the discriminative visual feature v ∈ R^C corresponding to the image.
3. The zero sample image recognition method according to claim 2, wherein in step S3, after the discriminative visual feature v of each image is obtained, the features of all images belonging to the same visible class are averaged to obtain the visual prototype P_seen of that visible class:
P_seen^i = (1/n) Σ_{j=1}^{n} v_j
where n denotes the number of samples belonging to the i-th visible class and v_j denotes the discriminative visual feature of the j-th such sample.
4. The zero sample image recognition method according to claim 3, wherein in step S4, a relationship matrix S ∈ R^{C_s×C_u} between the visible classes and the invisible classes is obtained by computing the cosine similarity between the attribute vectors of the visible classes and the invisible classes:
S_ij = cos(z_i, z_j), i ∈ C_s and j ∈ C_u
where z_i denotes the attribute vector of the i-th visible class, z_j denotes the attribute vector of the j-th invisible class, C_s denotes the number of visible classes and C_u denotes the number of invisible classes;
the visual prototypes of the invisible classes are obtained by transferring the semantic relationships between the visible classes and the invisible classes, i.e., the visual prototype matrix P_unseen of the invisible classes is obtained from the semantic relationship matrix between the visible classes and the invisible classes:
P_unseen = S^T P_seen
5. The zero sample image recognition method according to claim 4, wherein in step S5, the visual prototypes of all classes are obtained through the above operations, and a visual prototype graph G is constructed from the relationships between the class visual prototypes, with each node representing a class, including the visible classes and the invisible classes;
specifically, the edges in the visual prototype graph G are weighted by the cosine similarity between the class visual prototypes:
B_ij = cos(P_i, P_j)
the word vector representation a_i ∈ R^300 of each class attribute is obtained with the GloVe model, the word vector representations of all attributes are stacked into a matrix T ∈ R^{|A|×300}, where |A| denotes the number of class attributes, and T is multiplied by the class semantic vector to obtain the initial representation of each node:
E_i = z_i T
where z_i denotes the semantic vector of the i-th class and E_i denotes the initial vector of the i-th node.
6. The zero sample image recognition method according to claim 5, wherein in step S6, the adjacency matrix B and the matrix E obtained by stacking the initial representations of all nodes are input into the encoder for propagation and aggregation of node information;
specifically, the encoder uses a graph convolutional neural network to aggregate and propagate information on the constructed visual prototype graph G to obtain a new latent space U, and the node update is expressed as:
H^(i) = σ(D^{-1/2} B D^{-1/2} H^{(i-1)} W^{(i-1)}), i = 1, 2
where D denotes the degree matrix of the adjacency matrix B, with D_ii = Σ_j B_ij, W^(i) denotes a parameter matrix, H^(0) = E denotes the initial input matrix of all nodes, and H^(2) = U is the output matrix of the second graph convolution layer, i.e., the matrix of the embedded vectors of all nodes updated on the visual prototype graph:
U = [u_1; u_2; ...; u_{C_s+C_u}] ∈ R^{(C_s+C_u)×d}
where u_i ∈ R^d denotes the embedded representation of node i;
the obtained latent representation matrix U is input into a decoder, which reconstructs the adjacency matrix through the inner products of the embedded vectors:
B̂ = σ(U U^T)
a reconstruction loss is constructed to minimize the difference between the reconstructed adjacency matrix B̂ represented by the node vectors and the original adjacency matrix B, so that the embedded vectors of the nodes conform to the structure of the graph; the reconstruction loss L_rec is designed as:
L_rec = ||B − B̂||_F^2
7. The zero sample image recognition method according to claim 6, wherein in step S7, after the latent space U is obtained, the visual features extracted from the visible-class images by the attention mechanism and the corresponding labels are embedded into the latent space, and classification is carried out in the latent space;
specifically, the visual feature v ∈ R^C of a visible-class image is embedded into the latent space through a fully connected layer f_2 with input-output dimensions C→d, and the similarity between the mapped visual feature and the latent representation of each class is computed:
Q_ij = f_2(v_i) ⊙ U_j
meanwhile, a classification loss is constructed with the cross-entropy loss function:
L_cls = −(1/N_s) Σ_i Σ_k y_ik log( exp(Q_ik) / Σ_j exp(Q_ij) )
where y_ik = 1 when the i-th sample belongs to the k-th class, otherwise y_ik = 0, and N_s denotes the number of visible-class samples;
the reconstruction loss and the classification loss are added to obtain the overall loss function, and the model parameters are optimized through gradient back-propagation;
the total loss function of the final model is:
L = L_cls + γ L_rec
where γ is a hyper-parameter.
8. The zero sample image recognition method according to claim 7, wherein in step S8, the latent representations of all classes are obtained with the trained model; given a test image x, i.e., an invisible-class image, its visual feature v is obtained through the trained attention mechanism, the feature is then mapped into the latent space, the similarity with the class latent representations is computed, and the label prediction process is finally expressed as:
c* = argmax_{j ∈ C_u} f_2(v) ⊙ U_j
9. a zero sample image recognition system, comprising the following modules:
a data set acquisition and definition module: used for acquiring a data set comprising visible classes and invisible classes, wherein a visible class is a class whose images appear in the training set, with images, class labels and class semantic attributes available, and an invisible class is a class whose images do not appear in the training set, with only its class semantic attributes available; the invisible-class images are used in the prediction and recognition stage;
a discriminative visual feature extraction module: used for extracting discriminative visual features from the visible-class images through an attention mechanism;
a visible-class visual prototype extraction module: used for averaging the features of all images belonging to the same visible class to obtain the visual prototype of that visible class;
an invisible-class visual prototype extraction module: used for obtaining visual prototypes of the invisible classes by transferring the semantic attribute relationships between the visible classes and the invisible classes;
a node initialization module of the visual prototype graph: used for constructing a visual prototype graph from the relationships between the class visual prototypes and initializing the node representations;
a latent space acquisition module: used for propagating and aggregating node information with a designed encoder to obtain a new latent space;
a training module: used for training the model with the visible-class images and their labels;
an invisible-class image classification module: used for predicting the invisible-class images with the trained model.
10. A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the zero sample image recognition method according to any one of claims 1 to 8.
CN202310303332.6A 2023-03-24 2023-03-24 Zero sample image recognition method, system and storable medium Pending CN116433969A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310303332.6A CN116433969A (en) 2023-03-24 2023-03-24 Zero sample image recognition method, system and storable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310303332.6A CN116433969A (en) 2023-03-24 2023-03-24 Zero sample image recognition method, system and storable medium

Publications (1)

Publication Number Publication Date
CN116433969A (en) 2023-07-14

Family

ID=87088305

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310303332.6A Pending CN116433969A (en) 2023-03-24 2023-03-24 Zero sample image recognition method, system and storable medium

Country Status (1)

Country Link
CN (1) CN116433969A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117333778A (en) * 2023-12-01 2024-01-02 华南理工大学 Knowledge-graph-based zero-sample plant identification method for plant science popularization education
CN117333778B (en) * 2023-12-01 2024-03-12 华南理工大学 Knowledge-graph-based zero-sample plant identification method for plant science popularization education
CN117649565A (en) * 2024-01-30 2024-03-05 安徽大学 Model training method, training device and medical image classification method
CN117649565B (en) * 2024-01-30 2024-05-28 安徽大学 Model training method, training device and medical image classification method

Similar Documents

Publication Publication Date Title
CN111461258B (en) Remote sensing image scene classification method of coupling convolution neural network and graph convolution network
CN111369572B (en) Weak supervision semantic segmentation method and device based on image restoration technology
CN116433969A (en) Zero sample image recognition method, system and storable medium
CN107330074B (en) Image retrieval method based on deep learning and Hash coding
CN110826638B (en) Zero sample image classification model based on repeated attention network and method thereof
CN112381098A (en) Semi-supervised learning method and system based on self-learning in target segmentation field
CN110930417A (en) Training method and device of image segmentation model, and image segmentation method and device
CN112699234A (en) General document identification method, system, terminal and storage medium
CN110728295B (en) Semi-supervised landform classification model training and landform graph construction method
CN111382758A (en) Training image classification model, image classification method, device, equipment and medium
CN114419672A (en) Cross-scene continuous learning pedestrian re-identification method and device based on consistency learning
CN112116599A (en) Sputum smear tubercle bacillus semantic segmentation method and system based on weak supervised learning
CN113821670A (en) Image retrieval method, device, equipment and computer readable storage medium
CN114048468A (en) Intrusion detection method, intrusion detection model training method, device and medium
CN111325766A (en) Three-dimensional edge detection method and device, storage medium and computer equipment
CN114332893A (en) Table structure identification method and device, computer equipment and storage medium
CN115905538A (en) Event multi-label classification method, device, equipment and medium based on knowledge graph
CN116363357A (en) Semi-supervised semantic segmentation method and device based on MIM and contrast learning
CN115439685A (en) Small sample image data set dividing method and computer readable storage medium
CN111985207A (en) Method and device for acquiring access control policy and electronic equipment
CN112163114A (en) Image retrieval method based on feature fusion
CN116740570A (en) Remote sensing image road extraction method, device and equipment based on mask image modeling
CN110781310A (en) Target concept graph construction method and device, computer equipment and storage medium
CN112801153B (en) Semi-supervised image classification method and system of image embedded with LBP (local binary pattern) features
CN114241516A (en) Pedestrian re-identification method and device based on pedestrian re-identification model and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination