CN116433969A - Zero sample image recognition method, system and storable medium - Google Patents

Zero sample image recognition method, system and storable medium

Info

Publication number
CN116433969A
Authority
CN
China
Prior art keywords
class
visual
invisible
visible
classes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310303332.6A
Other languages
Chinese (zh)
Inventor
赵鹏
薛惠慧
姚晟
李麟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui University
Original Assignee
Anhui University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN202310303332.6A priority Critical patent/CN116433969A/en
Publication of CN116433969A publication Critical patent/CN116433969A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a zero sample image recognition method, a system and a storage medium. The zero sample image recognition method comprises the following steps: acquiring a data set; designing an attention mechanism to extract discriminative visual features from the visible-class images; averaging the features of all images belonging to the same visible class to obtain the visual prototype of that visible class; obtaining visual prototypes of the invisible classes by transferring the semantic attribute relationships between the visible classes and the invisible classes; constructing a visual prototype graph from the relationships between the class visual prototypes and initializing the node representations; designing an encoder to propagate and aggregate node information to obtain a new latent space; training the model with the visible-class images and their labels; and predicting the invisible-class images with the trained model. The invention obtains the visual prototype representations of all classes through the attention mechanism and the semantic relationships between classes, obtains a discriminative latent space through propagation on the visual prototype graph, and carries out classification in the latent space, which improves the classification accuracy.

Description

Zero sample image recognition method, system and storable medium
Technical Field
The present invention relates to the field of image recognition technologies, and in particular to a zero sample image recognition method, a zero sample image recognition system, and a storage medium.
Background
With the development of deep neural networks, image classification has made tremendous progress in recent years. However, most successful models are based on supervised learning, which depends heavily on large numbers of labeled images for training. In many practical applications, collecting large-scale labeled data sets is expensive and time-consuming, and this problem becomes more serious for fine-grained data sets. Zero sample learning, which can recognize images from invisible classes, has therefore received increasing attention. Zero sample learning aims to recognize the invisible classes by applying knowledge learned from the visible classes. The classes whose labeled samples are given in the training phase are called visible classes; there are also unlabeled samples, and the classes containing these unlabeled samples are called invisible classes. The set of visible classes and the set of invisible classes are disjoint.
The invention patent application with publication number CN113505701A discloses a zero sample image recognition method based on a variational autoencoder combined with a knowledge graph. First, image features extracted by a convolutional neural network are encoded into low-dimensional feature vectors through a VAE and placed in a latent feature space; then, the class semantic vectors are sent to a knowledge-graph-based deep neural network module, the nodes in the graph are aggregated through a graph variational autoencoder, and the new low-dimensional semantic vectors generated after the encoding update are placed in the latent feature space; finally, the latent vectors generated by each modality are decoded with the decoder of the other modality under the condition of the same class, and the original data are reconstructed.
The method in the patent application with publication number CN113505701A constructs its graph from knowledge acquired from a knowledge graph. However, the relationships between categories contained in the knowledge graph are not accurate enough, so the relationship between the visible classes and the invisible classes cannot be modeled well, which weakens the ability to transfer knowledge; moreover, for some fine-grained data sets, such category relationships are difficult to acquire at all. Meanwhile, the visual features of an image contain abundant semantic information, but they also contain much background information and noise. When classifying images, only certain discriminative visual regions are useful for classification, especially for fine-grained images, where the differences between images of different classes are small. The method of the patent application with publication number CN113505701A extracts image features using only a pre-trained network, so its mining of discriminative visual features is insufficient.
Disclosure of Invention
In order to solve the above problems, the invention aims to provide a zero sample image recognition method, a zero sample image recognition system and a storage medium, in which the visual prototype representations of all classes are obtained through an attention mechanism and the semantic relationships between classes, a latent space is then obtained through propagation on a visual prototype graph, and classification is carried out in the latent space, which improves the classification accuracy.
In order to achieve the above purpose, the technical scheme of the invention is realized as follows:
a zero sample image recognition method comprising the steps of:
S1, acquiring a data set comprising visible classes and invisible classes, wherein a visible class is a class whose images appear in the training set, with images, class labels and class semantic attributes available, and an invisible class is a class whose images do not appear in the training set, with only its class semantic attributes available; the invisible-class images are used in the prediction and recognition stage;
S2, designing an attention mechanism to extract discriminative visual features from the visible-class images;
S3, averaging the features of all images belonging to the same visible class to obtain the visual prototype of that visible class;
S4, obtaining visual prototypes of the invisible classes by transferring the semantic attribute relationships between the visible classes and the invisible classes;
S5, constructing a visual prototype graph from the relationships between the class visual prototypes, and initializing the node representations;
S6, designing an encoder to propagate and aggregate node information so as to obtain a new latent space;
S7, training the model with the visible-class images and their labels;
S8, predicting the invisible-class images with the trained model.
Further, in step S2, an attention mechanism is used to obtain a discriminative feature v for each image x so as to remove irrelevant information;
specifically, an image x is first fed into a backbone network to obtain a feature map Z ∈ R^{W×H×C}, and K region blocks are obtained from the feature map Z through a spatial attention mechanism, with the aim of finding the most discriminative feature regions in the image;
the specific operations are as follows:
first, K mask blocks M_k ∈ R^{W×H} are learned through a convolution operation:
M_k = σ(Conv(Z)), k = 1, 2, ..., K
where Conv(·) denotes a 1×1 convolution operation and σ(·) denotes an activation function;
then, the K mask blocks are reshaped to the same size as the feature map Z and multiplied element-wise with Z to obtain K region blocks of the image:
R_k(Z) = M_k ⊙ Z, k = 1, 2, ..., K
where ⊙ denotes element-wise multiplication. Considering that the obtained K region blocks may contain background information, or that redundant information may exist between different region blocks, a threshold is applied to the K region blocks; the maximum value m_max over the K mask blocks is obtained as
m_max = max_{k=1,...,K} max(M_k)
next, a hyper-parameter α is designed and a threshold τ = α·m_max is set, where α is a number between 0 and 1; when the maximum value of the k-th mask block M_k is smaller than the threshold τ, the k-th region block R_k(Z) is set to 0; global max pooling is then applied to the K thresholded region blocks to obtain K region features r_k ∈ R^C;
finally, the K region features are concatenated and passed through a fully connected layer f_1 with input-output dimensions KC→C to obtain the discriminative visual feature v ∈ R^C corresponding to the image.
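As a concrete illustration of the spatial attention of step S2, the following PyTorch sketch extracts K region blocks, thresholds them with τ = α·m_max, and concatenates the pooled region features. The sigmoid activation, the default values of K and α, and all module and variable names are illustrative assumptions, not parameters fixed by the invention.

```python
import torch
import torch.nn as nn

class RegionAttention(nn.Module):
    """Minimal sketch of the region-attention feature extractor described in step S2."""

    def __init__(self, channels: int, num_regions: int = 4, alpha: float = 0.5):
        super().__init__()
        self.conv = nn.Conv2d(channels, num_regions, kernel_size=1)  # learns the K masks M_k
        self.fc = nn.Linear(num_regions * channels, channels)        # f_1: KC -> C
        self.alpha = alpha

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: backbone feature map of shape (B, C, W, H)
        masks = torch.sigmoid(self.conv(z))                       # M_k, shape (B, K, W, H)
        regions = masks.unsqueeze(2) * z.unsqueeze(1)              # R_k(Z) = M_k ⊙ Z, shape (B, K, C, W, H)
        m_max = masks.amax(dim=(1, 2, 3))                          # maximum over all K masks, shape (B,)
        keep = masks.amax(dim=(2, 3)) >= self.alpha * m_max[:, None]   # masks whose peak reaches tau = alpha * m_max
        regions = regions * keep[:, :, None, None, None].float()  # zero out region blocks below the threshold
        r = regions.amax(dim=(3, 4))                               # global max pooling -> r_k in R^C, shape (B, K, C)
        return self.fc(r.flatten(1))                               # discriminative visual feature v in R^C
```

In practice the input z would be the feature map from the last convolutional block of a pre-trained backbone; the choice of backbone is likewise an assumption here and is not specified by this section.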
Further, in step S3, after the discriminative visual feature v of each image is obtained, the features of all images belonging to the same visible class are averaged to obtain the visual prototype P_seen of that visible class:
P_seen^i = (1/n) Σ_{j=1}^{n} v_j
where n denotes the number of samples belonging to the i-th visible class and v_j denotes the discriminative visual feature of the j-th such sample.
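A minimal sketch of step S3, computing each visible-class prototype as the mean of the discriminative features of its samples; the function and argument names are assumptions.

```python
import torch

def class_prototypes(features: torch.Tensor, labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    # features: (N, C) discriminative features of all visible-class training images
    # labels:   (N,) class indices in [0, num_classes)
    prototypes = torch.zeros(num_classes, features.size(1))
    for c in range(num_classes):
        prototypes[c] = features[labels == c].mean(dim=0)  # P_seen^c = mean of the class-c features
    return prototypes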
Further, in step S4, a relationship matrix S ∈ R^{C_s×C_u} between the visible classes and the invisible classes is obtained by computing the cosine similarity between the attribute vectors of the visible classes and the invisible classes:
S_ij = cos(z_i, z_j), i ∈ C_s and j ∈ C_u
where z_i denotes the attribute vector of the i-th visible class, z_j denotes the attribute vector of the j-th invisible class, C_s denotes the number of visible classes and C_u denotes the number of invisible classes;
the visual prototypes of the invisible classes are obtained by transferring the semantic relationships between the visible classes and the invisible classes, i.e., the visual prototype matrix P_unseen of the invisible classes is obtained from the semantic relationship matrix between the visible classes and the invisible classes:
P_unseen = S^T P_seen
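The transfer in step S4 reduces to one cosine-similarity matrix followed by one matrix product, as in the following sketch; the tensor names are assumptions.

```python
import torch
import torch.nn.functional as F

def unseen_prototypes(attr_seen: torch.Tensor, attr_unseen: torch.Tensor,
                      p_seen: torch.Tensor) -> torch.Tensor:
    # attr_seen: (Cs, A) attribute vectors of visible classes
    # attr_unseen: (Cu, A) attribute vectors of invisible classes
    # p_seen: (Cs, C) visible-class visual prototypes
    s = F.normalize(attr_seen, dim=1) @ F.normalize(attr_unseen, dim=1).T  # S_ij = cos(z_i, z_j), shape (Cs, Cu)
    return s.T @ p_seen                                                    # P_unseen = S^T P_seen, shape (Cu, C)
```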
Further, in step S5, the visual prototypes of all classes can be obtained through the above operations, and a visual prototype graph G is constructed from the relationships between the class visual prototypes, with each node representing a class, including the visible classes and the invisible classes;
specifically, the edges in the visual prototype graph G are weighted by the cosine similarity between the class visual prototypes:
B_ij = cos(P_i, P_j)
the word vector representation a_i ∈ R^300 of each class attribute is obtained with the GloVe model, the word vector representations of all attributes are stacked into a matrix T ∈ R^{|A|×300} (where |A| denotes the number of class attributes), and T is multiplied by the class semantic vector to obtain the initial representation of each node:
E_i = z_i T
where z_i denotes the semantic vector of the i-th class and E_i denotes the initial vector of the i-th node.
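A short sketch of the graph construction and node initialization of step S5; the GloVe lookup is assumed to have been done beforehand (`attr_word_vecs` is a prebuilt (|A|, 300) matrix) and the variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def build_prototype_graph(prototypes: torch.Tensor,        # (Cs+Cu, C) class visual prototypes
                          class_attrs: torch.Tensor,       # (Cs+Cu, |A|) class semantic vectors z_i
                          attr_word_vecs: torch.Tensor):   # (|A|, 300) GloVe vectors a_i of the attributes
    p = F.normalize(prototypes, dim=1)
    adjacency = p @ p.T                        # B_ij = cos(P_i, P_j)
    node_init = class_attrs @ attr_word_vecs   # E_i = z_i T, shape (Cs+Cu, 300)
    return adjacency, node_init
```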
Further, in step S6, the adjacency matrix B and the matrix E obtained by stacking the initial representations of all nodes are input into the encoder for propagation and aggregation of node information;
specifically, the encoder uses a graph convolutional neural network to aggregate and propagate information on the constructed visual prototype graph G to obtain a new latent space U, and the node update is expressed as:
H^(i) = σ(D^{-1/2} B D^{-1/2} H^{(i-1)} W^{(i-1)}), i = 1, 2
where D denotes the degree matrix of the adjacency matrix B, with D_ii = Σ_j B_ij, W^(i) denotes a parameter matrix, H^(0) = E denotes the initial input matrix of all nodes, and H^(2) = U is the output matrix of the second graph convolution layer, i.e., the matrix of the embedded vectors of all nodes updated on the visual prototype graph:
U = [u_1; u_2; ...; u_{C_s+C_u}] ∈ R^{(C_s+C_u)×d}
where u_i ∈ R^d denotes the embedded representation of node i;
the obtained latent representation matrix U is input into a decoder, which reconstructs the adjacency matrix through the inner products of the embedded vectors:
B̂ = σ(U U^T)
a reconstruction loss is constructed to minimize the difference between the reconstructed adjacency matrix B̂ represented by the node vectors and the original adjacency matrix B, so that the embedded vectors of the nodes conform to the structure of the graph; the reconstruction loss L_rec is designed as:
L_rec = ||B − B̂||_F^2
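The encoder-decoder of step S6 can be sketched as a two-layer graph convolution followed by an inner-product decoder. The layer sizes, the sigmoid on the reconstructed adjacency and the squared-error form of the reconstruction loss are assumptions consistent with, but not dictated by, the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrototypeGraphAutoEncoder(nn.Module):
    """Sketch of the graph-convolutional encoder and inner-product decoder of step S6."""

    def __init__(self, in_dim: int = 300, hid_dim: int = 512, out_dim: int = 256):
        super().__init__()
        self.w1 = nn.Linear(in_dim, hid_dim, bias=False)   # W^(0)
        self.w2 = nn.Linear(hid_dim, out_dim, bias=False)  # W^(1)

    @staticmethod
    def normalize_adj(b: torch.Tensor) -> torch.Tensor:
        d_inv_sqrt = b.sum(dim=1).clamp(min=1e-12).pow(-0.5)   # D^{-1/2}
        return d_inv_sqrt[:, None] * b * d_inv_sqrt[None, :]   # D^{-1/2} B D^{-1/2}

    def forward(self, b: torch.Tensor, e: torch.Tensor):
        a = self.normalize_adj(b)
        h = torch.relu(self.w1(a @ e))      # H^(1)
        u = torch.relu(self.w2(a @ h))      # H^(2) = U, latent class representations
        b_rec = torch.sigmoid(u @ u.T)      # decoder: adjacency reconstructed from inner products
        loss_rec = F.mse_loss(b_rec, b)     # reconstruction loss (squared-error form assumed here)
        return u, loss_rec
```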
further, in step S7, after the potential space U is obtained, the visual features and the labels extracted from the visual images by the attention mechanism are embedded into the potential space, and classified in the potential space;
specifically, the visual characteristic v E R of the visual image C Through one layer of full connection f 2 The input-output dimension is C-d, embedded into the latent space, and the similarity of the mapped visual features and the latent representation of each class is calculated:
Q ij =f 2 (v i )⊙U j
meanwhile, the cross entropy loss function is utilized to construct classification loss:
Figure BDA0004145852140000065
wherein when the ith sample belongs to the kth class, y ik =1, otherwise, y ik =0;N s Representing the number of visible class samples;
adding the reconstruction loss and the classification loss to obtain an integral loss function, and optimizing model parameters through gradient back propagation;
the total loss function of the final model is:
L=L cls +γL rec
where γ is a hyper-parameter.
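A sketch of the training objective of step S7; the softmax form inside the cross entropy and the default value of γ are assumptions, as are the function and argument names.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def training_losses(v: torch.Tensor,        # (B, C) discriminative features of visible-class images
                    labels: torch.Tensor,   # (B,) visible-class labels
                    u: torch.Tensor,        # (num_classes, d) latent class representations U
                    f2: nn.Linear,          # fully connected layer mapping C -> d
                    loss_rec: torch.Tensor, # reconstruction loss from the graph autoencoder
                    gamma: float = 0.1) -> torch.Tensor:
    q = f2(v) @ u.T                          # Q_ij: similarity of each image to every class
    loss_cls = F.cross_entropy(q, labels)    # cross-entropy classification loss L_cls
    return loss_cls + gamma * loss_rec       # L = L_cls + gamma * L_rec
```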
Further, in step S8, the latent representations of all classes are obtained with the trained model; given a test image x, i.e., an invisible-class image, its visual feature v is obtained through the trained attention mechanism, the feature is then mapped into the latent space, the similarity with the class latent representations is computed, and the label prediction process is finally expressed as:
c* = argmax_{j ∈ C_u} f_2(v) ⊙ U_j
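Prediction in step S8 is then an argmax over the similarities to the invisible-class latent representations, for example:

```python
import torch
import torch.nn as nn

@torch.no_grad()
def predict_unseen(v: torch.Tensor,         # (B, C) features of test images
                   u_unseen: torch.Tensor,  # (Cu, d) latent representations of the invisible classes
                   f2: nn.Linear) -> torch.Tensor:
    scores = f2(v) @ u_unseen.T             # similarity to each invisible class
    return scores.argmax(dim=1)             # predicted invisible-class index
```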
in order to achieve the above object, the present invention further provides a zero sample image recognition system, including:
a data set acquisition and definition module: used for acquiring a data set comprising visible classes and invisible classes, wherein a visible class is a class whose images appear in the training set, with images, class labels and class semantic attributes available, and an invisible class is a class whose images do not appear in the training set, with only its class semantic attributes available; the invisible-class images are used in the prediction and recognition stage;
a discriminative visual feature extraction module: used for extracting discriminative visual features from the visible-class images through an attention mechanism;
a visible-class visual prototype extraction module: used for averaging the features of all images belonging to the same visible class to obtain the visual prototype of that visible class;
an invisible-class visual prototype extraction module: used for obtaining visual prototypes of the invisible classes by transferring the semantic attribute relationships between the visible classes and the invisible classes;
a node initialization module of the visual prototype graph: used for constructing a visual prototype graph from the relationships between the class visual prototypes and initializing the node representations;
a latent space acquisition module: used for propagating and aggregating node information with a designed encoder to obtain a new latent space;
a training module: used for training the model with the visible-class images and their labels;
an invisible-class image classification module: used for predicting the invisible-class images with the trained model.
In order to achieve the above object, the present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the zero sample image recognition method as described above.
The beneficial effects are as follows: the invention obtains the visual prototype representations of all classes through the attention mechanism and the semantic relationships between classes, obtains a discriminative latent space through propagation on the visual prototype graph, and carries out classification in the latent space, which improves the classification accuracy.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention. In the drawings:
FIG. 1 is a flowchart of a zero sample image recognition method according to an embodiment of the present invention;
FIG. 2 is a frame diagram of a training phase of a zero sample image recognition method according to an embodiment of the present invention;
FIG. 3 is a frame diagram of a predictive recognition stage of a zero sample image recognition method according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a zero sample image recognition system according to an embodiment of the present invention.
Detailed Description
It should be noted that, without conflict, the embodiments of the present invention and features of the embodiments may be combined with each other.
The invention will be described in detail below with reference to the drawings in connection with embodiments.
The goal of zero sample learning is to classify images that have not been seen during the training phase. Knowledge is transferred from the visible classes to the invisible classes by establishing relationships between different classes through auxiliary semantic information. When semantic information is used to transfer knowledge from the visible classes to the invisible classes in zero sample learning, one goal is to establish an association between the visual domain and the semantic domain. Typically, this association is determined by learning an embedding space in which semantic vectors and visual features interact. There are three kinds of mapping methods for learning such an embedding space: embedding based on the semantic space, embedding based on the visual space, and embedding based on a common space. The invention of CN113505701A belongs to the zero sample learning methods based on common-space embedding.
The invention of CN113505701A suffers from the following technical drawbacks. First, because visual features and semantic representations are distributed in different spaces and their dimensions differ greatly, information loss can occur when embedding into either space. Second, measuring only through a compatibility function cannot achieve good interaction between visual features and semantic representations. In addition, the visual features of images contain rich semantic information but also much class-irrelevant information, and some visually similar images may be misclassified because of such class-irrelevant information. Finally, manually defined class attributes, although accurate, may ignore some important information because of their limited dimensionality, thereby reducing the ability to transfer knowledge.
Example 1
Based on the above theoretical research analysis of the prior art, see fig. 1-3: the embodiment provides a zero sample image recognition method, which comprises the following steps:
S1, acquiring a data set comprising visible classes and invisible classes, wherein a visible class is a class whose images appear in the training set, with images, class labels and class semantic attributes available, and an invisible class is a class whose images do not appear in the training set, with only its class semantic attributes available; the invisible-class images are used in the prediction and recognition stage;
It should be noted that the public data sets used by the model of this embodiment may include: the fine-grained bird data set CUB-200-2011 Birds (CUB), the animal data set Animals with Attributes 2 (AWA2), the scene data set SUN Attribute (SUN) and the aPascal and aYahoo (aPY) data set;
the public data sets are divided: all categories of each data set are divided into disjoint visible classes and invisible classes, and the corresponding images and class semantic attributes are obtained respectively; the visible-class images and the class semantic attributes are used in the model training stage, and the invisible-class images are used for testing in the prediction and recognition stage;
the CUB data set contains 200 categories, including 150 visible classes and 50 invisible classes, with 11,788 images in total, and each category has 312-dimensional semantic attributes; the AWA2 data set has 50 categories, including 40 visible classes and 10 invisible classes, with 37,322 images in total, and each category has 85-dimensional semantic attributes; the SUN data set contains 717 categories, including 645 visible classes and 72 invisible classes, with 14,340 images in total, and each category has 102-dimensional semantic attributes; the aPY data set has 32 categories, including 20 visible classes and 12 invisible classes, with 15,339 images in total, and each category has 64-dimensional semantic attributes.
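For reference, the splits described above can be written as a small configuration structure; the numbers are copied from the description, while the key names and structure are assumptions.

```python
# Benchmark splits used in this embodiment (numbers from the description above).
DATASET_SPLITS = {
    "CUB":  {"seen": 150, "unseen": 50, "images": 11788, "attr_dim": 312},
    "AWA2": {"seen": 40,  "unseen": 10, "images": 37322, "attr_dim": 85},
    "SUN":  {"seen": 645, "unseen": 72, "images": 14340, "attr_dim": 102},
    "aPY":  {"seen": 20,  "unseen": 12, "images": 15339, "attr_dim": 64},
}
```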
S2, designing an attention mechanism to extract discriminative visual features from the visible-class images;
S3, averaging the features of all images belonging to the same visible class to obtain the visual prototype of that visible class;
S4, obtaining visual prototypes of the invisible classes by transferring the semantic attribute relationships between the visible classes and the invisible classes;
S5, constructing a visual prototype graph from the relationships between the class visual prototypes, and initializing the node representations;
S6, designing an encoder to propagate and aggregate node information so as to obtain a new latent space;
S7, training the model with the visible-class images and their labels;
S8, predicting the invisible-class images with the trained model.
In this embodiment, the visual prototype representations of all classes are obtained through the attention mechanism and the semantic relationships between classes, a discriminative latent space is then obtained through propagation on the visual prototype graph, and classification is carried out in the latent space, which improves the classification accuracy.
The zero sample image recognition method can meet the recognition requirements for various invisible-class images, reduces the manpower and material resources consumed by image annotation under supervised learning, improves the performance of the task of recognizing invisible-class images, and accelerates the research and application of zero sample classification in practical scenarios.
Unlike the method of the invention patent application with publication number CN113505701A, which only uses a pre-trained network to extract image features, this embodiment obtains the discriminative visual features of the visible-class images through an attention mechanism so as to remove some irrelevant information such as background, which makes the visual features more discriminative. Meanwhile, unlike the graph constructed from a knowledge graph in the patent application with publication number CN113505701A, this embodiment constructs a visual prototype graph from the relationships between the visual prototypes of all classes (including the visible classes and the invisible classes), so that the constructed graph structure is easier to obtain, the relationships between classes are more accurate, and the knowledge transfer capability is improved. In addition, unlike the initialization of nodes with a single semantic representation in the patent application with publication number CN113505701A, this embodiment fuses multiple semantic representations and improves the semantic representation of each class.
In a specific example, in step S2, an attention mechanism is used to obtain a discriminative feature v for each image x so as to remove irrelevant information;
specifically, an image x is first fed into a backbone network to obtain a feature map Z ∈ R^{W×H×C}, and K region blocks are obtained from the feature map Z through a spatial attention mechanism, with the aim of finding the most discriminative feature regions in the image;
the specific operations are as follows:
first, K mask blocks M_k ∈ R^{W×H} are learned through a convolution operation:
M_k = σ(Conv(Z)), k = 1, 2, ..., K
where Conv(·) denotes a 1×1 convolution operation and σ(·) denotes an activation function;
then, the K mask blocks are reshaped to the same size as the feature map Z and multiplied element-wise with Z to obtain K region blocks of the image:
R_k(Z) = M_k ⊙ Z, k = 1, 2, ..., K
where ⊙ denotes element-wise multiplication. Considering that the obtained K region blocks may contain background information, or that redundant information may exist between different region blocks, a threshold is applied to the K region blocks; the maximum value m_max over the K mask blocks is obtained as
m_max = max_{k=1,...,K} max(M_k)
next, a hyper-parameter α is designed and a threshold τ = α·m_max is set, where α is a number between 0 and 1; when the maximum value of the k-th mask block M_k is smaller than the threshold τ, the k-th region block R_k(Z) is set to 0; global max pooling is then applied to the K thresholded region blocks to obtain K region features r_k ∈ R^C;
finally, the K region features are concatenated and passed through a fully connected layer f_1 with input-output dimensions KC→C to obtain the discriminative visual feature v ∈ R^C corresponding to the image.
This embodiment proposes to acquire the visual prototype of each visible class from the discriminative visual features of the visible-class images, so that the acquired visible-class visual prototypes are more discriminative.
In a specific example, in step S3, after the discriminative visual feature v of each image is obtained, the features of all images belonging to the same visible class are averaged to obtain the visual prototype P_seen of that visible class:
P_seen^i = (1/n) Σ_{j=1}^{n} v_j
where n denotes the number of samples belonging to the i-th visible class and v_j denotes the discriminative visual feature of the j-th such sample.
In a specific example, in step S4, a relationship matrix S ∈ R^{C_s×C_u} between the visible classes and the invisible classes is obtained by computing the cosine similarity between the attribute vectors of the visible classes and the invisible classes:
S_ij = cos(z_i, z_j), i ∈ C_s and j ∈ C_u
where z_i denotes the attribute vector of the i-th visible class, z_j denotes the attribute vector of the j-th invisible class, C_s denotes the number of visible classes and C_u denotes the number of invisible classes;
since the approach works in the inductive setting (i.e., only labeled visible-class data are used during training), the visual prototypes of the invisible classes cannot be obtained through the above steps; therefore, the visual prototypes of the invisible classes are obtained by transferring the semantic relationships between the visible classes and the invisible classes, i.e., the visual prototype matrix P_unseen of the invisible classes is obtained from the semantic relationship matrix between the visible classes and the invisible classes:
P_unseen = S^T P_seen
In a specific example, in step S5, the visual prototypes of all classes can be obtained through the above operations, and a visual prototype graph G is constructed from the relationships between the class visual prototypes, with each node representing a class, including the visible classes and the invisible classes;
specifically, the edges in the visual prototype graph G are weighted by the cosine similarity between the class visual prototypes:
B_ij = cos(P_i, P_j)
the word vector representation a_i ∈ R^300 of each class attribute is obtained with the GloVe model, the word vector representations of all attributes are stacked into a matrix T ∈ R^{|A|×300} (where |A| denotes the number of class attributes), and T is multiplied by the class semantic vector to obtain the initial representation of each node:
E_i = z_i T
where z_i denotes the semantic vector of the i-th class and E_i denotes the initial vector of the i-th node.
It should be noted that, in order to better initialize the node representations, the class attribute representation and the word vector representation of each attribute are fused, so as to fully mine and improve the semantic representation of each class and facilitate the subsequent information propagation;
in this embodiment, the visual prototypes of the invisible classes are obtained from the semantic relationships between the visible classes and the invisible classes, and a visual prototype graph is constructed from the relationships between the visual prototypes of all classes, so that the class relationships are modeled more accurately.
In a specific example, in step S6, the adjacency matrix B and the matrix E obtained by stacking the initial representations of all nodes are input into the encoder for propagation and aggregation of node information;
specifically, the encoder uses a graph convolutional neural network to aggregate and propagate information on the constructed visual prototype graph G to obtain a new latent space U, and the node update is expressed as:
H^(i) = σ(D^{-1/2} B D^{-1/2} H^{(i-1)} W^{(i-1)}), i = 1, 2
where D denotes the degree matrix of the adjacency matrix B, with D_ii = Σ_j B_ij, W^(i) denotes a parameter matrix, H^(0) = E denotes the initial input matrix of all nodes, and H^(2) = U is the output matrix of the second graph convolution layer, i.e., the matrix of the embedded vectors of all nodes updated on the visual prototype graph:
U = [u_1; u_2; ...; u_{C_s+C_u}] ∈ R^{(C_s+C_u)×d}
where u_i ∈ R^d denotes the embedded representation of node i;
the obtained latent representation matrix U is input into a decoder, which reconstructs the adjacency matrix through the inner products of the embedded vectors:
B̂ = σ(U U^T)
a reconstruction loss is constructed to minimize the difference between the reconstructed adjacency matrix B̂ represented by the node vectors and the original adjacency matrix B, so that the embedded vectors of the nodes conform to the structure of the graph; the reconstruction loss L_rec is designed as:
L_rec = ||B − B̂||_F^2
it can be appreciated that the embodiment better promotes the information interaction between semantic representation and visual representation by using the encoder realized by the graph convolution neural network to aggregate and propagate the information of the semantic-like representation on the visual prototype graph to obtain a potential space; meanwhile, the decoder reconstructs an adjacency matrix of the visual prototype graph by utilizing the potential structural representation, so that the embedded vector representation of the node accords with the structure of the graph, and the discriminant of the potential space is improved.
In a specific example, in step S7, after the latent space U is obtained, the visual features extracted from the visible-class images by the attention mechanism and the corresponding labels are embedded into the latent space, and classification is carried out in the latent space;
specifically, the visual feature v ∈ R^C of a visible-class image is embedded into the latent space through a fully connected layer f_2 with input-output dimensions C→d, and the similarity between the mapped visual feature and the latent representation of each class is computed:
Q_ij = f_2(v_i) ⊙ U_j
meanwhile, a classification loss is constructed with the cross-entropy loss function:
L_cls = −(1/N_s) Σ_i Σ_k y_ik log( exp(Q_ik) / Σ_j exp(Q_ij) )
where y_ik = 1 when the i-th sample belongs to the k-th class, otherwise y_ik = 0, and N_s denotes the number of visible-class samples;
the reconstruction loss and the classification loss are added to obtain the overall loss function, and the model parameters are optimized through gradient back-propagation;
the total loss function of the final model is:
L = L_cls + γ L_rec
where γ is a hyper-parameter.
In a specific example, in step S8, the latent representations of all classes are obtained with the trained model; given a test image x, i.e., an invisible-class image, its visual feature v is obtained through the trained attention mechanism, the feature is then mapped into the latent space, the similarity with the class latent representations is computed, and the label prediction process is finally expressed as:
c* = argmax_{j ∈ C_u} f_2(v) ⊙ U_j
in summary, compared with the existing zero sample learning method based on public space embedding, the method removes some kinds of irrelevant information by mining discriminative visual features, so that the obtained class visual prototype is more discriminative; the relation between classes can be accurately modeled without any external information in the composition process; a potential space is obtained through the structure of the self-encoder, the nodes fused with various semantic representations are subjected to information propagation and aggregation on the visual prototype graph by using the graph convolution neural network, information interaction among different modes is promoted, and the knowledge migration capability is improved.
In the zero sample image recognition experiment, the experimental results are shown in Table 1. In Table 1, the best value in each column is shown in bold, and "-" indicates that no experiment was performed on that data set.
Table 1 (presented as an image in the original publication): zero sample recognition accuracy of the proposed method and the comparison methods on the benchmark data sets described above.
GAFE and ACMR are assisted by the structure of an autoencoder. GAFE uses an encoder to learn the mapping from visual features to the semantic space, and its decoder reconstructs the original features with the learned mapping. ACMR uses two parallel variational autoencoders to extract visual latent representations and semantic latent representations respectively, and proposes an information enhancement module to strengthen the discriminative ability of the latent variables. The proposed method learns latent representations with an encoder based on a graph convolutional neural network, aggregates and propagates the class semantic representations on the visual prototype graph, and learns structured semantic embeddings obtained from different spaces, using the non-redundant and complementary information between multiple modalities to obtain a discriminative latent space. At the same time, the decoder reconstructs the adjacency matrix of the graph from the learned latent embedded representations, so that the updated latent embedded representations conform to the structure of the original graph. It can be seen from the table that the proposed method achieves a larger improvement than these methods; for example, compared with the ACMR method, the accuracy improves by 18.5%, 3.8% and 19.8% on the CUB, AWA2 and SUN data sets, respectively.
APNet, HGKT and KG-VAE establish inter-class relationships using graph structures. APNet measures the similarity between node feature representations (attribute vectors) to generate edges and uses an attention mechanism for graph propagation. HGKT first models the relationships between visible classes according to the representative nodes of the classes under a k-nearest-neighbor scheme, and after graph propagation connects each invisible class with its k nearest visible classes in the visual feature space to obtain the embedded representations of the invisible classes. KG-VAE sends the class semantic vectors into a knowledge-graph-based deep neural network module, and the nodes in the graph are aggregated and updated through a graph variational autoencoder to generate new semantic vectors. The method provided by the invention models the relationships between the visual prototypes of the classes, uses semantic representations fusing the attribute word vectors and the class attributes as node features, and after graph propagation can effectively fuse multiple kinds of modal information so as to promote information interaction between different spaces. Meanwhile, reconstructing the adjacency matrix from the latent features is used as a constraint, so that the embedded vectors of the nodes conform to the structure of the graph. The experimental results show that the proposed method helps improve the classification accuracy: compared with the KG-VAE method in the patent application with publication number CN113505701A, the accuracy improves by 13.0%, 8.3% and 1.5% on the CUB, AWA2 and SUN data sets, respectively.
Finally, compared with some feature-generation-based methods, such as f-CLSWGAN and LisGAN, the proposed method also achieves a larger performance improvement.
Example 2
To achieve the above object, see fig. 4: the embodiment also provides a zero sample image recognition system, which comprises the following modules:
a data set acquisition and definition module: used for acquiring a data set comprising visible classes and invisible classes, wherein a visible class is a class whose images appear in the training set, with images, class labels and class semantic attributes available, and an invisible class is a class whose images do not appear in the training set, with only its class semantic attributes available; the invisible-class images are used in the prediction and recognition stage;
a discriminative visual feature extraction module: used for extracting discriminative visual features from the visible-class images through an attention mechanism;
a visible-class visual prototype extraction module: used for averaging the features of all images belonging to the same visible class to obtain the visual prototype of that visible class;
an invisible-class visual prototype extraction module: used for obtaining visual prototypes of the invisible classes by transferring the semantic attribute relationships between the visible classes and the invisible classes;
a node initialization module of the visual prototype graph: used for constructing a visual prototype graph from the relationships between the class visual prototypes and initializing the node representations;
a latent space acquisition module: used for propagating and aggregating node information with a designed encoder to obtain a new latent space;
a training module: used for training the model with the visible-class images and their labels;
an invisible-class image classification module: used for predicting the invisible-class images with the trained model.
The zero sample image recognition system of the present embodiment has the same advantages as the above-mentioned zero sample image recognition method compared with the prior art, and will not be described in detail herein.
Example 3
In order to achieve the above object, the present embodiment also provides a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to execute the steps of the zero sample image recognition method as described above.
Those skilled in the art will appreciate that all or part of the processes in the methods of the above embodiments may be implemented by a computer program for instructing relevant hardware, where the program may be stored in a non-volatile computer readable storage medium, and where the program, when executed, may include processes in the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. The nonvolatile memory can include Read Only Memory (ROM), programmable ROM (PROM), electrically Programmable ROM (EPROM), electrically Erasable Programmable ROM (EEPROM), or flash memory. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as Static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double Data Rate SDRAM (DDRSDRAM), enhanced SDRAM (ESDRAM), synchronous Link DRAM (SLDRAM), memory bus direct RAM (RDRAM), direct memory bus dynamic RAM (DRDRAM), and memory bus dynamic RAM (RDRAM), among others.
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (10)

1. A zero sample image recognition method, comprising the steps of:
S1, acquiring a data set comprising visible classes and invisible classes, wherein a visible class is a class whose images appear in the training set, with images, class labels and class semantic attributes available, and an invisible class is a class whose images do not appear in the training set, with only its class semantic attributes available; the invisible-class images are used in the prediction and recognition stage;
S2, designing an attention mechanism to extract discriminative visual features from the visible-class images;
S3, averaging the features of all images belonging to the same visible class to obtain the visual prototype of that visible class;
S4, obtaining visual prototypes of the invisible classes by transferring the semantic attribute relationships between the visible classes and the invisible classes;
S5, constructing a visual prototype graph from the relationships between the class visual prototypes, and initializing the node representations;
S6, designing an encoder to propagate and aggregate node information so as to obtain a new latent space;
S7, training the model with the visible-class images and their labels;
S8, predicting the invisible-class images with the trained model.
2. The zero sample image recognition method according to claim 1, wherein in step S2, an attention mechanism is used to obtain a discriminative feature v for each image x so as to remove irrelevant information;
specifically, an image x is first fed into a backbone network to obtain a feature map Z ∈ R^{W×H×C}, and K region blocks are obtained from the feature map Z through a spatial attention mechanism, with the aim of finding the most discriminative feature regions in the image;
the specific operations are as follows:
first, K mask blocks M_k ∈ R^{W×H} are learned through a convolution operation:
M_k = σ(Conv(Z)), k = 1, 2, ..., K
where Conv(·) denotes a 1×1 convolution operation and σ(·) denotes an activation function;
then, the K mask blocks are reshaped to the same size as the feature map Z and multiplied element-wise with Z to obtain K region blocks of the image:
R_k(Z) = M_k ⊙ Z, k = 1, 2, ..., K
where ⊙ denotes element-wise multiplication; considering that the obtained K region blocks may contain background information, or that redundant information may exist between different region blocks, a threshold is applied to the K region blocks, and the maximum value m_max over the K mask blocks is obtained:
m_max = max_{k=1,...,K} max(M_k)
next, a hyper-parameter α is designed and a threshold τ = α·m_max is set, where α is a number between 0 and 1; when the maximum value of the k-th mask block M_k is smaller than the threshold τ, the k-th region block R_k(Z) is set to 0, and global max pooling is then applied to the K thresholded region blocks to obtain K region features r_k ∈ R^C;
finally, the K region features are concatenated and passed through a fully connected layer f_1 with input-output dimensions KC→C to obtain the discriminative visual feature v ∈ R^C corresponding to the image.
3. The zero sample image recognition method according to claim 2, wherein in step S3, after the discriminative visual feature v of each image is obtained, the features of all images belonging to the same visible class are averaged to obtain the visual prototype P_seen of that visible class:
P_seen^i = (1/n) Σ_{j=1}^{n} v_j
where n denotes the number of samples belonging to the i-th visible class and v_j denotes the discriminative visual feature of the j-th such sample.
4. The zero sample image recognition method according to claim 3, wherein in step S4, a relationship matrix S ∈ R^{C_s×C_u} between the visible classes and the invisible classes is obtained by computing the cosine similarity between the attribute vectors of the visible classes and the invisible classes:
S_ij = cos(z_i, z_j), i ∈ C_s and j ∈ C_u
where z_i denotes the attribute vector of the i-th visible class, z_j denotes the attribute vector of the j-th invisible class, C_s denotes the number of visible classes and C_u denotes the number of invisible classes;
the visual prototypes of the invisible classes are obtained by transferring the semantic relationships between the visible classes and the invisible classes, i.e., the visual prototype matrix P_unseen of the invisible classes is obtained from the semantic relationship matrix between the visible classes and the invisible classes:
P_unseen = S^T P_seen
5. The zero sample image recognition method according to claim 4, wherein in step S5, the visual prototypes of all classes are obtained through the above operations, and a visual prototype graph G is constructed from the relationships between the class visual prototypes, with each node representing a class, including the visible classes and the invisible classes;
specifically, the edges in the visual prototype graph G are weighted by the cosine similarity between the class visual prototypes:
B_ij = cos(P_i, P_j)
the word vector representation a_i ∈ R^300 of each class attribute is obtained with the GloVe model, the word vector representations of all attributes are stacked into a matrix T ∈ R^{|A|×300}, where |A| denotes the number of class attributes, and T is multiplied by the class semantic vector to obtain the initial representation of each node:
E_i = z_i T
where z_i denotes the semantic vector of the i-th class and E_i denotes the initial vector of the i-th node.
6. The zero sample image recognition method according to claim 5, wherein in step S6, the adjacency matrix B and the matrix E obtained by stacking the initial representations of all nodes are input into the encoder for propagation and aggregation of node information;
specifically, the encoder uses a graph convolutional neural network to aggregate and propagate information on the constructed visual prototype graph G to obtain a new latent space U, and the node update is expressed as:
H^(i) = σ(D^{-1/2} B D^{-1/2} H^{(i-1)} W^{(i-1)}), i = 1, 2
where D denotes the degree matrix of the adjacency matrix B, with D_ii = Σ_j B_ij, W^(i) denotes a parameter matrix, H^(0) = E denotes the initial input matrix of all nodes, and H^(2) = U is the output matrix of the second graph convolution layer, i.e., the matrix of the embedded vectors of all nodes updated on the visual prototype graph:
U = [u_1; u_2; ...; u_{C_s+C_u}] ∈ R^{(C_s+C_u)×d}
where u_i ∈ R^d denotes the embedded representation of node i;
the obtained latent representation matrix U is input into a decoder, which reconstructs the adjacency matrix through the inner products of the embedded vectors:
B̂ = σ(U U^T)
a reconstruction loss is constructed to minimize the difference between the reconstructed adjacency matrix B̂ represented by the node vectors and the original adjacency matrix B, so that the embedded vectors of the nodes conform to the structure of the graph; the reconstruction loss L_rec is designed as:
L_rec = ||B − B̂||_F^2
7. The zero sample image recognition method according to claim 6, wherein in step S7, after the latent space U is obtained, the visual features extracted from the visible-class images by the attention mechanism and the corresponding labels are embedded into the latent space, and classification is carried out in the latent space;
specifically, the visual feature v ∈ R^C of a visible-class image is embedded into the latent space through a fully connected layer f_2 with input-output dimensions C→d, and the similarity between the mapped visual feature and the latent representation of each class is computed:
Q_ij = f_2(v_i) ⊙ U_j
meanwhile, a classification loss is constructed with the cross-entropy loss function:
L_cls = −(1/N_s) Σ_i Σ_k y_ik log( exp(Q_ik) / Σ_j exp(Q_ij) )
where y_ik = 1 when the i-th sample belongs to the k-th class, otherwise y_ik = 0, and N_s denotes the number of visible-class samples;
the reconstruction loss and the classification loss are added to obtain the overall loss function, and the model parameters are optimized through gradient back-propagation;
the total loss function of the final model is:
L = L_cls + γ L_rec
where γ is a hyper-parameter.
8. The zero sample image recognition method according to claim 7, wherein in step S8, the latent representations of all classes are obtained with the trained model; given a test image x, i.e., an invisible-class image, its visual feature v is obtained through the trained attention mechanism, the feature is then mapped into the latent space, the similarity with the class latent representations is computed, and the label prediction process is finally expressed as:
c* = argmax_{j ∈ C_u} f_2(v) ⊙ U_j
9. a zero sample image recognition system, comprising the following modules:
a data set acquisition and definition module: used for acquiring a data set comprising visible classes and invisible classes, wherein a visible class is a class whose images appear in the training set, with images, class labels and class semantic attributes available, and an invisible class is a class whose images do not appear in the training set, with only its class semantic attributes available; the invisible-class images are used in the prediction and recognition stage;
a discriminative visual feature extraction module: used for extracting discriminative visual features from the visible-class images through an attention mechanism;
a visible-class visual prototype extraction module: used for averaging the features of all images belonging to the same visible class to obtain the visual prototype of that visible class;
an invisible-class visual prototype extraction module: used for obtaining visual prototypes of the invisible classes by transferring the semantic attribute relationships between the visible classes and the invisible classes;
a node initialization module of the visual prototype graph: used for constructing a visual prototype graph from the relationships between the class visual prototypes and initializing the node representations;
a latent space acquisition module: used for propagating and aggregating node information with a designed encoder to obtain a new latent space;
a training module: used for training the model with the visible-class images and their labels;
an invisible-class image classification module: used for predicting the invisible-class images with the trained model.
10. A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the zero sample image recognition method according to any one of claims 1 to 8.
CN202310303332.6A 2023-03-24 2023-03-24 Zero sample image recognition method, system and storable medium Pending CN116433969A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310303332.6A CN116433969A (en) 2023-03-24 2023-03-24 Zero sample image recognition method, system and storable medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310303332.6A CN116433969A (en) 2023-03-24 2023-03-24 Zero sample image recognition method, system and storable medium

Publications (1)

Publication Number Publication Date
CN116433969A (en) 2023-07-14

Family

ID=87088305

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310303332.6A Pending CN116433969A (en) 2023-03-24 2023-03-24 Zero sample image recognition method, system and storable medium

Country Status (1)

Country Link
CN (1) CN116433969A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117333778A (en) * 2023-12-01 2024-01-02 华南理工大学 Knowledge-graph-based zero-sample plant identification method for plant science popularization education
CN117333778B (en) * 2023-12-01 2024-03-12 华南理工大学 Knowledge-graph-based zero-sample plant identification method for plant science popularization education
CN117649565A (en) * 2024-01-30 2024-03-05 安徽大学 Model training method, training device and medical image classification method
CN117649565B (en) * 2024-01-30 2024-05-28 安徽大学 Model training method, training device and medical image classification method

Similar Documents

Publication Publication Date Title
CN111461258B (en) Remote sensing image scene classification method of coupling convolution neural network and graph convolution network
CN111369572B (en) Weak supervision semantic segmentation method and device based on image restoration technology
CN116433969A (en) Zero sample image recognition method, system and storable medium
CN107330074B (en) Image retrieval method based on deep learning and Hash coding
CN110826638B (en) Zero sample image classification model based on repeated attention network and method thereof
CN112381098A (en) Semi-supervised learning method and system based on self-learning in target segmentation field
CN110930417A (en) Training method and device of image segmentation model, and image segmentation method and device
CN112699234A (en) General document identification method, system, terminal and storage medium
CN110728295B (en) Semi-supervised landform classification model training and landform graph construction method
CN111382758A (en) Training image classification model, image classification method, device, equipment and medium
CN114419672A (en) Cross-scene continuous learning pedestrian re-identification method and device based on consistency learning
CN112116599A (en) Sputum smear tubercle bacillus semantic segmentation method and system based on weak supervised learning
CN113821670A (en) Image retrieval method, device, equipment and computer readable storage medium
CN114048468A (en) Intrusion detection method, intrusion detection model training method, device and medium
CN111325766A (en) Three-dimensional edge detection method and device, storage medium and computer equipment
CN114332893A (en) Table structure identification method and device, computer equipment and storage medium
CN115905538A (en) Event multi-label classification method, device, equipment and medium based on knowledge graph
CN116363357A (en) Semi-supervised semantic segmentation method and device based on MIM and contrast learning
CN115439685A (en) Small sample image data set dividing method and computer readable storage medium
CN111985207A (en) Method and device for acquiring access control policy and electronic equipment
CN112163114A (en) Image retrieval method based on feature fusion
CN116740570A (en) Remote sensing image road extraction method, device and equipment based on mask image modeling
CN110781310A (en) Target concept graph construction method and device, computer equipment and storage medium
CN112801153B (en) Semi-supervised image classification method and system of image embedded with LBP (local binary pattern) features
CN114241516A (en) Pedestrian re-identification method and device based on pedestrian re-identification model and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination