CN113435531A - Zero sample image classification method and system, electronic equipment and storage medium - Google Patents

Zero sample image classification method and system, electronic equipment and storage medium

Info

Publication number
CN113435531A
CN113435531A
Authority
CN
China
Prior art keywords
feature
mask
neural network
network layer
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110769269.6A
Other languages
Chinese (zh)
Other versions
CN113435531B (en)
Inventor
李硕豪
王风雷
张军
练智超
雷军
李小飞
蒋林承
何华
李千目
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN202110769269.6A
Publication of CN113435531A
Application granted
Publication of CN113435531B
Legal status: Active (granted)

Classifications

    • G PHYSICS → G06 COMPUTING; CALCULATING OR COUNTING → G06F ELECTRIC DIGITAL DATA PROCESSING → G06F18/00 Pattern recognition → G06F18/20 Analysing → G06F18/24 Classification techniques
    • G PHYSICS → G06 COMPUTING; CALCULATING OR COUNTING → G06F ELECTRIC DIGITAL DATA PROCESSING → G06F18/00 Pattern recognition → G06F18/20 Analysing → G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation → G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS → G06 COMPUTING; CALCULATING OR COUNTING → G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS → G06N3/00 Computing arrangements based on biological models → G06N3/02 Neural networks → G06N3/04 Architecture, e.g. interconnection topology → G06N3/045 Combinations of networks


Abstract

The disclosure provides a zero-sample image classification method and system, an electronic device, and a storage medium. The method first extracts the global features of an input image, then learns the global features based on an attention mechanism to obtain a plurality of feature masks, and calculates an adaptive threshold based on the maximum among the masks' maximum mask values and a preset adaptive factor; a weighted global feature of the input image is derived from the adaptive threshold and the plurality of feature masks; compatibility scores between the weighted global feature and the semantic embedded vectors of the unseen classes are calculated, the maximum compatibility score is determined as the highest compatibility score, and the unseen class corresponding to the highest compatibility score is output as the class prediction result for the input image. Through this threshold-adaptive attention mechanism, redundant features are suppressed while feature robustness is improved, which in turn improves classification accuracy.

Description

Zero sample image classification method and system, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of image classification technologies, and in particular, to a zero-sample image classification method, system, electronic device, and storage medium.
Background
Zero-sample learning is a form of small-sample learning. The concept is inspired by human learning: a human can master a new concept from only a few examples, and can even learn a new concept with no examples at all. An infant who sees an apple in a picture book can easily recognize a real apple the next time one appears. Students can likewise learn new concepts or things from a teacher's description; for example, having learned the description that a zebra is a horse with black-and-white stripes, a student can easily recognize a zebra on first seeing one.
A zero-sample recognition model makes predictions via intermediate semantics. Depending on how the intermediate information is used, these methods can be divided into Direct Attribute Prediction (DAP) and Indirect Attribute Prediction (IAP). Direct attribute prediction maps the image to an attribute space and then completes the prediction of unknown classes through the attributes. Indirect attribute prediction first maps the image to the space of known classes, then maps it to the attribute space, and finally completes the class prediction of unknown data through the attributes. Models built with the DAP and IAP methods are highly interpretable, but both methods place attributes in an overly important position, and mislabeled attributes can have a large negative impact on their performance.
Disclosure of Invention
In view of the above, an object of the present disclosure is to provide a zero-sample image classification method, system, electronic device and storage medium.
Based on the above object, the present disclosure provides a zero sample image classification method, which is performed by a pre-trained zero sample classification model, where the zero sample classification model includes a first neural network layer, a reference neural network layer, a full convolution neural network layer, and a second neural network layer, and the method includes:
mapping attribute vectors of a plurality of unseen categories to an image feature space through the first neural network layer to obtain semantic embedded vectors of the unseen categories;
extracting global features of an input image through the reference neural network layer;
learning, by the full convolution neural network layer, the global features based on an attention mechanism to obtain a plurality of feature masks;
for each feature mask in the plurality of feature masks, determining a maximum value of respective element values of the feature mask as a maximum mask value of the feature mask to obtain a plurality of maximum mask values respectively corresponding to the plurality of feature masks;
calculating an adaptive threshold based on a maximum value of the plurality of maximum mask values and a preset adaptive factor;
deriving a weighted global feature of the input image based on the adaptive threshold and the plurality of feature masks;
for each unseen category in the plurality of unseen categories, calculating, by the second neural network layer, a compatibility score of the weighted global feature and a semantic embedding vector of the unseen category to obtain a plurality of compatibility scores corresponding to the plurality of unseen categories, respectively;
determining a maximum value among the plurality of compatibility scores as a highest compatibility score, and outputting the unseen category corresponding to the highest compatibility score as a category prediction result for the input image.
From the above description, it can be seen that the zero-sample image classification method provided by the present disclosure maps, by the first neural network layer, attribute vectors of a plurality of unseen classes to an image feature space to obtain semantic embedded vectors of the plurality of unseen classes; extracting global features of an input image through the reference neural network layer; learning, by the full convolution neural network layer, the global features based on an attention mechanism to obtain a plurality of feature masks; for each feature mask in the plurality of feature masks, determining a maximum value of respective element values of the feature mask as a maximum mask value of the feature mask to obtain a plurality of maximum mask values respectively corresponding to the plurality of feature masks; calculating an adaptive threshold based on a maximum value of the plurality of maximum mask values and a preset adaptive factor; deriving a weighted global feature of the input image based on the adaptive threshold and the plurality of feature masks; for each unseen category in the plurality of unseen categories, calculating, by the second neural network layer, a compatibility score of the weighted global feature and a semantic embedding vector of the unseen category to obtain a plurality of compatibility scores corresponding to the plurality of unseen categories, respectively; determining a maximum value among the plurality of compatibility scores as a highest compatibility score, and outputting the unseen category corresponding to the highest compatibility score as a category prediction result for the input image. Therefore, through a threshold self-adaptive attention mechanism, the robustness of the features is improved while the redundant features are suppressed, and the classification accuracy is further improved.
Drawings
To illustrate the technical solutions of the present disclosure or the related art more clearly, the drawings needed in the description of the embodiments or the related art are briefly introduced below. Obviously, the drawings described below are merely embodiments of the present disclosure; those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a zero-sample image classification method according to an embodiment of the disclosure;
FIG. 2 is a block diagram of a zero-sample classification model according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a zero-sample image classification system according to an embodiment of the disclosure;
fig. 4 is a hardware structure schematic diagram of a specific electronic device according to an embodiment of the present disclosure.
Detailed Description
For the purpose of promoting a better understanding of the objects, aspects and advantages of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
It is to be noted that technical terms or scientific terms used in the embodiments of the present disclosure should have a general meaning as understood by those having ordinary skill in the art to which the present disclosure belongs, unless otherwise defined. The use of "first," "second," and similar terms in the embodiments of the disclosure is not intended to indicate any order, quantity, or importance, but rather to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
As described in the background section, in the setting of the zero-sample classification problem, the data classes present in the test set do not appear in the training set; that is, the class set of the test data and the class set of the training data do not intersect. To classify a zero-sample target dataset, a bridge must be built between the two disjoint class sets, and this bridge is an attribute or semantic embedding space.
In the semantic embedding space, the present disclosure learns one semantic embedding or attribute embedding for each category. For example, since a tiger is much more like a cat than like a lion, the tiger and the cat lie closer together in the semantic space than the tiger and the lion. The category of an image can then be judged from the distance between the image's semantic embedding and each category's semantic embedding. In this disclosure, the distance between embeddings is measured by the compatibility between the two high-dimensional vectors.
In the field of zero-sample classification, and especially fine-grained zero-sample classification, the appearance of a shared attribute often differs greatly across classes; for example, horses and birds both have tails, but the tails look very different. Predicting from local features lets the network learn the essence of individual parts during training and allows a shared concept to be expressed through different local features. A common local-feature extraction approach is based on feature clustering, which typically clusters the feature maps of different channels and treats the cluster centers as local image features. The attention-based approach adopted by the present disclosure is instead learnable: the attention weights are obtained from the original supervision data during network training, without any additional annotation. Attention-based approaches are also easily embedded in neural networks for end-to-end training.
Similar targets often show different viewing angles and different parts in different images, so to extract features for as many viewpoints as possible, a large number of local-region extractors needs to be provided. However, a specific input image usually contains only certain local areas at a certain viewing angle, while other local areas are hidden by occlusion or perspective; the attention weights for those absent viewpoints are small, but a large number of small-weight features taken together still influence the classification result, and forcibly extracting such features usually yields little benefit and can even be counterproductive. For this reason, the present disclosure employs an adaptive-threshold approach to eliminate the influence of these unimportant regions.
In an application scenario of the present disclosure, the image classification method of the present disclosure may be implemented by a terminal device, which includes but is not limited to a desktop computer, a mobile phone, a mobile computer, a tablet computer, a media player, a smart wearable device, a Personal Digital Assistant (PDA) or other electronic devices capable of implementing the above functions.
Similarly, in another application scenario, part or all of the image classification method of the present disclosure may be used as part of another image processing method or a processing method in other fields. For example, the images may be classified by the image classification method of the present disclosure, and then the obtained classification result is used as an input sample of the next processing step.
Referring to fig. 1, a schematic flow chart of a zero-sample image classification method according to an embodiment of the present disclosure is shown, where the method is performed by a pre-trained zero-sample classification model, where the zero-sample classification model includes a first neural network layer, a reference neural network layer, a full convolution neural network layer, and a second neural network layer, and the method includes the following steps:
s101, mapping the attribute vectors of the multiple unseen classes to an image feature space through the first neural network layer to obtain semantic embedded vectors of the multiple unseen classes.
In this step, the attribute vectors of the unseen classes describe category attributes of the input images to be classified, and which attribute vectors are obtained can be determined by the application scenario of the classification task. Each unseen-class attribute vector contains a plurality of attribute elements, and mapping these elements to the image feature space through the first neural network layer of the zero-sample classification model yields the semantic embedded vectors of the unseen classes. For example, if the application scenario is classifying various kinds of horses, the obtained unseen-class attribute vectors are the attribute vectors of those horses, each with attribute elements such as coat color, height, and weight; mapping these elements to the image feature space yields the semantic embedded vector of each kind of horse. Optionally, the attribute vectors may be replaced with word vectors.
It should be noted that projecting the category attribute vectors into the attribute space makes the centrality (hubness) problem more likely to occur. The hubness problem means that, in a high-dimensional attribute space, some test classes are very likely to appear among the k nearest neighbors of much of the data even though there is no correlation between those classes. If the semantic space is used as the embedding space, features must be mapped from a high-dimensional space down into the semantic space, which shrinks the space and makes the points denser, aggravating the hubness problem. In contrast, the present disclosure uses the image feature space as the embedding space. The image feature space is obtained by inputting image samples of known classes into the zero-sample classification model during training and is itself a high-dimensional space; the category attribute vectors are mapped into it to obtain the semantic embedded vectors. This avoids aggravating the hubness problem, and because the image feature space is produced during model training, the correlations among class elements are easier to discover when the space participates in the subsequent compatibility-score computation.
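As an illustration of step S101, the sketch below maps unseen-class attribute vectors into the image feature space with a single fully connected layer. This is a minimal sketch assuming PyTorch; the dimensions (attr_dim, feat_dim) and the name SemanticEmbedding are illustrative, not specified by the disclosure.

```python
import torch
import torch.nn as nn

class SemanticEmbedding(nn.Module):
    """Maps class attribute vectors into the image feature space (step S101)."""
    def __init__(self, attr_dim: int = 85, feat_dim: int = 2048):
        super().__init__()
        # "First neural network layer": a learned linear map (fully connected layer)
        self.fc = nn.Linear(attr_dim, feat_dim)

    def forward(self, attrs: torch.Tensor) -> torch.Tensor:
        # attrs: (num_unseen_classes, attr_dim) -> (num_unseen_classes, feat_dim)
        return self.fc(attrs)
```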
And S102, extracting the global features of the input image through the reference neural network layer.
In this step, the global features of the input image are extracted through the reference neural network layer of the zero sample classification model.
S103, learning the global features based on an attention mechanism through the full convolution neural network layer to obtain a plurality of feature masks.
In this step, after obtaining the global feature, the global feature is further learned based on an attention mechanism through the full convolution neural network layer to obtain a plurality of feature masks, and each feature mask is used for extracting a local region of the input image.
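The sketch below shows one plausible realization of step S103: a fully convolutional layer that turns the global feature map into several attention masks, each attending to one local region. The 1x1 kernel, the sigmoid activation, and the mask count are assumptions; the disclosure does not fix them.

```python
import torch
import torch.nn as nn

class MaskGenerator(nn.Module):
    """Learns attention feature masks from a global feature map (step S103)."""
    def __init__(self, in_channels: int = 2048, num_masks: int = 8):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, num_masks, kernel_size=1)

    def forward(self, feat_map: torch.Tensor) -> torch.Tensor:
        # feat_map: (B, C, H, W) -> masks: (B, num_masks, H, W), values in [0, 1]
        return torch.sigmoid(self.conv(feat_map))
```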
S104, for each feature mask in the plurality of feature masks, determining a maximum value of the respective element values of the feature mask as a maximum mask value of the feature mask, so as to obtain a plurality of maximum mask values respectively corresponding to the plurality of feature masks.
In this step, each feature mask has a plurality of element values, each representing the feature weight of the position it corresponds to; the larger the element value, the larger the corresponding weight. The maximum value among all element values of each feature mask is selected as that mask's maximum mask value, so that each feature mask corresponds to one maximum mask value.
And S105, calculating an adaptive threshold value based on the maximum value in the plurality of maximum mask values and a preset adaptive factor.
In this step, a maximum value is selected from the maximum mask values corresponding to the feature masks, and an adaptive threshold is calculated according to the maximum value and a preset adaptive factor, where the preset adaptive factor may be set as needed, and optionally, the maximum value is multiplied by the preset adaptive factor to obtain the adaptive threshold.
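Steps S104 and S105 might look as follows, continuing the assumed tensor shapes from the previous sketch; the value of the preset adaptive factor (0.5 here) is purely illustrative.

```python
import torch

def adaptive_threshold(masks: torch.Tensor, factor: float = 0.5):
    # masks: (B, M, H, W); take the largest element value of each mask (step S104)
    max_vals = masks.flatten(2).max(dim=2).values                    # (B, M)
    # adaptive threshold = largest maximum mask value * preset factor (step S105)
    threshold = max_vals.max(dim=1, keepdim=True).values * factor    # (B, 1)
    return max_vals, threshold
```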
S106, obtaining the weighted global features of the input image based on the adaptive threshold and the feature masks.
In this step, a weighted global feature of the input image is obtained according to the adaptive threshold and the feature masks. The adaptive threshold is used for judging which local features are invalid features and can be eliminated or weakened, so that the characteristics of the finally obtained local features can better reflect the features of the image to be classified, redundant features are inhibited, and the robustness of the features is improved.
In order to eliminate the influence of invalid features on image classification, in some embodiments, obtaining a weighted global feature of the input image based on the adaptive threshold and the feature masks specifically includes:
adaptively thresholding the plurality of feature masks by: for each feature mask of the plurality of feature masks, resetting each element value of the feature mask to zero in response to determining that a maximum mask value of the feature mask is less than the adaptive threshold;
and weighting the global features by taking the plurality of feature masks subjected to the adaptive threshold processing as attention weights to obtain the weighted global features.
Specifically, after the feature masks corresponding to the unimportant features are selected through the adaptive threshold, each element value of the feature masks is reset to zero, that is, the weight corresponding to the feature masks reset to zero is changed to 0, so that the influence of the unimportant features on the image classification can be eliminated.
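A sketch of this zeroing variant of step S106 is shown below. The final mean pooling of the mask-weighted feature map into a single vector is an assumption, since the disclosure does not specify how the weighted global feature is aggregated.

```python
import torch

def weight_global_features(feat_map, masks, max_vals, threshold):
    # Zero every element of a mask whose maximum mask value is below the threshold
    keep = (max_vals >= threshold).float()[:, :, None, None]    # (B, M, 1, 1)
    masks = masks * keep
    # Use the surviving masks as attention weights over the global feature map,
    # then pool into one weighted global feature vector (pooling is assumed)
    weighted = feat_map.unsqueeze(1) * masks.unsqueeze(2)       # (B, M, C, H, W)
    return weighted.mean(dim=(1, 3, 4))                         # (B, C)
```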
Further, considering that the weighted global features obtained by directly removing some unimportant features are lack of smoothness and do not conform to the real features of the input image, in some embodiments, obtaining the weighted global features of the input image based on the adaptive threshold and the feature masks specifically includes:
adaptively thresholding the plurality of feature masks by: for each feature mask of the plurality of feature masks, in response to determining that a maximum mask value of the feature mask is less than the adaptive threshold, squaring each element value of the feature mask;
and weighting the global features by taking the plurality of feature masks subjected to the adaptive threshold processing as attention weights to obtain the weighted global features.
Specifically, after the feature masks corresponding to unimportant features are selected through the adaptive threshold, each element value of those masks is squared. Since each element value represents the weight of an extracted feature and is smaller than 1, squaring weakens the weight; moreover, squaring amounts to weighting the same feature twice by its original weight, rather than weakening it arbitrarily.
Further, in some embodiments, following the same procedure as above, after the feature masks corresponding to unimportant features are selected through the adaptive threshold, each element value of those masks is multiplied by a self-attenuating weight. The self-attenuating weight is the ratio of a preset weight factor to the number of feature masks, where the preset weight factor can be set as needed; the weight weakens the element values of the selected masks. The number of feature masks indicates how many local regions of the input image are attended to: when there are many local regions of interest, the attenuation of unimportant features can be increased appropriately to highlight the important features, whereas when there are few regions of interest, all features are preserved as much as possible so as to better characterize the real input image. Alternatively, unimportant features may be weakened in other ways, such as dividing each element value of the corresponding feature mask by 2 or another integer; such variations fall within the scope of the present disclosure.
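The two softer variants (squaring, and multiplication by the self-attenuating weight) could be sketched as follows, continuing the tensors from the earlier sketches; the choice of weight_factor is an assumption.

```python
import torch

def soften_masks(masks, max_vals, threshold, mode="square", weight_factor=1.0):
    weak = (max_vals < threshold)[:, :, None, None]     # (B, M, 1, 1) boolean
    if mode == "square":
        # element values are < 1, so squaring re-weights each feature by its own weight
        softened = masks ** 2
    else:
        # "decay": multiply by the self-attenuating weight = factor / number of masks
        softened = masks * (weight_factor / masks.shape[1])
    # apply the softened values only where the mask's maximum fell below the threshold
    return torch.where(weak, softened, masks)
```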
S107, for each unseen category in the unseen categories, calculating compatibility scores of the weighted global features and the semantic embedding vectors of the unseen category through the second neural network layer to obtain a plurality of compatibility scores respectively corresponding to the unseen categories.
In this step, after the weighted global features and the semantic embedded vectors of the unseen classes are obtained, compatibility scores of the weighted global features and the semantic embedded vectors of each unseen class are calculated through the second neural network layer, so that a plurality of compatibility scores corresponding to the unseen classes are obtained.
In some embodiments, calculating, by the second neural network layer, a compatibility score of the weighted global features with the semantic embedding vector of the unseen category specifically includes:
mapping, by the second neural network layer, class elements of the semantic embedding vector to compatibility class elements, the number of compatibility class elements being the same as the number of feature elements of the weighted global features;
and taking all the compatibility category elements and the feature elements as an element whole, and calculating the linear combination of the element whole based on the preset compatibility function to obtain the compatibility score of the semantic embedded vector of the weighted global feature and the unseen category.
Specifically, the second neural network layer determines the number of compatibility category elements according to the number of feature elements of the weighted global feature, making the two numbers equal. The category elements of the semantic embedded vector are then mapped into compatibility category elements, ensuring that the weighted global feature and the mapped semantic embedded vector have the same number of elements. Finally, all the compatibility category elements and feature elements are treated as one set of elements, and a linear combination of this set is computed with the preset compatibility function to obtain the compatibility score of the weighted global feature and the unseen category's semantic embedded vector.
Optionally, the compatibility score of the weighted global feature and the unseen category's semantic embedded vector is calculated by the following formula:
F(x, y; w) = w_1·x_1 + w_2·y_1 + w_3·x_2 + w_4·y_2 + … + w_m·x_n + w_{m+1}·y_n (with m + 1 = 2n)
where F(x, y; w) denotes the preset compatibility function; x denotes the weighted global feature, with x_1, x_2, …, x_n its feature elements; y denotes the semantic embedded vector, with y_1, y_2, …, y_n its compatibility category elements; and w denotes the parameters of the preset compatibility function, with w_1, w_2, …, w_{m+1} the parameters corresponding to the respective elements. Optionally, the preset compatibility function may be obtained through training, during which its parameters are revised.
Alternatively, the above formula may be replaced by the following formula:
F(x, y; w) = w_1·(x_1 + y_1) + w_2·(x_2 + y_2) + … + w_n·(x_n + y_n)
where each symbol has the same meaning as in the preceding formula and is not repeated here.
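A sketch of the compatibility computation of steps S107 and S108 is given below, using the second, element-wise form of the compatibility function; the linear layer standing in for the second neural network layer and all dimensions are assumptions.

```python
import torch
import torch.nn as nn

class Compatibility(nn.Module):
    def __init__(self, feat_dim: int = 2048, sem_dim: int = 2048):
        super().__init__()
        # "Second neural network layer": maps class elements to compatibility
        # class elements, matching the number of feature elements
        self.map = nn.Linear(sem_dim, feat_dim)
        self.w = nn.Parameter(torch.ones(feat_dim))   # learnable function parameters

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # x: (B, feat_dim) weighted global features; y: (K, sem_dim) semantic embeddings
        y = self.map(y)                                               # (K, feat_dim)
        return ((x.unsqueeze(1) + y.unsqueeze(0)) * self.w).sum(-1)   # (B, K) scores

# Step S108: the predicted category is the unseen class with the highest score,
# e.g. pred = scores.argmax(dim=1)
```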
And S108, determining the maximum value in the plurality of compatibility scores as the highest compatibility score, and outputting the unseen category corresponding to the highest compatibility score as a category prediction result of the input image.
In this step, after obtaining a plurality of compatibility scores corresponding to the plurality of unseen categories, respectively, a maximum value is selected from the plurality of compatibility scores as a highest compatibility score, and the unseen category corresponding to the highest compatibility score is output as a category prediction result for the input image. The compatibility score represents the distance of the weighted global features of the input image from the semantic embedding vector of the respective unseen category.
In the process of training the neural network model, an image, after passing through the model, yields a vector z whose dimension equals the number of classes, representing the prediction scores of the various classes; z_i denotes the prediction score that the image belongs to category i. The prediction scores are usually normalized by softmax to obtain the image's prediction probability distribution, denoted q, which satisfies the basic conditions of a probability distribution. In the prior art, the true probability distribution of a class is constructed by the following formula:
p_i = 1, if i = y; p_i = 0, otherwise
where y represents the true category label of the image. This construction views each category in isolation, ignoring the correlations among categories; it focuses only on maximizing the labeled category while treating all other categories as interchangeable, which increases the risk of model overfitting. Therefore, the present disclosure uses smoothed labels to construct the true probability distribution of the image category label. Optionally, the true probability distribution of the category is constructed by the following formula:
p_i = 1 - ε, if i = y; p_i = ε / (N - 1), otherwise
where ε is a small constant and N is the number of classes. After the true probability distribution of the image class labels is constructed by label smoothing, the present disclosure updates the parameters of the zero-sample classification model by minimizing the cross-entropy loss so that the prediction probability distribution approximates the true probability distribution.
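The smoothed true distribution and the minimized cross-entropy objective could be implemented as in the sketch below; the value of ε is illustrative.

```python
import torch
import torch.nn.functional as F

def smooth_labels(targets: torch.Tensor, num_classes: int, eps: float = 0.1) -> torch.Tensor:
    # p_i = 1 - eps for the true class, eps / (N - 1) for every other class
    p = torch.full((targets.size(0), num_classes), eps / (num_classes - 1))
    p.scatter_(1, targets.unsqueeze(1), 1.0 - eps)
    return p

def loss_fn(scores: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    p = smooth_labels(targets, scores.size(1))
    # cross entropy between the predicted softmax distribution q and the smoothed p
    return -(p * F.log_softmax(scores, dim=1)).sum(dim=1).mean()
```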
In some embodiments, to further improve classification accuracy, before the global features of the input image are extracted by the reference neural network layer, the image to be classified may be subjected to data augmentation, including one or more of image scale normalization, random cropping, numerical normalization, flipping, scaling, rotation, and tilting. Optionally, after augmentation, each augmented image is classified by the method of the present disclosure, and all classification results are then integrated to determine the category of the image to be classified.
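The listed augmentations map naturally onto torchvision transforms, as in this sketch; every concrete size, angle, and probability below is an assumption, not a value from the disclosure.

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.Resize(256),                        # image scale normalization
    transforms.RandomCrop(224),                    # random cropping
    transforms.RandomHorizontalFlip(p=0.5),        # flipping
    transforms.RandomAffine(degrees=15,            # rotation
                            shear=10,              # tilting
                            scale=(0.9, 1.1)),     # scaling
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # numerical normalization
                         std=[0.229, 0.224, 0.225]),
])
```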
Referring to fig. 2, a schematic diagram of the framework of a zero-sample classification model according to an embodiment of the present disclosure is shown, where rectangular boxes represent data transformations and rounded boxes represent the input image and intermediate data. The model has an upper and a lower data-flow branch. In the lower branch, the input image undergoes data augmentation and transformation and is then fed into the reference neural network layer of the zero-sample classification model for feature extraction. After the global features output by the reference neural network layer are obtained, local attention learning is performed on them, and the global features are weighted by the learned attention weights; during attention-weight learning, the weights are adjusted by the adaptive threshold. The purpose of local attention learning is to find discriminative local regions in the image, so the attention-learning process can be regarded as selecting local targets with discriminative regions from the global features, and the attention-weighted features can be regarded as local features of the whole image. In the upper branch, the attribute vectors or category word-embedding vectors of the unseen categories are input and mapped to the image feature space through a set of linear transformations, that is, a fully connected layer, to obtain the semantic embedded vectors of the categories. Finally, the image category is estimated by computing the compatibility score between the image's weighted global features and the semantic embedded vectors.
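Tying the two branches of Fig. 2 together, a forward pass might look like the following sketch, which reuses the SemanticEmbedding, MaskGenerator, Compatibility, adaptive_threshold, and weight_global_features pieces from the earlier sketches; the ResNet-50 backbone standing in for the reference neural network layer is an assumption.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ZeroSampleClassifier(nn.Module):
    def __init__(self, attr_dim=85, feat_dim=2048, num_masks=8, factor=0.5):
        super().__init__()
        backbone = resnet50()
        # "Reference neural network layer": backbone without its pooling/FC head
        self.backbone = nn.Sequential(*list(backbone.children())[:-2])
        self.semantic = SemanticEmbedding(attr_dim, feat_dim)   # upper branch
        self.masks = MaskGenerator(feat_dim, num_masks)         # lower branch
        self.compat = Compatibility(feat_dim, feat_dim)
        self.factor = factor

    def forward(self, images, class_attrs):
        feat = self.backbone(images)                        # (B, 2048, H, W)
        m = self.masks(feat)
        max_vals, thr = adaptive_threshold(m, self.factor)
        x = weight_global_features(feat, m, max_vals, thr)  # weighted global feature
        y = self.semantic(class_attrs)                      # semantic embedded vectors
        return self.compat(x, y)                            # compatibility scores (B, K)
```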
According to the zero-sample image classification method provided by the disclosure, attribute vectors of a plurality of unseen classes are mapped to an image feature space through the first neural network layer to obtain semantic embedded vectors of the unseen classes; global features of an input image are extracted through the reference neural network layer; the global features are learned by the full convolution neural network layer based on an attention mechanism to obtain a plurality of feature masks; for each feature mask, the maximum of its element values is determined as its maximum mask value, yielding a plurality of maximum mask values corresponding to the feature masks; an adaptive threshold is calculated based on the maximum of those maximum mask values and a preset adaptive factor; a weighted global feature of the input image is derived based on the adaptive threshold and the feature masks; for each unseen category, a compatibility score of the weighted global feature and that category's semantic embedding vector is calculated by the second neural network layer, yielding a plurality of compatibility scores; and the maximum compatibility score is determined as the highest compatibility score, with the corresponding unseen category output as the category prediction result for the input image. Therefore, through the threshold-adaptive attention mechanism, redundant features are suppressed while feature robustness is improved; meanwhile, the true sample probability distribution is constructed with smoothed class labels, so that during loss propagation the model is allowed to learn and exploit the interrelations among different classes, improving its recognition precision.
It should be noted that the method of the embodiments of the present disclosure may be executed by a single device, such as a computer or a server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the devices may only perform one or more steps of the method of the embodiments of the present disclosure, and the devices may interact with each other to complete the method.
It should be noted that the above describes some embodiments of the disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments described above and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Based on the same inventive concept, corresponding to the method of any embodiment, the disclosure also provides a zero sample image classification system. The system is executed through a pre-trained zero sample classification model, and the zero sample classification model comprises a first neural network layer, a reference neural network layer, a full convolution neural network layer and a second neural network layer.
Referring to fig. 3, the zero-sample image classification system includes:
a semantic module 301, which maps the attribute vectors of a plurality of unseen classes to an image feature space through the first neural network layer to obtain semantic embedded vectors of the plurality of unseen classes;
a global feature module 302, which extracts global features of the input image through the reference neural network layer;
an attention module 303, configured to learn, by the full convolution neural network layer, the global feature based on an attention mechanism to obtain a plurality of feature masks;
a mask determining module 304, configured to determine, for each feature mask of the plurality of feature masks, a maximum value of the respective element values of the feature mask as a maximum mask value of the feature mask, so as to obtain a plurality of maximum mask values respectively corresponding to the plurality of feature masks;
an adaptive threshold module 305 for calculating an adaptive threshold based on a maximum value among the plurality of maximum mask values and a preset adaptive factor;
a weighting module 306 for deriving a weighted global feature of the input image based on the adaptive threshold and the plurality of feature masks;
a compatibility module 307, for each unseen category in the unseen categories, calculating, by the second neural network layer, a compatibility score of the weighted global feature and a semantic embedding vector of the unseen category to obtain a plurality of compatibility scores corresponding to the unseen categories, respectively;
the prediction module 308 determines a maximum value of the plurality of compatibility scores as a highest compatibility score, and outputs the unseen category corresponding to the highest compatibility score as a category prediction result for the input image.
For convenience of description, the above system is described as being divided into modules by function, each described separately. Optionally, any two or more of the modules may be combined into one module that implements their functions together. Of course, the functionality of the various modules may be implemented in the same one or more pieces of software and/or hardware when implementing the present disclosure.
The apparatus of the foregoing embodiment is used to implement the corresponding zero-sample image classification method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to any of the above-mentioned embodiments, the present disclosure further provides an electronic device, including a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the zero-sample image classification method according to any of the above-mentioned embodiments when executing the program.
Fig. 4 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the electronic device may include: a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. Wherein the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively coupled to each other within the device via bus 1050.
The processor 1010 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits, and is configured to execute related programs to implement the technical solutions provided in the embodiments of the present disclosure.
The Memory 1020 may be implemented in the form of a ROM (Read Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs, and when the technical solution provided by the embodiments of the present specification is implemented by software or firmware, the relevant program codes are stored in the memory 1020 and called to be executed by the processor 1010.
The input/output interface 1030 is used for connecting an input/output module to input and output information. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 1040 is used for connecting a communication module (not shown in the drawings) to implement communication interaction between the present apparatus and other apparatuses. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
Bus 1050 includes a path that transfers information between various components of the device, such as processor 1010, memory 1020, input/output interface 1030, and communication interface 1040.
It should be noted that although the above-mentioned device only shows the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040 and the bus 1050, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
The electronic device of the above embodiment is used to implement the corresponding zero-sample image classification method in any of the foregoing embodiments, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Based on the same inventive concept, corresponding to any of the above-described embodiment methods, the present disclosure also provides a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the zero-sample image classification method according to any of the above embodiments.
Computer-readable media of the present embodiments, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The computer instructions stored in the storage medium of the above embodiment are used to enable the computer to execute the zero sample image classification method according to any of the above embodiments, and have the beneficial effects of the corresponding method embodiments, which are not described herein again.
It should be noted that the embodiments of the present disclosure can be further described in the following ways:
a zero-sample image classification method is executed by a pre-trained zero-sample classification model, wherein the zero-sample classification model comprises a first neural network layer, a reference neural network layer, a full convolution neural network layer and a second neural network layer, and the method comprises the following steps:
mapping attribute vectors of a plurality of unseen categories to an image feature space through the first neural network layer to obtain semantic embedded vectors of the unseen categories;
extracting global features of an input image through the reference neural network layer;
learning, by the full convolution neural network layer, the global features based on an attention mechanism to obtain a plurality of feature masks;
for each feature mask in the plurality of feature masks, determining a maximum value of respective element values of the feature mask as a maximum mask value of the feature mask to obtain a plurality of maximum mask values respectively corresponding to the plurality of feature masks;
calculating an adaptive threshold based on a maximum value of the plurality of maximum mask values and a preset adaptive factor;
deriving a weighted global feature of the input image based on the adaptive threshold and the plurality of feature masks;
for each unseen category in the plurality of unseen categories, calculating, by the second neural network layer, a compatibility score of the weighted global feature and a semantic embedding vector of the unseen category to obtain a plurality of compatibility scores corresponding to the plurality of unseen categories, respectively;
determining a maximum value among the plurality of compatibility scores as a highest compatibility score, and outputting the unseen category corresponding to the highest compatibility score as a category prediction result for the input image.
Optionally, obtaining a weighted global feature of the input image based on the adaptive threshold and the feature masks specifically includes:
adaptively thresholding the plurality of feature masks by: for each feature mask of the plurality of feature masks, resetting each element value of the feature mask to zero in response to determining that a maximum mask value of the feature mask is less than the adaptive threshold;
and weighting the global features by taking the plurality of feature masks subjected to the adaptive threshold processing as attention weights to obtain the weighted global features.
Optionally, obtaining a weighted global feature of the input image based on the adaptive threshold and the feature masks specifically includes:
adaptively thresholding the plurality of feature masks by: for each feature mask of the plurality of feature masks, in response to determining that a maximum mask value of the feature mask is less than the adaptive threshold, squaring each element value of the feature mask;
and weighting the global features by taking the plurality of feature masks subjected to the adaptive threshold processing as attention weights to obtain the weighted global features.
Optionally, obtaining a weighted global feature of the input image based on the adaptive threshold and the feature masks specifically includes:
adaptively thresholding the plurality of feature masks by: for each feature mask of the plurality of feature masks, in response to determining that a maximum mask value of the feature mask is less than the adaptive threshold, multiplying each element value of the feature mask by a self-attenuating weight, the self-attenuating weight being a ratio of a preset weight factor to a number of the plurality of feature masks;
and weighting the global features by taking the plurality of feature masks subjected to the adaptive threshold processing as attention weights to obtain the weighted global features.
Optionally, calculating, by the second neural network layer, a compatibility score of the weighted global feature and the semantic embedded vector of the unseen category includes:
mapping, by the second neural network layer, class elements of the semantic embedding vector to compatibility class elements, the number of compatibility class elements being the same as the number of feature elements of the weighted global features;
and taking all the compatibility category elements and the feature elements as an element whole, and calculating the linear combination of the element whole based on the preset compatibility function to obtain the compatibility score of the semantic embedded vector of the weighted global feature and the unseen category.
Optionally, when the zero sample classification model is trained, a true probability distribution of an image class label is constructed based on a smooth label, and a parameter of the zero sample classification model is updated by using a minimum cross entropy loss, so that a prediction probability distribution is close to the true probability distribution.
Optionally, before extracting the global features of the input image through the reference neural network layer, the method further includes:
and carrying out data augmentation on the image to be classified, wherein the data augmentation comprises one or more of image scale normalization, image random cutting, image numerical value normalization, image turning, image scaling, image rotation and image inclination.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the idea of the present disclosure, also technical features in the above embodiments or in different embodiments may be combined, steps may be implemented in any order, and there are many other variations of the different aspects of the embodiments of the present disclosure as described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown in the provided figures for simplicity of illustration and discussion, and so as not to obscure the embodiments of the disclosure. Furthermore, devices may be shown in block diagram form in order to avoid obscuring embodiments of the present disclosure, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the embodiments of the present disclosure are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the disclosure, it should be apparent to one skilled in the art that the embodiments of the disclosure can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present disclosure has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.
The disclosed embodiments are intended to embrace all such alternatives, modifications and variances which fall within the broad scope of the appended claims. Therefore, any omissions, modifications, equivalents, improvements, and the like that may be made within the spirit and principles of the embodiments of the disclosure are intended to be included within the scope of the disclosure.

Claims (10)

1. A zero-sample image classification method is executed by a pre-trained zero-sample classification model, wherein the zero-sample classification model comprises a first neural network layer, a reference neural network layer, a full convolution neural network layer and a second neural network layer, and the method comprises the following steps:
mapping attribute vectors of a plurality of unseen categories to an image feature space through the first neural network layer to obtain semantic embedded vectors of the unseen categories;
extracting global features of an input image through the reference neural network layer;
learning, by the full convolution neural network layer, the global features based on an attention mechanism to obtain a plurality of feature masks;
for each feature mask in the plurality of feature masks, determining a maximum value of respective element values of the feature mask as a maximum mask value of the feature mask to obtain a plurality of maximum mask values respectively corresponding to the plurality of feature masks;
calculating an adaptive threshold based on a maximum value of the plurality of maximum mask values and a preset adaptive factor;
deriving a weighted global feature of the input image based on the adaptive threshold and the plurality of feature masks;
for each unseen category in the plurality of unseen categories, calculating, by the second neural network layer, a compatibility score of the weighted global feature and a semantic embedding vector of the unseen category to obtain a plurality of compatibility scores corresponding to the plurality of unseen categories, respectively;
determining a maximum value among the plurality of compatibility scores as a highest compatibility score, and outputting the unseen category corresponding to the highest compatibility score as a category prediction result for the input image.
2. The method of claim 1, wherein deriving the weighted global features of the input image based on the adaptive threshold and the plurality of feature masks comprises:
adaptively thresholding the plurality of feature masks by: for each feature mask of the plurality of feature masks, resetting each element value of the feature mask to zero in response to determining that a maximum mask value of the feature mask is less than the adaptive threshold;
and weighting the global features by taking the plurality of feature masks subjected to the adaptive threshold processing as attention weights to obtain the weighted global features.
3. The method of claim 1, wherein deriving the weighted global features of the input image based on the adaptive threshold and the plurality of feature masks comprises:
adaptively thresholding the plurality of feature masks by: for each feature mask of the plurality of feature masks, in response to determining that a maximum mask value of the feature mask is less than the adaptive threshold, squaring each element value of the feature mask;
and weighting the global features by taking the plurality of feature masks subjected to the adaptive threshold processing as attention weights to obtain the weighted global features.
4. The method of claim 1, wherein deriving the weighted global features of the input image based on the adaptive threshold and the plurality of feature masks comprises:
adaptively thresholding the plurality of feature masks by: for each feature mask of the plurality of feature masks, multiplying each element value of the feature mask by a self-attenuating weight in response to determining that the maximum mask value of the feature mask is less than the adaptive threshold, the self-attenuating weight being the ratio of a preset weight factor to the number of feature masks in the plurality of feature masks;
weighting the global features, using the adaptively thresholded feature masks as attention weights, to obtain the weighted global features.
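A sketch of the self-attenuating variant in the same scaffold; the default weight factor is an assumption:

    import numpy as np

    def decay_weak_masks(masks, max_vals, threshold, weight_factor=1.0):
        # The self-attenuating weight is the preset weight factor divided by the
        # number of feature masks, so weak masks shrink as more masks are learned.
        decay = weight_factor / len(masks)
        weak = max_vals[:, None, None] < threshold
        return np.where(weak, masks * decay, masks)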
5. The method of claim 1, wherein calculating, by the second neural network layer, the compatibility score of the weighted global feature and the semantic embedding vector of the unseen category comprises:
mapping, by the second neural network layer, the category elements of the semantic embedding vector to compatibility category elements, the number of compatibility category elements being the same as the number of feature elements of the weighted global feature;
taking all of the compatibility category elements and the feature elements together as one element whole, and calculating a linear combination of the element whole based on a preset compatibility function, to obtain the compatibility score of the weighted global feature and the semantic embedding vector of the unseen category.
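Read as code, claim 5 admits the following hedged sketch, assuming the second neural network layer is a single linear map W and the preset compatibility function is a learned coefficient vector v over the concatenated elements; both assumptions go beyond what the claim fixes:

    import numpy as np

    def compatibility_score(weighted_feat, semantic_embed, W, v):
        # Map the D category elements of the semantic embedding vector to C
        # compatibility category elements, C being the feature count (W: (C, D)).
        compat = W @ semantic_embed
        # Treat compatibility elements and feature elements as one element whole
        # and take a linear combination of it; v holds the assumed coefficients.
        combined = np.concatenate([compat, weighted_feat])   # (2C,)
        return float(v @ combined)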
6. The method of claim 1, wherein, during training of the zero-sample classification model, a true probability distribution over image category labels is constructed using label smoothing, and parameters of the zero-sample classification model are updated by minimizing a cross-entropy loss so that the predicted probability distribution approximates the true probability distribution.
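A minimal sketch of the smoothed-label cross-entropy in claim 6; the smoothing rate and the even off-class allocation are assumptions, as the claim fixes neither:

    import numpy as np

    def smoothed_cross_entropy(logits, label, eps=0.1):
        # True distribution under label smoothing: 1 - eps on the labeled
        # category, eps spread evenly over the remaining categories (assumed).
        num_classes = len(logits)
        target = np.full(num_classes, eps / (num_classes - 1))
        target[label] = 1.0 - eps
        # Cross entropy between the smoothed target and the predicted
        # distribution, using a numerically stable log-softmax.
        m = logits.max()
        log_probs = logits - (m + np.log(np.exp(logits - m).sum()))
        return float(-(target * log_probs).sum())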
7. The method of claim 1, wherein, prior to extracting the global features of the input image through the reference neural network layer, the method further comprises:
performing data augmentation on the input image, the data augmentation comprising one or more of image scale normalization, random image cropping, image value normalization, image flipping, image scaling, image rotation, and image tilting.
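One possible composition of the augmentations listed in claim 7, sketched with torchvision; every size, probability, angle, and normalization statistic below is an assumption rather than a claimed value:

    from torchvision import transforms

    augment = transforms.Compose([
        transforms.Resize(256),                        # image scale normalization / scaling
        transforms.RandomCrop(224),                    # random image cropping
        transforms.RandomHorizontalFlip(p=0.5),        # image flipping
        transforms.RandomRotation(degrees=15),         # image rotation
        transforms.RandomAffine(degrees=0, shear=10),  # image tilting via shear
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406],
                             std=[0.229, 0.224, 0.225]),  # image value normalization
    ])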
8. A zero-sample image classification system implemented by a pre-trained zero-sample classification model, the zero-sample classification model comprising a first neural network layer, a reference neural network layer, a fully convolutional neural network layer, and a second neural network layer, the system comprising:
a semantic module for mapping attribute vectors of a plurality of unseen categories to an image feature space through the first neural network layer, to obtain semantic embedding vectors of the plurality of unseen categories;
a global feature module for extracting global features of an input image through the reference neural network layer;
an attention module for learning, through the fully convolutional neural network layer, a plurality of feature masks from the global features based on an attention mechanism;
a mask determining module for determining, for each of the feature masks, the maximum of that feature mask's element values as the maximum mask value of the feature mask, so as to obtain a plurality of maximum mask values corresponding respectively to the feature masks;
an adaptive threshold module for calculating an adaptive threshold based on the maximum of the plurality of maximum mask values and a preset adaptive factor;
a weighting module for obtaining a weighted global feature of the input image based on the adaptive threshold and the plurality of feature masks;
a compatibility module for calculating, through the second neural network layer and for each unseen category in the plurality of unseen categories, a compatibility score of the weighted global feature and the semantic embedding vector of that unseen category, to obtain a plurality of compatibility scores corresponding respectively to the plurality of unseen categories;
a prediction module for determining the maximum of the plurality of compatibility scores as the highest compatibility score, and outputting the unseen category corresponding to the highest compatibility score as the category prediction result for the input image.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1 to 7 when executing the program.
10. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any of claims 1 to 7.
CN202110769269.6A 2021-07-07 2021-07-07 Zero sample image classification method and system, electronic equipment and storage medium Active CN113435531B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110769269.6A CN113435531B (en) 2021-07-07 2021-07-07 Zero sample image classification method and system, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113435531A 2021-09-24
CN113435531B CN113435531B (en) 2022-06-21

Family

ID=77759561

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110769269.6A Active CN113435531B (en) 2021-07-07 2021-07-07 Zero sample image classification method and system, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113435531B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107679556A * 2017-09-18 2018-02-09 天津大学 Zero-sample image classification method based on variational autoencoder
US10140544B1 * 2018-04-02 2018-11-27 12 Sigma Technologies Enhanced convolutional neural network for image segmentation
CN110647897A * 2018-06-26 2020-01-03 广东工业大学 Zero-sample image classification and recognition method based on a multi-part attention mechanism
CN109447115A * 2018-09-25 2019-03-08 天津大学 Fine-grained zero-sample classification method based on a multi-layer semantically supervised attention model
CN110163258A * 2019-04-24 2019-08-23 浙江大学 Zero-sample learning method and system based on a semantic-attribute attention reassignment mechanism
CN110334705A * 2019-06-25 2019-10-15 华中科技大学 Language identification method for scene text images combining global and local information
CN110334765A * 2019-07-05 2019-10-15 西安电子科技大学 Remote sensing image classification method based on multi-scale deep learning with an attention mechanism
CN111985538A * 2020-07-27 2020-11-24 成都考拉悠然科技有限公司 Few-shot image classification model and method based on a semantics-assisted attention mechanism
CN112418351A * 2020-12-11 2021-02-26 天津大学 Zero-sample learning image classification method based on global and local context awareness

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113837205A (en) * 2021-09-28 2021-12-24 北京有竹居网络技术有限公司 Method, apparatus, device and medium for image feature representation generation
CN116342985A (en) * 2023-02-14 2023-06-27 中南大学 Robust feature learning method for dynamic intelligent container
CN116342985B (en) * 2023-02-14 2023-09-12 中南大学 Robust feature learning method for dynamic intelligent container

Also Published As

Publication number Publication date
CN113435531B (en) 2022-06-21

Similar Documents

Publication Publication Date Title
CN109522942B (en) Image classification method and device, terminal equipment and storage medium
KR102170199B1 (en) Classify input examples using comparison sets
CN112949678B Deep learning model adversarial example generation method, system, equipment and storage medium
CN113298096B (en) Method, system, electronic device and storage medium for training zero sample classification model
CN113297975A (en) Method and device for identifying table structure, storage medium and electronic equipment
CN112016315B (en) Model training method, text recognition method, model training device, text recognition device, electronic equipment and storage medium
CN111667066A (en) Network model training and character recognition method and device and electronic equipment
CN113435531B (en) Zero sample image classification method and system, electronic equipment and storage medium
CN114549913B (en) Semantic segmentation method and device, computer equipment and storage medium
CN112528637A (en) Text processing model training method and device, computer equipment and storage medium
CN111694954B (en) Image classification method and device and electronic equipment
CN112749737A (en) Image classification method and device, electronic equipment and storage medium
CN113255328A (en) Language model training method and application method
CN114358203A (en) Training method and device for image description sentence generation module and electronic equipment
CN114359592A (en) Model training and image processing method, device, equipment and storage medium
CN111476144B (en) Pedestrian attribute identification model determining method and device and computer readable storage medium
CN113723077A Sentence vector generation method and device based on bidirectional representation model and computer equipment
CN110889290B (en) Text encoding method and apparatus, text encoding validity checking method and apparatus
CN114445716B (en) Key point detection method, key point detection device, computer device, medium, and program product
CN113378866B (en) Image classification method, system, storage medium and electronic device
US20070223821A1 (en) Pattern recognition method
CN115374766A (en) Text punctuation recovery method and related equipment
CN115204301A (en) Video text matching model training method and device and video text matching method and device
CN114358011A (en) Named entity extraction method and device and electronic equipment
CN114117037A (en) Intention recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant