CN117523278A

CN117523278A - Semantic attention element learning method based on Bayesian estimation

Info

Publication number: CN117523278A
Application number: CN202311473864.0A
Authority: CN
Inventors: 卫保国; 张悦; 王新宇; 苏月童; 李旭; 李立欣
Original assignee: Northwestern Polytechnical University
Current assignee: Northwestern Polytechnical University
Priority date: 2023-11-07
Filing date: 2023-11-07
Publication date: 2024-02-06

Abstract

The invention relates to a semantic attention element learning method based on Bayesian estimation, which is characterized in that a mean class prototype and a base class prototype are combined to obtain a visual prototype, a class label and a visual prototype of a support set are input to a semantic attention module to obtain a semantic attention prototype, a correction prototype is obtained based on query set data and Bayesian estimation, and the correction prototype and the characteristics of all samples in the query set are input to a meta-training classifier to obtain the prediction probability of a query set image. The correction model is obtained by correcting the original semantic attention prototype by using the visual prototype through Bayesian estimation, the correction model can combine the advantages of the visual prototype and the semantic attention prototype, the gain of the introduced label information when the sample is sparse is reserved, the semantic deviation when the sample is increased is reduced by using the action of the visual prototype, the obtained correction prototype is more similar to the true category prototype, and the generalization performance of the model obtained by the meta-learning method is improved.

Description

Semantic attention element learning method based on Bayesian estimation

Technical Field

The invention relates to the technical field of image classification, in particular to a semantic attention element learning method based on Bayesian estimation.

Background

With the wide application of large-scale data sets and the rapid development of deep convolution architecture, supervised learning plays an important role in multiple fields of computer vision, speech recognition, machine translation and the like. However, in the field of abnormality detection, medical image processing, and the like, which lack large-scale available data, it is difficult to obtain enough training samples to support the supervised learning method, resulting in a decrease in performance of the supervised learning model. In contrast, small sample Learning (FSL) aims to simulate the ability of humans to learn from small amounts of data, whereas meta-Learning is a method for solving small sample Learning problems. Meta-learning enables models to adapt and learn quickly in the face of new tasks by learning over a large number of different tasks, from which common knowledge or patterns are extracted.

Meta-learning is a method for learning how to learn by machine, and is hoped to enable a model to acquire the capacity of adjusting super-parameters so that the model can quickly learn new tasks on the basis of acquiring existing knowledge. In meta learning, a support set (support set) and a query set (query set) are used to describe the data division manner of the task. In machine learning, the training set, validation set and test set are considered primarily around the perspective of the model, while the support set and query set in meta-learning are considered around the task. In the meta-learning training phase, for each task, a task is selected from the training set and split into a support set and a query set. Parameters of the meta-learning model are trained using the support set to enable it to learn shared features or patterns of tasks. The set of queries is used to evaluate the performance of the model on the current task, calculate the loss and update the model parameters. In the meta-learning test phase, a task is selected from the test set and split into a support set and a query set. Parameters of the meta learning model are updated using the support set. The performance of the model on the current task is evaluated using the set of queries to test the generalization ability of the model on the new task.

Semantic knowledge refers to descriptive information of an image that can be utilized, such as category labels, text information of the image, and the like. Thanks to the development of the natural language processing field, we can obtain the characteristics of text information from a large model such as GPT, obtain category tag embedded characteristics from a pre-trained word embedded model such as GloVe, and learn the idea of linking semantic knowledge with visual knowledge by using zero sample, and more small sample learning methods introduce semantic knowledge to make up for atypical of single visual characteristics. For example, the semantic guided attention mechanism (SEGA) in the prior art uses semantic knowledge to guide visual perception from top to bottom, which visual features should be noted when classifying categories, and achieves excellent results on the task of image classification.

However, in the method of combining the semantics and the vision in the prior art, semantic deviation is brought when the number of the semantic attention prototypes is increased, so that the similarity between the feature distribution of the query sample and the feature distribution of the corresponding class prototypes is reduced, and the prototype learning of the new class is influenced.

Disclosure of Invention

Based on the above, it is necessary to provide a learning method for semantic attention elements based on bayesian estimation, which combines the advantages of visual prototypes and semantic attention prototypes through bayesian estimation, retains the gain of introducing label information when samples are sparse, and reduces semantic deviation when samples are added by utilizing the action of the visual prototypes.

The invention provides a Bayesian estimation-based semantic attention element learning method, which comprises the following steps of:

constructing a support set and a query set of a task of meta training according to the base class data set;

pre-training the feature extractor and the pre-training classifier by using the base class data set to obtain a pre-trained classification weight matrix of the feature extractor and the base class;

respectively inputting the support set and the query set into a pre-trained feature extractor to obtain the features of all samples in the support set and the features of all samples in the query set;

calculating the characteristic average value of all samples of each category in the support set to obtain an average value category prototype;

extracting a classification weight vector of a pseudo base class from a classification weight matrix of the base class, and guiding by using the pseudo base class weight according to a Dynamic FSL method to obtain a base class prototype;

combining the mean class prototype and the base class prototype to obtain a visual prototype;

inputting the category labels and the visual prototypes of the support set into a semantic attention module to obtain a semantic attention prototype;

obtaining a correction prototype based on the query set data and the Bayesian estimation, wherein the distribution of the vision prototype is defined as the prior distribution of the Bayesian estimation, the distribution of the semantic attention prototype is a likelihood function of the Bayesian estimation, and the distribution of the correction prototype is a posterior distribution of the Bayesian estimation;

inputting the characteristics of all samples in the correction prototype and the query set into a meta-training classifier to obtain the prediction probability of the query set image;

calculating cross entropy loss by using the distribution of the prediction probability and the real labels of the query set;

updating semantic attention module parameters, meta-training classifier parameters and pseudo-base class weight guidance according to the cross entropy loss to obtain guide parameters of the base class prototype and merging parameters of the mean class prototype and the base class prototype;

repeating the steps until the meta training model converges and meta learning is completed.

In one embodiment, the meta training adopts an N-way K-shot (Q1+Q2) -query scenario training mode, and each scenario randomly selects N categories from a base class data set to serve as pseudo-new categories, and other categories except the pseudo-new categories serve as pseudo-base categories;

constructing a support set of a task of meta training according to the basic class data set, and forming the support set by extracting K samples from each class in the pseudo new class;

the query set of one task trained according to the basic class data set construction element is that Q1 samples are extracted from each class in the pseudo new class, and NxQ 2 samples are randomly extracted from the pseudo basic class, so that the query set is formed.

In one embodiment, pre-training the feature extractor and the pre-training classifier using the base class data set includes:

inputting the base class data set into a feature extraction module to obtain features of the base class data;

inputting the characteristics of the base class data into a pre-training classifier to obtain pre-training prediction probability;

calculating a pre-training cross entropy loss by using the distribution of the pre-training prediction probability and the real labels of the base class data set;

updating the feature extractor and the pre-training classifier according to the cross entropy loss;

and repeating the steps until the pre-training model converges, and obtaining a pre-trained feature extractor and a classification weight matrix of the base class after the pre-training is completed.

In one embodiment, obtaining a correction prototype based on the query set data and the bayesian estimate includes:

estimating sample means and sample covariance of the visual prototype and the semantic attention prototype, respectively, using the query set data, the visual prototype and the semantic attention prototype;

calculating gaussian distribution parameters of the correction prototype by using sample mean values and sample covariance of the visual prototype and the semantic attention prototype;

the final corrected prototype is characterized using the mean of the gaussian parameters of the corrected prototype.

In one embodiment, the Bayesian estimation is

In the method, in the process of the invention,to correct the distribution of prototypes->For the distribution of visual prototypes, < >>Is the distribution of semantic attention prototypes.

In one embodiment, the calculation formula for estimating the sample mean and sample covariance of the visual prototype and the semantic attention prototype, respectively, using the query set data, the visual prototype, and the semantic attention prototype is

Where μ is the sample mean, Σ is the sample covariance, |S _c I is the number of images belonging to category c in the support set, f _θ (x) For feature embedding of query image x obtained by feature extraction, P (y=c|x) is the probability that query image x is predicted to be category c, Q is a query set sample, prototype is a visual prototype or semantic attention prototype, S _c Is a support set sample of category c.

In one embodiment, the probability P (y=c|x) that the query image x predicts as category c is calculated as

Wherein c' is a class number, t is a classifier constraint parameter, t is greater than 1 or t is a learnable scalar parameter, when calculating a distribution parameter of a semantic attention prototype,modified to->

In one embodiment, the calculation of the gaussian distribution parameters of the corrected prototype using the sample mean and sample covariance of the visual prototype and semantic attention prototype is given by

Wherein mu is _c Sigma, the sample mean of the visual prototype _c Sample covariance, μ 'for visual prototype' _c Sigma 'is the sample mean of the semantic attention prototype' _c For the sample covariance of the semantic attention prototype,to correct the sample mean of the prototype, +.>To correct the sample covariance of the prototype.

The beneficial effects of the invention are as follows: the invention corrects the original semantic attention prototype defined from the sampling angle based on Bayesian estimation according to the whole information, can combine the advantages of the visual prototype and the semantic attention prototype, keeps the gain of the introduced label information when the sample is sparse, reduces the semantic deviation when the sample is added by utilizing the action of the visual prototype, and the obtained corrected prototype is closer to the true category prototype, thereby being beneficial to improving the generalization performance of the model obtained by the meta-learning method.

Drawings

FIG. 1 is a schematic flow chart of a learning method based on Bayesian estimation semantic attention elements according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a meta learning framework according to an embodiment of the present invention;

FIG. 3 is one of the flowcharts of the pre-training of feature extractors and pre-training classifiers using base class datasets in an embodiment of the present invention;

FIG. 4 is a schematic flow chart of obtaining a correction prototype based on query set data and Bayesian estimation according to an embodiment of the present invention;

FIG. 5 is a schematic diagram showing a comparison of performance of three class prototypes on a 5-way K-shot image classification task.

Detailed Description

The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

In the existing method for fusing the semantics and the vision, under the condition of giving more samples, the sparsity of the marked samples is relieved, the action of the tag semantic knowledge is gradually weakened, at the moment, the characteristics of the semantic space are weighted to the vision space, so that the semantic noise affecting the vision characteristics is generated, the semantic deviation is brought, the prototype learning of a new category is affected, and the generalization performance of the model obtained by meta learning is reduced. The learning method of semantic attention elements based on Bayesian estimation is oriented to small sample classification, and the generalization performance of a model obtained by element learning can be improved by carrying out Bayesian estimation on prototype vectors.

In one embodiment, as shown in fig. 1 and fig. 2, fig. 1 is one of flow diagrams of a learning method based on bayesian estimation semantic attention element provided in an embodiment of the present invention, and fig. 2 is a structural diagram of a meta learning framework provided in an embodiment of the present invention, where the learning method based on bayesian estimation semantic attention element includes the following steps:

s1001, constructing a support set and a query set of a task of meta training according to a base class data set.

Specifically, the invention is applied to a small sample image classification task of N-way K-shot, and classification of N novel class images with only K marked samples in each class is realized by performing meta-learning on a meta-training data set. Formally, meta-learning has three classes of mutually disjoint datasets:

(1) From the base class data set C _base Training set composed of images

(2) By verification class C _val Image-formed verification set

(3) From new class C _novel Image-formed test set

Wherein x is _i For the ith image, y _i For the class label corresponding to the ith image, |x| represents the total number of images taking the corresponding dataset. Constraint base class, verification class and new class are mutually disjointTraining sets typically have a rich set of marked images, with only a small number of marked images for verification and test sets.

The meta training adopts an N-way K-shot (Q1 +Q 2) -query scene training mode, and each scene randomly selects N categories from a base class data set to serve as pseudo-new categories, and other categories except the pseudo-new categories serve as pseudo-base categories; constructing a support set of a task of meta training according to the basic class data set, and forming the support set by extracting K samples from each class in the pseudo new class; the query set of one task trained according to the basic class data set construction element is that Q1 samples are extracted from each class in the pseudo new class, and NxQ 2 samples are randomly extracted from the pseudo basic class, so that the query set is formed.

In this embodiment, the pseudo-new class isPseudo base class->Support set is-> (x _i Label->) The query set is +.>(x _i Tag y of (2) _i ∈C _base )。

S1002, pre-training the feature extractor and the pre-training classifier by using the base class data set to obtain a pre-trained classification weight matrix of the feature extractor and the base class.

Specifically, as shown in fig. 3, fig. 3 is one of flowcharts of a method for pre-training a feature extractor and a pre-training classifier using a base class data set according to an embodiment of the present invention, where pre-training the feature extractor and the pre-training classifier using the base class data set includes:

s301, inputting the base class data set into a feature extraction module to obtain features of the base class data.

S302, inputting the characteristics of the base class data into a pre-training classifier to obtain pre-training prediction probability.

S303, calculating the pre-training cross entropy loss by using the distribution of the pre-training prediction probability and the real labels of the base class data set.

And S304, updating the feature extractor and the pre-training classifier according to the cross entropy loss.

S305, repeatedly executing the steps until the pre-training model converges, and obtaining a pre-trained feature extractor and a classification weight matrix of the base class after the pre-training is completed.

Wherein the classification weight matrix of the base class is W ^base ＝{w _i |i∈[1，|C _base |]}(w _i Classification for class iWeight vector, |C _base I is the total number of base class categories), the pre-trained classifier is a cosine classifier, and the cosine distance is used to calculate the similarity score sim of the image feature z and the c-th class weight _c (z，w _c ) The prediction probability of the final classifier output image x belonging to the class c is as follows (1):

sim _c (z，w _c )＝t·Cos(z，w _c ) (2)

wherein c' is a class number, t is a learnable scalar value, w _c Class c weight, f _θ (x) Feature embedding, i.e., z, is performed on the query image x through feature extraction.

In pre-training, the loss function uses class cross entropy loss, equation (3) below, to update the feature extractor and pre-training classifier by minimizing the loss.

S1003, respectively inputting the support set and the query set into a pre-trained feature extractor to obtain the features of all samples in the support set and the features of all samples in the query set. The different data sets are input to a feature extractor which obtains features of all samples in the different data sets.

S1004, calculating the characteristic mean value of all samples of each category in the support set to obtain a mean value category prototype.

Specifically, the calculation formula of the mean value prototype is shown in formula (4):

s1005, extracting a classification weight vector of a pseudo base class from a classification weight matrix of the base class, and guiding to obtain a base class prototype by using the pseudo base class weight according to a Dynamic FSL method.

In the invention, the guiding calculation process of using the pseudo base class weight to guide to obtain the base class prototype is shown as a formula (5):

in the method, in the process of the invention,is a weight matrix which can be learned, < +.>To a learnable key value for associating base class features, w _j Is a pseudo-base class weight. Phi (phi) _q Visual feature vector f of pseudo new class support set sample _θ (x _support ) Conversion to a query set sample feature vector f _θ (x _query ) Calculate the query sample f _θ (x _query ) And the base class key k _j The higher the query and j-th key similarity, the more likely the classifier weights corresponding to the j-th base class will be employed.

S1006, merging the mean class prototype and the base class prototype to obtain a visual prototype.

Specifically, the combined calculation formula is

Wherein a is ₁ And a ₂ Is a learnable parameter.

S1007, inputting the category labels and the visual prototypes of the support set into a semantic attention module to obtain a semantic attention prototype. The step utilizes the semantic information of the label to make up the sparsity of the visual prototype.

GloVe is a word characterization tool based on global word frequency statistics that can express a word as a vector of real numbers that captures some semantic properties between words, such as similarity and analogies. The invention is thatWord embedding vectors for category labels are obtained using the GloVe model, and are noted as (gc is the word embedding vector for category c). Mapping the word embedding vector from the semantic space of dg dimension to the visual space of dv dimension to obtain a visual table of word vector +.>Mapping function->Is a multi-layer sensor composed of two linear layers.

Under the guidance of semantic knowledge, selecting important feature dimensions related to the category c, and converting the visual prototype into a semantic attention prototypeThe method comprises the following steps:

wherein,the Hadamard product (per bit) is represented.

S1008, obtaining a correction prototype based on the query set data and the Bayesian estimation, wherein the distribution of the vision prototype is defined as the prior distribution of the Bayesian estimation, the distribution of the semantic attention prototype is a likelihood function of the Bayesian estimation, and the distribution of the correction prototype is a posterior distribution of the Bayesian estimation.

The central limit theorem indicates that a large number of mutually independent random variables approximately follow normal distribution (gaussian distribution) after being properly standardized. Features of one type of image in visual space are clustered, so that the source of the type can be assumedType obeys a Multivariate Gaussian Distribution (MGD). Based on this assumption, the present invention replaces a single mean prototype representation with a gaussian distribution using mean and covariance to describe the probability density function of the prototype. In particular, a visual prototype is providedSemantic attention prototype->Correction prototype

The bayesian theorem is as follows:

where Likelihood is a Likelihood function, prior is a priori distribution, and λ is a regularization constraint constant. Posterior distribution (Posterior) is the conditional probability distribution that results after considering and giving relevant evidence or data.

Since the prior reflects the overall perception of the class prototype, the present invention treats the distribution of visual prototypes as a prior distribution of class prototypesWhen visual prototypes are observed, the distribution of semantic attention prototypes +.>Is considered to be a likelihood function that,

because likelihood functions and semantic attention prototypesThe local information is reflected. The present invention corrects the distribution of prototypes +.>Considered as the rear partAnd (5) checking distribution.

As a front-end and a back-end of the class prototype,and->Respectively, is distinguished by the introduction of semantic information. The process of the present invention for calculating the posterior using the Bayesian theorem can be considered as a correct estimation of the original semantic attention prototype,>defined from a sampling point of view based on the global information. Thus, the Bayesian theorem can be further expressed as

In the method, in the process of the invention,to correct the prototype +.>For vision prototype, ->Is a semantic attention prototype.

As shown in fig. 4, fig. 4 is a schematic flow chart of obtaining a correction prototype based on query set data and bayesian estimation according to an embodiment of the present invention, where obtaining the correction prototype based on the query set data and bayesian estimation includes:

s401, estimating sample mean values and sample covariance of the visual prototype and the semantic attention prototype by using the query set data, the visual prototype and the semantic attention prototype respectively.

S402, calculating Gaussian distribution parameters of the correction prototype by using sample mean values and sample covariance of the visual prototype and the semantic attention prototype.

S403, representing the final correction prototype by using the mean value parameter in the Gaussian parameters of the correction prototype.

Due to correction prototypes, visual prototypes and semantic attention prototypes. All obey a multivariate gaussian distribution having the formula:

synthesizing the formula (9-12), and neglecting constant terms to obtain Gaussian distribution parameters of the correction prototype, wherein the formula is as follows:

wherein mu is _c Sigma, the sample mean of the visual prototype _c Sample covariance, μ 'for visual prototype' _c Sigma 'is the sample mean of the semantic attention prototype' _c For the sample covariance of the semantic attention prototype,to correct the sample mean of the prototype, +.>To correct the originalSample covariance of the pattern.

The present invention uses the mean parameter in the gaussian distribution parameters to characterize the final corrected prototypeNamely, formula (15):

next, 4 unknown parameters μ need to be calculated according to equation (13) (14) _c 、Σ _c 、μ′ _c Σ's' _c Is a value of (2). Because of the sparsity of the samples in the small sample problem setting, the true distribution of the samples is difficult to reflect by using only the support set data to calculate the mean and covariance of the class prototypes, so that a transduction reasoning mode is adopted, and simultaneously, the sample mean mu and the sample covariance sigma of the visual prototypes and the semantic attention prototypes are respectively estimated by using the unlabeled query set data. The formula is as follows:

wherein, |S _c I is the number of images belonging to category c in the support set, f _θ (x) Feature embedding is performed on the query image x through feature extraction. For support set sample S _c The label is known and the probability P of the image x being predicted as category c (y=c|x) is 0 or 1. For the query set sample Q, the labels are unknown, and the prediction probabilities P (y=c|x) corresponding to the respective prototypes are obtained via a cosine classifier. Specifically for μ _c Sum sigma _c ，The probability formula for predicting the query image x as category c is as follows:

for μ' _c Sum sigma' _c ，The probability of category c is simply that in equation (18)Replaced by->

Where c' is a class number for traversing all classes of the support set. t is a meta-training classifier constraint parameter for amplifying cosine distance metric differences, which can be set to a constant value greater than 1 or to a learnable scalar parameter.

S1009, inputting the characteristics of the correction prototype and all samples in the query set into a meta-training classifier to obtain the prediction probability of the query set image.

S1010, calculating cross entropy loss by using the distribution of the prediction probability and the real labels of the query set.

And S1011, updating semantic attention module parameters, meta training classifier parameters and pseudo base class weight guidance according to the cross entropy loss to obtain guide parameters of the base class prototype and merging parameters of the mean class prototype and the base class prototype.

Specifically, the semantic attention module parameter is a multi-layer perceptronParameter of->The parameters of the meta training classifier are t in the formula (2), and the guiding parameters of the pseudo base class weight guiding to obtain the base class prototype are phi in the formula (5) _q The merging parameters for merging the mean class prototype and the base class prototype include a in formula (6) ₁ ，a ₂ 。

S1012, repeating the steps until the meta training model converges and meta learning is completed.

According to the Bayesian estimation-based semantic attention element learning method, the original semantic attention prototype defined from the sampling angle is corrected according to the integral information by using Bayesian estimation, the obtained correction prototype has the advantages of the visual prototype and the semantic attention prototype, the gain of the label information is introduced when the sample is sparse is reserved, the semantic deviation when the sample is added is reduced by utilizing the action of the visual prototype, the obtained correction prototype is closer to the true category prototype, and the generalization performance of the model obtained by the meta learning method, such as the classification performance of a small sample classification model, is improved.

In one embodiment, the effectiveness of the Bayesian estimation based semantic attention element learning method of the present invention is evaluated over Mini-ImageNet datasets. The performance of three types of prototypes, namely a visual prototype, a semantic attention prototype introducing label semantic information and a correction prototype obtained by using the correction strategy provided by the invention, on a small sample image classification task with 5 categories and 1-10 marked sample numbers (5-way K-shot) is compared, and the specific classification accuracy is realized. The performance results on the small sample image classification task are shown in fig. 5 and table 1 below, and fig. 5 is a schematic diagram showing the performance comparison of three types of prototypes on the 5-way K-shot image classification task.

Table 1 comparison of the Performance of three class prototypes on a 5-way K-shot image classification task

As can be seen from fig. 5, the semantic attention prototype performs better when the number of samples k=1 to 3, and especially on a single sample task, the addition of the label semantics to the visual prototype makes the semantic attention prototype alleviate the singleness of the prototype to some extent. The visual prototype shows better performance than the semantic attention prototype when the number of samples k=4-10, which indicates that the visual prototype becomes stable and accurate given more samples, and the semantic information is no longer gain, but side effects are generated, which becomes semantic noise influencing the representativeness of the prototype. The method of the invention has better effects than the K=1-10, and effectively improves the problems that the original semantic attention prototype has weakened semantic advantages and even plays a role when the number of samples is increased. It is worth noting that compared with semantic attention prototypes, the method has excellent classification accuracy in common 1-shot and 5-shot settings, and has great advantages.

In addition, the embodiment also compares the performance of the method and other popular small sample image classification methods, the prototype network is a classical meta-learning method based on measurement, the visual prototype is the basic framework of the invention, and the semantic attention prototype is a meta-learning method for classifying small sample images by using semantic knowledge.

Table 2 lists the experimental results of the above method on the Mini-ImageNet dataset, with "S." representing whether semantic knowledge is utilized, and "Y" if not, "N", bolded for rank first and underlined for rank second. The method of the invention shows optimal performance in both 1-shot and 5-shot tasks. Specifically, on the 1-shot task, compared with the prior meta learning method for classifying small sample images by using semantic knowledge, the classification accuracy of the method is 4.88 percent higher. The method is superior to all other methods in the 5-shot task, has advantages in the 1-shot and 5-shot tasks, further illustrates that the correction prototype provided by the invention is representative, is a more comprehensive prototype, can be quickly adapted to a small sample task, and can achieve advanced image classification performance only by using a plurality of marked samples.

TABLE 2 Performance of small sample image classification methods on Mini-ImageNet datasets

The foregoing examples illustrate only a few embodiments of the invention and are described in detail herein without thereby limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention. Accordingly, the scope of the invention should be assessed as that of the appended claims.

Claims

1. A learning method of semantic attention elements based on Bayesian estimation is characterized by comprising the following steps:

inputting the category labels of the support set and the visual prototypes into a semantic attention module to obtain semantic attention prototypes;

updating semantic attention module parameters, meta training classifier parameters and pseudo base class weight guidance according to the cross entropy loss to obtain guide parameters of a base class prototype and merging parameters of a mean class prototype and the base class prototype;

2. The learning method of semantic attention element based on Bayesian estimation according to claim 1, wherein the element training adopts an N-way K-shot (Q1+Q2) -query scene training mode, each scene randomly selects N categories from a base class data set as pseudo new classes, and other categories except the pseudo new classes are pseudo base classes;

constructing a support set of a task of meta training according to the base class data set, and forming the support set by extracting K samples from each class in the pseudo new class;

and according to the basic class data set, constructing a query set of one task of meta training, namely extracting Q1 samples from each class in the pseudo new class, and randomly extracting N multiplied by Q2 samples from the pseudo basic class to form the query set.

3. The bayesian-estimate-based semantic attention-element learning method of claim 2, wherein pre-training a feature extractor and a pre-training classifier using the base class dataset comprises:

4. A bayesian estimation based semantic attention primitive learning method according to claim 3, wherein obtaining a correction prototype based on query set data and bayesian estimation comprises:

calculating gaussian distribution parameters of a correction prototype by using sample means and sample covariance of the visual prototype and the semantic attention prototype;

5. The bayesian estimation based semantic attention component learning method of claim 4, wherein the bayesian estimation is

6. The method of claim 4, wherein the calculation formula for estimating the sample mean and the sample covariance of the visual prototype and the semantic attention prototype using the query set data, the visual prototype, and the semantic attention prototype, respectively, is as follows

7. The method of claim 6, wherein the probability P (y=c|x) that the query image x is predicted as the category c is calculated as

8. The bayesian-estimated semantic attention primitive learning method of claim 4, wherein the calculation of gaussian distribution parameters for a correction prototype using sample means and sample covariance of the visual prototype and semantic attention prototype is given by the formula