CN111914929B - Zero sample learning method - Google Patents
Zero sample learning method
- Publication number
- CN111914929B CN202010750578.4A
- Authority
- CN
- China
- Prior art keywords
- network
- visual
- zero
- features
- visual feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 25
- 230000000007 visual effect Effects 0.000 claims abstract description 95
- 238000012549 training Methods 0.000 claims abstract description 30
- 239000013598 vector Substances 0.000 claims abstract description 22
- 230000006870 function Effects 0.000 claims abstract description 16
- 238000009826 distribution Methods 0.000 claims abstract description 10
- 230000007246 mechanism Effects 0.000 claims abstract description 10
- 238000005457 optimization Methods 0.000 claims abstract description 7
- 238000012360 testing method Methods 0.000 claims description 7
- 238000013528 artificial neural network Methods 0.000 claims description 5
- 238000013508 migration Methods 0.000 claims description 3
- 230000005012 migration Effects 0.000 claims description 3
- 230000005540 biological transmission Effects 0.000 claims 1
- 238000012546 transfer Methods 0.000 abstract description 9
- 241000282412 Homo Species 0.000 description 2
- 230000008901 benefit Effects 0.000 description 2
- 238000013527 convolutional neural network Methods 0.000 description 2
- 230000009977 dual effect Effects 0.000 description 2
- 238000003058 natural language processing Methods 0.000 description 2
- 230000009286 beneficial effect Effects 0.000 description 1
- 230000001143 conditioned effect Effects 0.000 description 1
- 238000013434 data augmentation Methods 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 238000013136 deep learning model Methods 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 230000004438 eyesight Effects 0.000 description 1
- 230000004927 fusion Effects 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 238000003909 pattern recognition Methods 0.000 description 1
- 238000011176 pooling Methods 0.000 description 1
- 230000008569 process Effects 0.000 description 1
- 241000894007 species Species 0.000 description 1
- 239000013589 supplement Substances 0.000 description 1
- 230000004382 visual function Effects 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Linguistics (AREA)
- Evolutionary Computation (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biophysics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Biomedical Technology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Molecular Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Databases & Information Systems (AREA)
- Image Analysis (AREA)
- Image Processing (AREA)
- Filters That Use Time-Delay Elements (AREA)
Abstract
The present invention provides a zero-shot learning method that recognizes never-before-seen data categories by transferring knowledge from seen-class samples to unseen-class samples. The method mainly comprises the following steps: acquiring a training feature data set; building a zero-shot learning model based on a generative network, a noise autoencoder, a regression network and a discriminant network; training the generative network and the noise autoencoder; training the discriminant network; and obtaining the total objective function and iterating to optimize the algorithm. Through knowledge transfer, the invention fuses two kinds of semantic features, attributes and word vectors, trains under an adversarial mechanism to minimize the distribution difference between real samples and generated samples, and maps visual features back to semantic features through a regression network. This effectively alleviates the problem of the model's prediction results shifting toward neighboring classes, allows hard-to-label examples to be recognized, and reduces the cost of recognition.
Description
Technical Field
The present invention relates to a zero-shot learning method and belongs to the field of pattern recognition.
Background Art
With the development of deep learning, the performance of computer vision and machine learning methods has improved greatly, and deep learning models have achieved surprising success in image classification, even rivaling human recognition ability. Humans, however, have a natural advantage in recognizing novel objects, that is, objects they have only heard of or seen a few times before, or new objects they have never encountered at all. The most fundamental reason for this difference is that deep models rely on fully supervised learning. Training neural networks therefore requires large amounts of annotated data, and since nature contains tens of thousands of species, collecting and annotating visual data is cumbersome and expensive. This gives rise to a new task: recognizing unseen-class samples by transferring knowledge from seen-class samples to unseen-class samples, so as to solve the image annotation problem.
Zero-shot learning is currently receiving more and more attention. In zero-shot learning, the sets of seen classes and unseen classes are usually assumed to be disjoint. Some of the samples in the feature space are labeled; these are called seen-class samples, and only the visual instances of the seen classes are used to train the model. The feature space also contains unlabeled sample instances, and these categories are called unseen classes. The feature space consists of vectors extracted from the samples by a neural network, and each sample belongs to one category. To establish a connection between the seen-class samples and the unseen-class samples, semantic features are usually introduced for zero-shot learning. Attributes are the most commonly used semantic features in zero-shot learning, but manually annotating each semantic attribute for the visual data is time-consuming and laborious. Natural language processing techniques instead exploit alternative semantic features (for example word vectors such as GloVe), obtaining textual information directly from Wikipedia articles; however, because such semantic features are acquired in a coarse manner, their performance is worse than that of attribute features.
In view of this, it is necessary to propose a zero-shot learning method that solves the above problems.
Summary of the Invention
The purpose of the present invention is to provide a zero-shot learning method for recognizing hard-to-label examples while reducing the cost of recognition.
To achieve the above purpose, the present invention provides a zero-shot learning method for transferring knowledge from seen-class samples to unseen-class samples in order to recognize never-before-seen data categories, mainly comprising the following steps:
Step 1: obtain a training feature data set, the training feature data set comprising seen-class samples, where the seen-class samples include labels, real visual features and semantic features;
Step 2: build a zero-shot learning model based on a generative network, a noise autoencoder, a regression network and a discriminant network, and initialize the generative network, noise autoencoder, regression network and discriminant network in the zero-shot learning model;
Step 3: train the generative network and the noise autoencoder to generate a first visual feature and a second visual feature respectively, and fuse the first visual feature and the second visual feature into a pseudo visual feature according to different weights;
Step 4: train the discriminant network to classify the pseudo visual features and the real visual features, and optimize the generative network and the discriminant network through an adversarial mechanism;
Step 5: train the regression network, taking the pseudo visual features as input, so as to map the pseudo visual features to semantic features;
Step 6: add the loss functions of the generative network, the noise autoencoder, the regression network and the discriminant network to obtain the total objective function, and iterate to optimize the algorithm.
Optionally, in step 1, the labels include quantity labels and category labels, and the semantic features include word vectors and attribute features.
Optionally, in step 2, feedforward neural networks are used to pass data among the generative network, the noise autoencoder, the regression network and the discriminant network.
Optionally, in step 3, the generative network generates the first visual feature from attribute features and Gaussian random noise, and the noise autoencoder generates the second visual feature from word vectors, latent variables and Gaussian random noise.
Optionally, in step 3, the formula for fusing the first visual feature and the second visual feature into the pseudo visual feature is:
x_f = λ·x_1 + (1 - λ)·x_2,
where x_f is the pseudo visual feature, λ is the corresponding weight, x_1 is the first visual feature representation, x_2 is the second visual feature representation, and the weights of the two parts sum to 1.
Optionally, in step 4, the adversarial mechanism can be expressed as:
L_WGAN = E[D(x)] - E[D(x_f)] - λ_gp·E[(||∇_x̂ D(x̂)||_2 - 1)^2], with x̂ = α·x + (1 - α)·x_f,
where x is the real visual feature, x_f = G(a, w, z), α ~ U(0, 1), and λ_gp is the gradient-penalty coefficient.
Optionally, in step 4, the distributions of the pseudo visual features and the real visual features are both constrained by a least-squares loss, given by:
L_1 = E[||x - x_f||_2^2],
where x is the real visual feature and x_f is the pseudo visual feature.
Optionally, the total objective function in step 6 is:
L = L_WGAN + L_1 + λ_2·L_R,
where λ_2 is a hyperparameter that assigns weights to the different parts.
Optionally, in step 6, Adam is used as the optimizer for the algorithm optimization.
Optionally, the method further includes step 7: using the generative network trained in step 3 to generate visual features for unseen-class samples and classifying them, so as to test the total objective function of step 6.
The beneficial effects of the present invention are as follows: through knowledge transfer, the invention fuses two kinds of semantic features, attributes and word vectors, trains under an adversarial mechanism to minimize the distribution difference between real and generated samples, and maps visual features back to semantic features through a regression network. This effectively alleviates the problem of the model's prediction results shifting toward neighboring classes, allows hard-to-label examples to be recognized, and reduces the cost of recognition.
Brief Description of the Drawings
FIG. 1 is a schematic flowchart of the zero-shot learning method of the present invention.
FIG. 2 is a flowchart of generating the first visual feature in the zero-shot learning method of the present invention.
FIG. 3 is a flowchart of generating the second visual feature in the zero-shot learning method of the present invention.
Detailed Description
To make the objectives, technical solutions and advantages of the present invention clearer, the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.
As shown in FIG. 1, the present invention discloses a zero-shot learning method for transferring knowledge from seen-class samples to unseen-class samples in order to recognize never-before-seen data categories, mainly comprising the following steps:
Step 1: obtain a training feature data set, the training feature data set comprising seen-class samples, where the seen-class samples include labels, real visual features and semantic features;
Step 2: build a zero-shot learning model based on a generative network, a noise autoencoder, a regression network and a discriminant network, and initialize the generative network, noise autoencoder, regression network and discriminant network in the zero-shot learning model;
Step 3: train the generative network and the noise autoencoder to generate a first visual feature and a second visual feature respectively, and fuse the first visual feature and the second visual feature into a pseudo visual feature according to different weights;
Step 4: train the discriminant network to classify the pseudo visual features and the real visual features, and optimize the generative network and the discriminant network through an adversarial mechanism;
Step 5: train the regression network, taking the pseudo visual features as input, so as to map the pseudo visual features to semantic features;
Step 6: add the loss functions of the generative network, the noise autoencoder, the regression network and the discriminant network to obtain the total objective function, and iterate to optimize the algorithm.
Steps 1 to 6 are described in detail below.
In step 1, the training feature data set consists of 2048-dimensional visual features extracted by a deep convolutional neural network and is a group of vectors; the labels include quantity labels and category labels, and the semantic features include word vectors and attribute features. The visual features are the 2048-dimensional output of the top-level pooling unit of the deep convolutional neural network ResNet101, which has shown excellent performance. For the AwA1 and AwA2 databases, in addition to the attribute features used as semantic features, each category is also represented by a word vector of dimension 1000. Specifically, natural language processing techniques are used to extract a word vector for each category from a large language corpus. As for the attribute features, continuous-valued semantic attributes are used, whose dimensions are shown in Table 1 below.
Table 1
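For illustration only, the following is a minimal sketch of how such 2048-dimensional ResNet101 features (the extraction described in step 1 above) can be obtained with PyTorch/torchvision; the patent does not include extraction code, and the preprocessing settings below are the usual ImageNet defaults rather than values taken from the text:

```python
import torch
from PIL import Image
from torchvision import models, transforms

# Pretrained ResNet101; replacing the final fc layer with Identity exposes the
# 2048-dimensional output of the top-level pooling unit described above.
backbone = models.resnet101(weights=models.ResNet101_Weights.IMAGENET1K_V1)
backbone.fc = torch.nn.Identity()
backbone.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def extract_visual_feature(image_path: str) -> torch.Tensor:
    """Return a 2048-dimensional visual feature vector for a single image."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    return backbone(img).squeeze(0)   # shape: (2048,)
```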
In step 2, the generative network, the noise autoencoder, the regression network and the discriminant network pass data to one another through feedforward neural networks.
The noise autoencoder has two hidden fully connected layers of 1200 and 600 units respectively, implemented together with hidden fully connected layers of 2048 and 4096 units; the discriminant network is implemented with only a single hidden fully connected layer of 512 units; and the regression network has a single hidden layer of 600 units. All noise dimensions in the present invention are 100. Pseudo visual features can thus be synthesized from the two different kinds of semantic features respectively, while the regression network and the discriminant network are used to perform semantic inference and impose the related constraints.
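A minimal PyTorch sketch of the four modules with the layer sizes listed above is given below; the activation functions, the exact placement of the 2048- and 4096-unit layers, and the example feature dimensions are assumptions, since the text does not specify them:

```python
import torch
import torch.nn as nn

# Example dimensions (e.g. 85-dimensional attributes as in AwA-style data sets); adjust per data set.
VIS_DIM, ATT_DIM, WORD_DIM = 2048, 85, 1000
NOISE_DIM, LATENT_DIM = 100, 100   # all noise dimensions are 100 in the text

def mlp(*sizes, out_act=None):
    """Plain feedforward stack of fully connected layers with LeakyReLU in between."""
    layers = []
    for i in range(len(sizes) - 1):
        layers.append(nn.Linear(sizes[i], sizes[i + 1]))
        if i < len(sizes) - 2:
            layers.append(nn.LeakyReLU(0.2))
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

# Generative network G: (attribute, noise) -> first visual feature x1
G = mlp(ATT_DIM + NOISE_DIM, 4096, VIS_DIM, out_act=nn.ReLU())

# Noise autoencoder: encoder with 1200/600-unit hidden layers producing Q(Z|X),
# decoder turning (word vector, latent code, noise) into the second visual feature x2
encoder = mlp(VIS_DIM + WORD_DIM, 1200, 600, LATENT_DIM)
decoder = mlp(WORD_DIM + LATENT_DIM + NOISE_DIM, 4096, VIS_DIM, out_act=nn.ReLU())

# Discriminant network D: single 512-unit hidden layer, scalar critic output
D = mlp(VIS_DIM, 512, 1)

# Regression network R: single 600-unit hidden layer, maps visual features to attributes
R = mlp(VIS_DIM, 600, ATT_DIM)
```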
In step 3, the purpose of the generative network is to learn the probability distribution of the data points so that samples can be drawn from it as a data augmentation mechanism. As one of the most promising generative models, generative adversarial models have been widely studied.
As shown in FIG. 2, given the training data D_tr of the seen classes, the aim is to learn a generative network G: Z × C → X that takes Gaussian random noise z ∈ Z and an attribute feature a_i ∈ R^q as input and outputs the first visual feature. Once the generative network has learned to generate the first visual feature from the seen-class samples, attribute features can be embedded to generate visual features for any unseen class. The generative network is learned through the adversarial objective given in step 4 below, where x_1 = G(z, a_i) is the i-th generated first visual feature representation in the visual space, with the corresponding attribute feature a_i and noise z.
The present invention also uses a noise autoencoder to obtain the second visual feature, taking word vectors as the semantic features; word vectors are another kind of semantic information that describes the unseen-class samples from a different angle and complements the attribute features. For this purpose the WAE is extended to a conditional WAE.
As shown in FIG. 3, given the specific conditional information (the word vector), the noise autoencoder is used to produce a probability distribution Q(Z|X), where Q_Z is the distribution induced on the latent space and P_Z is an isotropic Gaussian prior on the latent space Z. Specifically, a discriminant network is introduced in the latent space Z, whose goal is to distinguish the "fake" points sampled from the prior P_Z from the "true" points sampled from Q(Z|X). The decoder then decodes the word vector w_i ∈ R^p together with a latent variable drawn from Q(Z|X) to generate the second visual feature. The loss function of the conditional WAE is defined as:
L_WAE = E[c(x, x_2)] + λ·D_Z(Q_Z, P_Z),
where x_2 is the i-th generated second visual feature representation in the visual space, with the corresponding word vector w_i, and c(·,·) is the reconstruction cost. D_Z(Q_Z, P_Z) = D_JS(Q_Z, P_Z) is chosen and is estimated with adversarial training, and λ > 0 is a hyperparameter.
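A sketch of one conditional-WAE loss computation under these definitions follows, reusing the `encoder`, `decoder` and dimension constants from the module sketch above; taking the reconstruction cost c as squared error and estimating the D_Z term with a small latent-space critic trained adversarially are implementation assumptions rather than details given in the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Small critic on the latent space Z, used to estimate D_Z(Q_Z, P_Z) adversarially.
latent_critic = nn.Sequential(nn.Linear(LATENT_DIM, 128),
                              nn.LeakyReLU(0.2),
                              nn.Linear(128, 1))

def cwae_step(x, w, lam=1.0):
    """x: real visual features (B, 2048); w: paired word vectors (B, 1000)."""
    z_q = encoder(torch.cat([x, w], dim=1))        # latent codes ~ Q(Z|X)
    z_p = torch.randn_like(z_q)                    # samples from the Gaussian prior P_Z
    eps = torch.randn(x.size(0), NOISE_DIM, device=x.device)
    x2 = decoder(torch.cat([w, z_q, eps], dim=1))  # second visual feature

    # Critic loss: tell prior samples apart from encoded samples (label choice is arbitrary).
    p_logit, q_logit = latent_critic(z_p), latent_critic(z_q.detach())
    critic_loss = F.binary_cross_entropy_with_logits(p_logit, torch.ones_like(p_logit)) + \
                  F.binary_cross_entropy_with_logits(q_logit, torch.zeros_like(q_logit))

    # WAE loss: reconstruction cost c(x, x2) plus lam times the D_Z term; the encoder is
    # pushed to make Q_Z indistinguishable from the prior in the eyes of the critic.
    q_logit_enc = latent_critic(z_q)
    wae_loss = F.mse_loss(x2, x) + lam * F.binary_cross_entropy_with_logits(
        q_logit_enc, torch.ones_like(q_logit_enc))
    return wae_loss, critic_loss, x2
```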
The present invention fuses the first visual feature and the second visual feature. On the one hand, based on human experience in learning to recognize new objects, attribute features carry more effective semantic information than word vectors. On the other hand, compared with the real visual features, the generated pseudo visual features contain a large amount of invalid information, which can be removed through feature fusion so as to keep the information effective.
Based on the above, it is necessary to assign different weights to the pseudo visual features generated from the attribute features and from the word vectors. The first visual feature and the second visual feature are fused into the pseudo visual feature by:
x_f = λ·x_1 + (1 - λ)·x_2,
where x_f is the pseudo visual feature, λ is the corresponding weight, x_1 is the first visual feature representation, x_2 is the second visual feature representation, and the weights of the two parts sum to 1.
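In code, the fusion above is a single weighted sum; λ = 0.7 below is only an example value, since the patent only requires that the two weights sum to 1:

```python
import torch

def fuse_features(x1: torch.Tensor, x2: torch.Tensor, lam: float = 0.7) -> torch.Tensor:
    """Fuse the attribute-conditioned feature x1 and the word-vector-conditioned
    feature x2 into the pseudo visual feature x_f; the two weights sum to 1."""
    return lam * x1 + (1.0 - lam) * x2
```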
In step 4, the discriminant network is used to classify the pseudo visual features and the real visual features. The pseudo visual features try to fool the discriminant network as successfully as possible; as the discriminant network keeps improving its discriminative ability, the generative network and the discriminant network are optimized through the adversarial mechanism, and the quality of the generated pseudo visual features keeps improving as well. The present invention uses the improved WGAN for adversarial training, and the adversarial process of training the discriminant network can be expressed as:
L_WGAN = E[D(x)] - E[D(x_f)] - λ_gp·E[(||∇_x̂ D(x̂)||_2 - 1)^2], with x̂ = α·x + (1 - α)·x_f,
where x is the real visual feature, x_f = G(a, w, z), and α ~ U(0, 1). The first two terms of the expression approximate the Wasserstein distance, while the third term is a gradient penalty that forces the gradient of D to have unit norm along straight lines between real and generated features.
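The gradient penalty in the expression above can be computed as in the standard improved-WGAN recipe; the sketch below reuses the discriminant network `D` from the module sketch, and the penalty coefficient of 10 is the value commonly used for WGAN-GP rather than one stated in the text:

```python
import torch

def gradient_penalty(D, x_real, x_fake, gp_coef=10.0):
    """Force the gradient of D to have unit norm along lines between real and fake features."""
    alpha = torch.rand(x_real.size(0), 1, device=x_real.device)        # alpha ~ U(0, 1)
    x_hat = (alpha * x_real + (1.0 - alpha) * x_fake).requires_grad_(True)
    d_hat = D(x_hat)
    grads = torch.autograd.grad(outputs=d_hat, inputs=x_hat,
                                grad_outputs=torch.ones_like(d_hat),
                                create_graph=True, retain_graph=True)[0]
    return gp_coef * ((grads.norm(2, dim=1) - 1.0) ** 2).mean()

def discriminator_loss(D, x_real, x_fake):
    # Maximizing E[D(x)] - E[D(x_f)] is implemented by minimizing its negative,
    # plus the gradient penalty term.
    return D(x_fake).mean() - D(x_real).mean() + gradient_penalty(D, x_real, x_fake)

def generator_adv_loss(D, x_fake):
    # The generator side tries to raise the critic score of the fused pseudo features.
    return -D(x_fake).mean()
```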
Although the adversarial mechanism gives the generative network the ability to produce realistic visual features with a similar distribution, this alone is not enough to ensure that the fused pseudo visual features are effective. Therefore, the distributions of the fused pseudo visual features and the real visual features are further constrained by the least-squares loss:
L_1 = E[||x - x_f||_2^2].
In step 5, the regression network takes the pseudo visual features as input and converts them into semantic features. The generative network and the regression network together form a dual-learning framework, so they can learn from each other. In the present invention, the primal task is to generate visual features conditioned on class embeddings, while the dual task is to map visual features back to the corresponding class semantic space.
Let x denote a real visual feature sampled from the training feature data set and x_f the fused pseudo visual feature. With the paired training data (x, a), the regression network can be trained under a supervised loss L_R.
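A sketch of this supervised loss follows, under the assumption that it is a mean-squared error on both the real and the fused features (the exact form is not spelled out in the text); `R` is the regression network from the module sketch:

```python
import torch.nn.functional as F

def regression_loss(R, x_real, x_fake, a):
    """Map visual features back to attribute space and compare with the paired attributes a."""
    return F.mse_loss(R(x_real), a) + F.mse_loss(R(x_fake), a)
```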
In step 6, the loss functions of the generative network, the noise autoencoder, the regression network and the discriminant network are added to obtain the total objective function:
L = L_WGAN + L_1 + λ_2·L_R,
where λ_2 is a hyperparameter that assigns weights to the different parts.
Adam is chosen as the optimizer, with the parameters β_1 and β_2 set to (0.9, 0.999). The discriminant network is trained first and its parameters are optimized, after which its parameters are fixed; the learning rate of the discriminant network is set to 0.00001. The generative network and the regression network are then trained and their parameters optimized, with their learning rates set to 0.0001. All modules of the zero-shot learning model are trained with a batch size of 128, for 1000 epochs on each data set; the model parameters are saved every 10 epochs, and the model is then evaluated on the test set.
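A condensed training-loop sketch with the optimizer settings above is given below, reusing the modules and loss helpers from the earlier sketches; the data loader, the per-batch alternation between the discriminant update and the generator/regressor update, and the value of λ_2 are assumptions:

```python
import itertools
import torch

d_opt = torch.optim.Adam(itertools.chain(D.parameters(), latent_critic.parameters()),
                         lr=1e-5, betas=(0.9, 0.999))
g_opt = torch.optim.Adam(itertools.chain(G.parameters(), encoder.parameters(),
                                         decoder.parameters(), R.parameters()),
                         lr=1e-4, betas=(0.9, 0.999))
lambda2 = 1.0   # weight of the regression term (example value)

def generate_pseudo(x, a, w):
    """Generate x1, x2 and the fused pseudo feature x_f for one batch."""
    z = torch.randn(x.size(0), NOISE_DIM)
    x1 = G(torch.cat([a, z], dim=1))               # first visual feature
    wae_loss, critic_loss, x2 = cwae_step(x, w)    # second visual feature
    return fuse_features(x1, x2), wae_loss, critic_loss

for epoch in range(1000):
    for x, a, w in train_loader:   # real features, attributes, word vectors (batch size 128)
        # Train the discriminant network (and latent critic) first.
        x_f, _, critic_loss = generate_pseudo(x, a, w)
        d_opt.zero_grad()
        (discriminator_loss(D, x, x_f.detach()) + critic_loss).backward()
        d_opt.step()

        # Then fix it and update the generative, autoencoding and regression modules
        # (features are re-generated for this phase for clarity).
        x_f, wae_loss, _ = generate_pseudo(x, a, w)
        g_opt.zero_grad()
        total = generator_adv_loss(D, x_f) + wae_loss + lambda2 * regression_loss(R, x, x_f, a)
        total.backward()
        g_opt.step()

    if (epoch + 1) % 10 == 0:      # save model parameters every 10 epochs
        torch.save({"G": G.state_dict(), "R": R.state_dict()}, f"checkpoint_{epoch + 1}.pt")
```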
Preferably, the present invention further includes step 7: the generative network trained in step 3 is used to generate visual features for the unseen-class samples, which are then classified in order to test the total objective function of step 6.
After the model is trained, in order to predict the labels of the unseen-class samples, new samples are first synthesized for each unseen class. These synthetic samples are then combined with the other samples in the training data, and any new classifier covering samples of both the seen and the unseen classes can be trained on this new data set.
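For illustration, the following sketch shows this final step: features are synthesized for the unseen classes with the trained generator, and a classifier is trained over the union of real seen-class features and synthetic unseen-class features. The `train_x`/`train_y`/`test_x` tensors, the `unseen_attrs` dictionary and the number of synthesized samples per class are assumptions; the 1-NN classifier matches the evaluation used in the comparison below.

```python
import torch
from sklearn.neighbors import KNeighborsClassifier

@torch.no_grad()
def synthesize_unseen(G, unseen_attrs, n_per_class=300):
    """unseen_attrs: dict mapping an unseen class label to its attribute vector."""
    feats, labels = [], []
    for label, a in unseen_attrs.items():
        a_rep = a.unsqueeze(0).repeat(n_per_class, 1)
        z = torch.randn(n_per_class, NOISE_DIM)
        feats.append(G(torch.cat([a_rep, z], dim=1)))
        labels.extend([label] * n_per_class)
    return torch.cat(feats), torch.tensor(labels)

# Combine synthetic unseen-class features with the real seen-class training features,
# then fit any classifier on the union; here a simple 1-NN classifier is used.
syn_x, syn_y = synthesize_unseen(G, unseen_attrs)
all_x = torch.cat([train_x, syn_x]).numpy()
all_y = torch.cat([train_y, syn_y]).numpy()
clf = KNeighborsClassifier(n_neighbors=1).fit(all_x, all_y)
predicted_labels = clf.predict(test_x.numpy())
```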
Further, the present invention is compared with 15 methods, including: SSE, LATEM, ALE, DEVISE, SJE, ESZSL, SYNC, SAE, DEM, RelationNet, PSR-ZSL, SP-AEN, CAPD, CVAE and GDAN.
For a fair comparison with the other baselines, a simple 1-NN classifier is applied for testing. GDFN is compared with the state-of-the-art generalized zero-shot learning methods, and the results are shown in Table 2 below.
Table 2: Results of generalized zero-shot learning methods on four benchmark data sets
It can be seen from the results that the present invention achieves good results on the generalized zero-shot learning data sets.
For the CUB data set, the present invention achieves good results on the unseen classes and the highest accuracy on the seen classes. It also performs well in terms of the harmonic mean, which again shows that it keeps a good balance of predictions between the seen and unseen classes, and it shows better performance compared with previous models.
For the AwA2 data set, the present invention performs better than recent methods (for example SP-AEN and PSR-ZSL) in terms of unseen-class accuracy and harmonic mean, and also shows higher accuracy on the seen-class samples.
For the SUN data set, the present invention recognizes both seen-class and unseen-class samples with high accuracy, and shows a clear improvement in the recognition and classification of seen-class samples.
For the aPY data set, the similarity between the attribute variances of the disjoint training images and test images is much smaller than for the other data sets, which indicates that it is difficult to synthesize and classify the unseen classes. Although the prior art has relatively low accuracy for unseen-class recognition, testing the present invention on this data set still yields good results. The prior art achieves high accuracy on the seen-class samples; the present invention also achieves high accuracy on the seen-class samples while striking a balance between the seen and unseen classes, providing a good harmonic-mean accuracy on aPY.
In summary, through knowledge transfer the present invention fuses two kinds of semantic features, attributes and word vectors, trains under an adversarial mechanism to minimize the distribution difference between real and generated samples, and maps visual features back to semantic features through a regression network. This effectively alleviates the problem of the model's prediction results shifting toward neighboring classes, allows hard-to-label examples to be recognized, and reduces the cost of recognition.
The above embodiments are only intended to illustrate, not limit, the technical solutions of the present invention. Although the present invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art should understand that the technical solutions of the present invention may be modified or equivalently replaced without departing from the spirit and scope of the technical solutions of the present invention.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010750578.4A CN111914929B (en) | 2020-07-30 | 2020-07-30 | Zero sample learning method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010750578.4A CN111914929B (en) | 2020-07-30 | 2020-07-30 | Zero sample learning method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111914929A CN111914929A (en) | 2020-11-10 |
CN111914929B true CN111914929B (en) | 2022-08-23 |
Family
ID=73286794
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010750578.4A Active CN111914929B (en) | 2020-07-30 | 2020-07-30 | Zero sample learning method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111914929B (en) |
Families Citing this family (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113191381B (en) * | 2020-12-04 | 2022-10-11 | 云南大学 | Image zero-order classification model based on cross knowledge and classification method thereof |
CN112580722B (en) * | 2020-12-20 | 2024-06-14 | 大连理工大学人工智能大连研究院 | Generalized zero sample image recognition method based on conditional countermeasure automatic encoder |
CN112674709B (en) * | 2020-12-22 | 2022-07-29 | 泉州装备制造研究所 | A method for amblyopia detection based on anti-noise |
CN112766386B (en) * | 2021-01-25 | 2022-09-20 | 大连理工大学 | A generalized zero-shot learning method based on multi-input multi-output fusion network |
CN113222002B (en) * | 2021-05-07 | 2024-04-05 | 西安交通大学 | Zero sample classification method based on generative discriminative contrast optimization |
CN113269274B (en) * | 2021-06-18 | 2022-04-19 | 南昌航空大学 | Zero sample identification method and system based on cycle consistency |
CN113378959B (en) * | 2021-06-24 | 2022-03-15 | 中国矿业大学 | Zero sample learning method for generating countermeasure network based on semantic error correction |
CN113723106B (en) * | 2021-07-29 | 2024-03-12 | 北京工业大学 | Zero sample text classification method based on label extension |
CN114266307B (en) * | 2021-12-21 | 2024-08-09 | 复旦大学 | Method for identifying noise samples in parallel based on non-zero mean shift parameters |
CN114842398B (en) * | 2022-05-23 | 2024-12-17 | 北京邮电大学 | Video action recognition method based on zero sample learning |
CN115424262A (en) * | 2022-08-04 | 2022-12-02 | 暨南大学 | Method for optimizing zero sample learning |
CN116051909B (en) * | 2023-03-06 | 2023-06-16 | 中国科学技术大学 | Direct push zero-order learning unseen picture classification method, device and medium |
CN117893743B (en) * | 2024-03-18 | 2024-05-31 | 山东军地信息技术集团有限公司 | Zero sample target detection method based on channel weighting and double-comparison learning |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107679556A (en) * | 2017-09-18 | 2018-02-09 | 天津大学 | The zero sample image sorting technique based on variation autocoder |
CN109492662A (en) * | 2018-09-27 | 2019-03-19 | 天津大学 | A kind of zero sample classification method based on confrontation self-encoding encoder model |
CN110175251A (en) * | 2019-05-25 | 2019-08-27 | 西安电子科技大学 | The zero sample Sketch Searching method based on semantic confrontation network |
- 2020
- 2020-07-30 CN CN202010750578.4A patent/CN111914929B/en active Active
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107679556A (en) * | 2017-09-18 | 2018-02-09 | 天津大学 | The zero sample image sorting technique based on variation autocoder |
CN109492662A (en) * | 2018-09-27 | 2019-03-19 | 天津大学 | A kind of zero sample classification method based on confrontation self-encoding encoder model |
CN110175251A (en) * | 2019-05-25 | 2019-08-27 | 西安电子科技大学 | The zero sample Sketch Searching method based on semantic confrontation network |
Non-Patent Citations (1)
Title |
---|
Zero-shot image recognition (零样本图像识别); Lan Hong (兰红) et al.; Journal of Electronics & Information Technology (《电子与信息学报》), Issue 05; full text *
Also Published As
Publication number | Publication date |
---|---|
CN111914929A (en) | 2020-11-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111914929B (en) | Zero sample learning method | |
CN111581405B (en) | Cross-modal generalization zero sample retrieval method for generating confrontation network based on dual learning | |
CN111476294B (en) | A zero-sample image recognition method and system based on generative adversarial network | |
CN111368896B (en) | Hyperspectral Remote Sensing Image Classification Method Based on Dense Residual 3D Convolutional Neural Network | |
CN108875818B (en) | A zero-shot image classification method based on the combination of variational autoencoder and adversarial network | |
CN101315663B (en) | A Natural Scene Image Classification Method Based on Regional Latent Semantic Features | |
Gao et al. | Multi‐dimensional data modelling of video image action recognition and motion capture in deep learning framework | |
Shang et al. | Are noisy sentences useless for distant supervised relation extraction? | |
CN109598279B (en) | Zero sample learning method based on self-coding countermeasure generation network | |
CN112949647B (en) | Three-dimensional scene description method and device, electronic equipment and storage medium | |
CN111324765A (en) | Fine-grained sketch image retrieval method based on depth cascade cross-modal correlation | |
CN115861995B (en) | Visual question-answering method and device, electronic equipment and storage medium | |
CN106529605A (en) | Image identification method of convolutional neural network model based on immunity theory | |
CN108765383A (en) | Video presentation method based on depth migration study | |
CN113837229B (en) | Knowledge-driven text-to-image generation method | |
CN111461067B (en) | A zero-sample remote sensing image scene recognition method based on prior knowledge mapping and correction | |
Tran et al. | Aggregating image and text quantized correlated components | |
CN113361646A (en) | Generalized zero sample image identification method and model based on semantic information retention | |
Li et al. | Bidirectional generative transductive zero-shot learning | |
CN114626461A (en) | A cross-domain object detection method based on domain adaptation | |
CN112380374B (en) | A zero-shot image classification method based on semantic augmentation | |
CN108170823A (en) | Hand-drawn interactive three-dimensional model retrieval method based on high-level semantic attribute understanding | |
CN116663539A (en) | Chinese entity and relationship joint extraction method and system based on RoBERTa and pointer network | |
Soysal et al. | An introduction to zero-shot learning: An essential review | |
CN113032601A (en) | Zero sample sketch retrieval method based on discriminant improvement |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||