CN109993197A

CN109993197A - A kind of zero sample multi-tag classification method based on the end-to-end example differentiation of depth

Info

Publication number: CN109993197A
Application number: CN201811495479.5A
Authority: CN
Inventors: 冀中; 李慧慧; 庞彦伟
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2018-12-07
Filing date: 2018-12-07
Publication date: 2019-07-09
Anticipated expiration: 2038-12-07
Also published as: CN109993197B

Abstract

A kind of zero sample multi-tag classification method based on the end-to-end example differentiation of depth, training stage include: that trained more exemplary characteristics extract network；Extract the corresponding label characteristics of training sample；Visual signature for realizing multi-modal fusion, and excavates the incidence relation between label and label, between sample and label to the cross-module state mapping network training in label characteristics space；Constraints module between the label of training sample and each label of the label of test sample；The optimization of the final goal function of training stage.Test phase directly realizes that the classification of zero sample multi-tag includes: to extract network using more exemplary characteristics to extract the more exemplary characteristics of test sample using ad-hoc network acquired in the training stage；Extract the corresponding label characteristics of test sample；The classification of test sample multi-tag.The present invention can realize multi-tag image labeling to unmarked image.

Description

Zero sample multi-label classification method based on deep end-to-end example differentiation

Technical Field

The invention relates to a zero sample multi-label classification method. In particular to a zero sample multi-label classification method based on deep end-to-end example differentiation.

Background

As data information has grown explosively, the motivation for people to intelligently use data and mine it to extract information about its effectiveness has also grown. The ability of machine learning models to model and solve complex tasks makes this study a great advance, for two main reasons: more computing power and more tagged data. The traditional Single-Label image Classification (Single-Label Classification) system refers to labeling a Single image only containing a Single class, and in order to accurately identify a certain class of images, a classifier must be learned according to a known training data set, and then the classifier is used for classifying test images, wherein the class of the test images is determined to appear in a training stage. In practical situations, training data and labeling information are often difficult to obtain, and on one hand, things in the world are very various and are continuously increasing; on the other hand, for a certain category, it can be further subdivided into many subclasses. Thus it can be seen that visual recognition systems are generally limited to training sample classes and model expansion capabilities are affected. To solve this problem, early studies proposed classifying unseen classes during training using auxiliary semantic information such as text, which is called Zero-sample Learning (Zero-Shot Learning) and is derived from the ability of human beings to recognize new things just by description. At present, a zero sample learning technology is mainly used for a single-label image classification task, in practical application, different regions of one image often correspond to a plurality of classes, and how to classify the regions into one of the classes is a multi-label image classification technology, namely, the zero sample multi-label image classification task, which can meet the actual requirement and solve the problem of label missing.

The zero-sample multi-label classification task is more challenging compared with two sub-problems, and specifically, challenges of zero-sample learning, such as a semantic gap problem, a domain offset problem and a hubness problem, are faced; the semantic explosion problem in the multi-label classification task is faced; in addition, the multi-label zero-sample image classification task needs to consider not only the complex semantic relation between the seen classes, but also the semantic relation between the unseen classes. For example, for a given multi-label observation sample x, the number of classes included is n, and conventional multi-label image classification considers the problem as n independent single-label classification problems, the process is redundant and has low precision, how to efficiently and accurately realize class labeling is critical to effectively utilize the internal association between images and classes, and thus the zero-sample multi-label classification problem mainly solves two key problems: (1) a cross-modal mapping model from a sample x visual representation to a corresponding multi-label semantic representation realizes knowledge transfer between a known class and an unknown class and simultaneously establishes visual and semantic association; (2) and the interrelationship between classes and images and between classes is reasonably modeled, and the high-efficiency and accurate multi-label classification is realized.

Representation Learning (Representation Learning) refers to a general term of a Learning technology for Learning a feature Representation, and in the deep Learning field, refers to a way that a sample x is effectively characterized in a certain form, and three common data ways for a computer in deep Learning are as follows: the method comprises the steps of local expression, coefficient expression and distributed expression, and typical expression learning models comprise a CNN network supervised feature extraction, an unsupervised feature characterization based on variational self-coding and Boltzmann machine, some fine-ting semi-supervised learning mechanisms and the like. One of the main reasons for the strong modeling and knowledge extraction capabilities of deep learning is to perform effective expression on an observation sample, and therefore an effective expression is important for simplifying the learning task and improving the learning performance. The most intuitive way to represent an efficient evaluation of a learning model is to use the features proposed by the model for classification, such as feature extraction based on CNN and performance evaluation by softmax classification. The distributed feature expression of auxiliary semantic information in zero sample learning, namely a Word vector method (common models Word2Vec and Glove), is effective embodiment for representing learning; the other type of middle layer auxiliary semantic information, namely attribute characteristics, belongs to a coefficient expression mode in the expression learning; visual feature extraction based on VGG networkI.e. a typical representation learning method, the visual characteristics of the sample x are characterized by R^DAnd representing the feature vector as D dimension. For a multi-label image, besides reasonable representation of auxiliary semantic information of labels, due to the fact that the multi-label image has various targets and abundant features, the single-dimensional feature expression capability of a classical CNN network is not enough, and richer multi-channel and multi-dimensional visual feature representations and corresponding representation learning models are needed.

Multi-example Multi-Label learning (MIML) is mainly directed to scenes where objects have different targets and different categories, such as text classification, where each document has partial sentences as examples and corresponds to many categories. When a common machine learning technology solves a practical problem, it is a common practice to extract object features, describe the object with a feature vector, so as to obtain an instance (instance), and then associate the instance with a class label (label) corresponding to the object, when a larger instance set is available, a certain learning algorithm can be used to learn a mapping between an instance space and a label space (or a label word vector space), and the mapping can predict that the corresponding label of the instance is not seen. However, real-world objects often have multiple semantics, for example, an image includes "elephant", "blue sky", "white cloud", and "grassland", a feature vector is extracted for the image to obtain an example, n classifiers are respectively trained to recognize, that is, it is inefficient to find a one-to-one classification relationship; the efficiency of training an n-way classifier, namely seeking one-to-many classification relation is improved, but the key point is that the visual feature representation is single, each class corresponds to the same visual feature vector, and the n-way classifier has no discriminability and interpretability. Therefore, an example differentiation idea is provided, different targets of the complex object are represented as different example feature vectors, and at the moment, a many-to-one or many-to-many strategy is adopted for classification, so that the actual scene requirements are further met.

The traditional zero sample classification method seeks a one-to-many relationship, the classification method ignores rich information in multi-label sample images, the sample visual feature characterization is excessively simplified, partial useful information is lost, and subsequent learning and classification stages face constraint and difficulty. The invention utilizes the deep learning network to learn the multi-example characteristic representation form of the complex image, fully utilizes the relation between the multi-label sample image and the category and between the categories, improves the existing multi-label image classification technology, realizes the classification technology suitable for the zero-sample multi-label image, improves the image labeling precision and solves the label missing problem to a certain extent.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a zero sample multi-label classification method based on deep end-to-end example differentiation.

The technical scheme adopted by the invention is as follows: a zero sample multi-label classification method based on deep end-to-end example differentiation comprises a training stage and a testing stage, wherein,

the training stage is to obtain an end-to-end network, and the end-to-end network is composed of a multi-example feature extraction network, a cross-modal mapping network, labels of training samples and constraint modules among the labels of testing samples; the training phase specifically comprises:

11) training a multi-example feature extraction network;

12) extracting label features corresponding to the training samples;

13) performing cross-modal mapping network training of visual features to a label feature space, and mining incidence relations between labels and between samples and labels;

14) a constraint module between the label of the training sample and each label of the testing sample;

15) optimizing a final objective function in a training phase;

the testing stage is to directly utilize the end-to-end network obtained in the training stage to realize zero-sample multi-label classification; the testing stage specifically comprises:

21) extracting multi-example features of the test sample by using a multi-example feature extraction network;

22) extracting label features corresponding to the test samples;

23) and (4) multi-label classification of the test sample.

The multi-example feature extraction network in the step 11) takes the network structures of the last third layer and the layers before the last third layer of the VGG-16 network as the multi-example feature extraction network, wherein the output of the 3-dimensional convolution layer of the last third layer is taken as the multi-example visual feature x of the training sample_i∈R^t×pWhere i is 1, …, n, p is the visual feature space dimension of multiple examples, n is the number of training samples, and t is the number of multiple examples of each training sample.

Step 12) inputting the label information of the image into the distributed language model to obtain training sample label semantic characteristics Y ═ Y₁,…y_s]∈R^q×sAnd q is the dimensionality of the semantic vector, and s is the number of labels corresponding to the training samples.

Step 13) comprises: after cross-modal feature transformation is carried out on the training samples through a cross-modal mapping network W, the assumption is satisfied in a semantic space: the similarity between the label related to the training sample and the training sample is large, and the similarity between the label unrelated to the training sample and the training sample is small, namely the multi-example visual feature of the training sample after the cross-modal feature transformationSemantic feature y with tag_jPerforming similarity measurement:

where F represents the training sample and label correlation score vector, x_iFor multiple instances of training samplesA visual characteristic;

multi-instance visual feature x of training sample_iSemantic feature y with tag_jAverage score f (x) with similarity_i,y_j) Is represented as follows:

f(x_i,y_j)＝avg_tF(x_i,y_j)

where t is the number of multiple instances of each training sample,

the labels related to the training samples, the labels not related to the training samples and the training samples satisfy the following relationship:

wherein ,labels unrelated to the training samples;

based on the maximum interval frame of multi-label learning, the optimal objective function of all training samples existing in the training stage is as follows:

wherein ,the loss function is ordered for the largest interval,in order to be a term of regularization,for norm, W is the cross-modal mapping network, M is the multi-instance feature network, and λ is for balancing the regularization term and ordering loss functionParameters, wherein maximum interval orders a loss functionSatisfies the following conditions:

wherein ,f₀The similarity threshold can be adjusted experimentally.

The optimized objective functions of all the training samples realize multi-modal mapping from the visual space of the training samples to the semantic space of the training samples, and the association information between the labels corresponding to the training samples and the sequencing information between the labels are unified to the optimized objective functions for optimization.

Step 14) the constraint module comprises:

(1) and a similarity constraint module among the labels:

wherein ,the similarity relation among the labels is obtained through a WordNet dictionary method, the WordNet dictionary method is based on a tree-shaped hierarchical structure, the similarity relation among the labels is reflected through a connection path among the labels, and the similarity among the labels is defined as the reciprocal of the path length path _ len (h, z);

(2) a statistical symbiosis constraint module among the labels:

wherein ,for the statistical symbiotic relationship among the labels, HC (h, z) represents the co-occurrence times of the label h and the label z, HC (h) represents the co-occurrence times of the label h, and HC (z) represents the co-occurrence times of the label z.

Step 15) combining the incidence relation between the original image of the training sample and the label of the training sample and the relation between the label of the training sample and the label of the test sample to obtain a final objective function as follows:

wherein ,the loss function is ordered for the largest interval,in order to be a term of regularization,to norm, W is the cross-modal mapping network, M is the multi-instance feature network, λ is a parameter used to balance the regularization term and the ordering loss function,similarity constraint module between labels of training sample and test sampleOr a statistical symbiosis constraint module between labels of the training sample and the test sampleWherein the maximum interval orders the penalty functionSatisfies the following conditions:

wherein ,f₀The similarity threshold can be adjusted according to experiments;

the final goal of the training phase is to optimize the end-to-end network, resulting in the cross-modal mapping network W and the multi-instance feature network M.

Step 21) includes: inputting the test sample image into the trained multi-example feature extraction network to obtain the multi-example feature x 'of the test sample'_l∈R^r×pAnd l is 1, …, m is the number of test samples, p is the visual feature space dimension of multiple examples, and r is the number of multiple examples of each test sample.

Step 22) inputting the label information of the image into the distributed language model to obtain the semantic feature Y' ═ Y of the label of the test sample₁’,…,y_u’]∈R^q×uAnd q is the semantic vector dimension, and u is the number of labels corresponding to the test sample.

Step 23) is a test sample multiple example feature x'_lMapping to test sample tag semantic feature spaceW is a cross-modal mapping network, and the multi-example characteristics of the mapped test sample are directly subjected toAnd test sample candidate tag y'_oO 1, …, u performs a similarity measure:

r(x’_l) For the similarity degree of the multi-example characteristics of the mapped test sample and any test sample candidate label, the similarity degree is obtainedIf the similarity degree is greater than a set threshold value, testing multiple example characteristics x 'of the sample'_lContaining the candidate tag.

The zero sample multi-label classification method based on the deep end-to-end example differentiation analyzes the feasibility and the limitation of the existing scheme aiming at the problem of multi-label zero sample image classification, improves the characteristic representation mode of the complex scene image, fully excavates the ambiguity of the image, and realizes the association with the semantic word vector among the multiple labels on the basis, so that the multi-label image labeling can be realized on the unmarked image. The advantages are mainly reflected in that:

(1) the novelty is as follows: the zero-sample multi-label image classification aims at classifying and labeling unseen classes, the label information and samples of a complex actual scene graph are difficult to obtain, the classification purpose is realized by means of intermediate auxiliary information in combination with a zero-sample learning thought, the zero-sample multi-label image classification is a bold attempt in the field of multimedia understanding, a conventional visual feature characterization mode is broken, example differentiation segmentation is carried out on image information, and the zero-sample multi-label image classification is a new breakthrough and attempt for task research.

(2) Multimode property: zero sample learning belongs to the field of multi-modal learning, and auxiliary information obtained from other channels besides visual information is required to realize learning and prediction of unseen classes, so that the zero sample learning has multi-modal attitude no matter the problem of single-label or multi-label zero sample classification. Particularly, the method relates to two modalities of vision and semantics, and relates to the knowledge fields of transfer learning, cross-modality learning and the like.

(3) End-to-end: the three functions of multi-example feature characterization learning, multi-example multi-label classification and multi-mode mapping are unified in the same network framework, and a unique target constraint function is used for uniformly adjusting network parameters to achieve the optimal classification performance.

(4) The practicability is as follows: the single-label zero-sample image classification is suitable for each sample image only corresponding to a single labeling category, images in actual life often contain complex background information of multiple categories, and the multi-label zero-sample image classification is used for labeling more complex scene images according to the actual situation and better meets the actual scene requirements.

Drawings

FIG. 1 is a block diagram of a framework for an end-to-end network in accordance with the present invention;

FIG. 2 is a flow chart of a training process for solving the multi-label zero-sample classification problem in the present invention.

Detailed Description

The following describes a zero sample multi-label classification method based on deep end-to-end example differentiation according to the present invention in detail with reference to the following embodiments and the accompanying drawings.

The zero sample multi-label classification method based on deep end-to-end example differentiation comprises a training stage and a testing stage, wherein,

the training stage is to obtain an end-to-end network, and the end-to-end network is composed of a multi-example feature extraction network, a cross-modal mapping network, and constraint modules among labels of training samples and labels of testing samples, as shown in fig. 1; the training phase is shown in fig. 2, and specifically includes:

11) training a multi-example feature extraction network;

the multi-example feature extraction network takes network structures of a last three layers and before the last three layers of the VGG-16 network as the multi-example feature extraction network, wherein the output of a 3-dimensional convolution layer of the last three layers is taken as the multi-example visual feature x of the training sample_i∈R^t×pWhere i is 1, …, n, p is the visual feature space dimension of multiple examples, n is the number of training samples, and t is the number of multiple examples of each training sample.

12) Extracting label features corresponding to the training samples; inputting label information of an image into a distributed language model to obtain training sample label semantic characteristics Y ═ Y₁,…y_s]∈R^q×sAnd q is the dimensionality of the semantic vector, and s is the number of labels corresponding to the training samples.

13) Training a cross-modal mapping network from the visual features to a label feature space, realizing multi-modal fusion, and mining incidence relations between labels and between samples and labels; the method comprises the following steps:

after cross-modal feature transformation is carried out on the training samples through a cross-modal mapping network W, the assumption is satisfied in a semantic space: the similarity between the label related to the training sample and the training sample is large, and the similarity between the label unrelated to the training sample and the training sample is small, namely the multi-example visual feature of the training sample after the cross-modal feature transformationSemantic feature y with tag_jPerforming similarity measurement:

where F represents the training sample and label correlation score vector, x_iMulti-example visual features for training samples;

f(x_i,y_j)＝avg_tF(x_i,y_j)

where t is the number of multiple instances of each training sample,

wherein ,labels unrelated to the training samples;

wherein ,the loss function is ordered for the largest interval,in order to be a term of regularization,is a norm, W is a cross-modal mapping network, M is a multi-instance feature network, and λ is a parameter used to balance the regularization term and the ordering loss function, wherein the maximum interval ordering loss functionSatisfies the following conditions:

wherein ,f₀The similarity threshold can be adjusted experimentally.

14) A constraint module between the label of the training sample and each label of the testing sample; the method comprises the following steps:

(1) and a similarity constraint module among the labels:

(2) a statistical symbiosis constraint module among the labels:

15) Optimizing a final objective function in a training phase;

the final objective function is obtained by combining the incidence relation between the original image of the training sample and the label of the training sample and the relation between the label of the training sample and the label of the test sample as follows:

wherein ,f₀The similarity threshold can be adjusted according to experiments;

21) extracting multi-example features of the test sample by using a multi-example feature extraction network; the method comprises the following steps:

inputting the test sample image into the trained multi-example feature extraction network to obtain the multi-example feature x 'of the test sample'_l∈R^r×pAnd l is 1, …, m is the number of test samples, p is the visual feature space dimension of multiple examples, and r is the number of multiple examples of each test sample.

22) Extracting label features corresponding to the test samples; inputting label information of an image into a distributed language model to obtain a semantic feature Y' [ < Y > ] of a label of a test sample₁’,…,y_u’]∈R^q×uAnd q is the semantic vector dimension, and u is the number of labels corresponding to the test sample.

23) And (4) multi-label classification of the test sample.

Is to test sample multiple example feature x'_lMapping to test sample tag semantic feature spaceW is a cross-modal mapping network, and the multi-example characteristics of the mapped test sample are directly subjected toAnd test sample candidate tag y'_oO 1, …, u performs a similarity measure:

r(x’_l) For the similarity degree of the multi-example characteristics of the mapped test sample and any test sample candidate label, when the similarity degree is more than a set threshold value, determining that the similarity degree is more than the set threshold valueTest sample Multi-example feature x'_lContaining the candidate tag.

Claims

1. A zero sample multi-label classification method based on deep end-to-end example differentiation is characterized by comprising a training stage and a testing stage, wherein,

11) training a multi-example feature extraction network;

12) extracting label features corresponding to the training samples;

15) optimizing a final objective function in a training phase;

22) extracting label features corresponding to the test samples;

23) and (4) multi-label classification of the test sample.

2. The method according to claim 1, wherein the multi-instance feature extraction network in step 11) uses network structures at last three layers and before last three layers of the VGG-16 network as the multi-instance feature extraction network, and wherein the output of the 3-dimensional convolution layer at last three layers is used as the multi-instance visual feature x of the training sample_i∈R^t×pWhere i is 1, …, n, p is the visual feature space dimension of multiple examples, n is the number of training samples, and t is the number of multiple examples of each training sample.

3. The method according to claim 1, wherein the step 12) is to input label information of the image into the distributed language model to obtain training sample label semantic feature Y ═ Y₁,…y_s]∈R^q×sAnd q is the dimensionality of the semantic vector, and s is the number of labels corresponding to the training samples.

4. The method according to claim 1, wherein the step 13) comprises: after cross-modal feature transformation is carried out on the training samples through a cross-modal mapping network W, the assumption is satisfied in a semantic space: the similarity between the label related to the training sample and the training sample is large, and the similarity between the label unrelated to the training sample and the training sample is small, namely the multi-example visual feature of the training sample after the cross-modal feature transformationSemantic feature y with tag_jPerforming similarity measurement:

f(x_i,y_j)＝avg_tF(x_i,y_j)

where t is the number of multiple instances of each training sample,

wherein ,labels unrelated to the training samples;

wherein ,f₀The similarity threshold can be adjusted experimentally.

5. The method according to claim 1, wherein the constraint module of step 14) comprises:

(1) and a similarity constraint module among the labels:

(2) a statistical symbiosis constraint module among the labels:

6. The method according to claim 1, wherein the step 15) is to combine the correlation between the original image of the training sample and the label of the training sample and the relationship between the label of the training sample and the label of the test sample to obtain a final objective function as follows:

wherein ,f₀The similarity threshold can be adjusted according to experiments;

7. The method according to claim 1, wherein the step 21) comprises: inputting the test sample image into the trained multi-example feature extraction network to obtain the multi-example feature x 'of the test sample'_l∈R^r×pAnd l is 1, …, m is the number of test samples, p is the visual feature space dimension of multiple examples, and r is the number of multiple examples of each test sample.

8. The method according to claim 1, wherein step 22) is to input label information of the image into the distributed language model to obtain a semantic feature Y' ═ Y of the label of the test sample in the distributed language model₁’,…,y_u’]∈R^q×uAnd q is the semantic vector dimension, and u is the number of labels corresponding to the test sample.

9. The method of zero-sample multi-label classification based on deep end-to-end example differentiation according to claim 1, characterized in that step 23) is testing sample multi-example feature x'_lMapping to test sample tag semantic feature spaceW is a cross-modal mapping network, and the multi-example characteristics of the mapped test sample are directly subjected toAnd test sample candidate tag y'_oO 1, …, u performs a similarity measure:

r(x’_l) The similarity degree of the mapped test sample multi-example feature and any test sample candidate label is determined, and when the similarity degree is greater than a set threshold value, the test sample multi-example feature x'_lContaining the candidate tag.