CN112418257B - Effective zero sample learning method based on potential visual attribute mining - Google Patents
Effective zero sample learning method based on potential visual attribute mining
- Publication number
- Publication number: CN112418257B; Application number: CN201910778304.3A
- Authority
- CN
- China
- Prior art keywords
- visual
- domain
- potential
- feature
- attribute
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/28—Determining representative reference patterns, e.g. by averaging or distorting; Generating dictionaries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The invention discloses a zero sample learning method based on potential visual attribute mining. The method mainly comprises the following steps. First, a feature mapping matrix for the manually defined attribute representation is computed in the original feature domain, and a set of potential visual attributes is mined by optimizing a sparse dictionary model. Next, a pairwise visual dictionary model is constructed, and the mapping relation learned in the original feature domain is used as constraint information to guide learning in the target feature domain. Finally, deep residual network (ResNet) visual features are extracted from the input image, a discriminative semantic representation is obtained, and the image label category is predicted. The method requires no manual intervention and offers a high recognition rate and strong robustness; it has practical value for recognizing semantic objects under few-sample and zero-sample conditions and can be applied in practice.
Description
Technical Field
The invention provides a zero sample learning method based on potential visual attribute mining, which solves the task of recognizing semantic objects in images under few-sample and zero-sample conditions and is a novel technology in the field of image recognition.
Background
With the rapid development of digital multimedia technology, the image and video data that people encounter in daily life have grown explosively, pushing society into the big-data era. Amid this flood of multimedia data, effectively analyzing and understanding the content of visual data has become increasingly important. To this end, many computer vision methods proposed in recent years have made progress in feature extraction, strongly supervised learning, and object model construction, and have been successfully applied to multimedia data retrieval, scene analysis, textual description of visual information, and other everyday applications. However, because of the semantic gap between the low-level features of visual data and its mid- and high-level information, existing methods still progress slowly on key problems such as learning from few or zero samples, semantic attribute mining, and cross-feature-domain object models.
The recognition of semantic object information in images plays an important role in multimedia data analysis and understanding. With the construction of large-scale image databases and the wide application of convolutional neural networks, object recognition methods have improved rapidly. However, current object recognition methods still have the following disadvantages. First, conventional visual recognition algorithms are based on supervised learning and require a large amount of labeled image data to train a reasonably robust classification model; in practice, acquiring large amounts of labeled data is often difficult, especially for samples that require expert knowledge for fine-grained labeling and for samples of rare classes. Second, existing object recognition algorithms are limited to the object classes seen during training: newly introduced object classes cannot be recognized effectively, which limits the extensibility of these methods. In addition, traditional object recognition algorithms are inconsistent with the human cognitive mechanism. Humans recognize semantic objects based on differences between similar object classes and transfer this knowledge to the discrimination of new object classes; for example, a child who has never seen a zebra can easily identify the animal by observing how its skin texture differs from that of an ordinary horse. To address these problems, zero-sample object recognition methods have been proposed in recent years. By drawing on this discriminative visual cognition mechanism and extracting mid- and high-level semantic information, such methods can effectively identify newly introduced object classes that did not appear during training.
However, current research on zero-sample object recognition still faces the following open problems. First, methods for mining potential visual attributes are lacking: existing zero-sample object recognition algorithms are usually based only on manually defined attribute representations and neglect the exploration of discriminative visual attributes, so it is difficult to improve the semantic quality of the attribute-space feature representation. In addition, current zero-sample learning methods lack object recognition models with cross-feature-domain adaptability: when recognizing the category of a newly introduced object, existing methods directly apply the mapping relation of the original feature domain to the target domain, with the defect that the feature transformation has no domain adaptability.
Disclosure of Invention
The technical problem to be solved is as follows:
Aiming at the shortcomings of current zero-sample-learning-based object recognition methods in the discriminative power of image feature representations and in cross-feature-domain adaptability, the invention analyzes and solves the problem from the more practically valuable perspective of mining the potential visual attributes of image data. Compared with existing zero-sample-learning-based object recognition algorithms, the main advantages of the method are as follows. First, it solves the problem that existing algorithms rely only on manually defined attributes, which leaves the semantic information of the feature representation deficient: the method mines the potential visual attributes of the image data and, combined with the manually defined attributes, establishes the mapping from the visual feature space to the semantic attribute space through joint optimization of the objective function, enhancing the discriminability of the image feature representation. Second, the method uses the mapping relation learned in the original feature domain as a regularization constraint term to guide the learning of the mapping function from visual features to the semantic attribute representation in the target feature domain, improving the adaptability of the object recognition model during feature-domain migration and the generalization performance of the zero-sample learning algorithm.
The technical scheme is as follows:
In order to establish the correlation between image visual features and the semantic attribute representation, the training stage of the method comprises two parts: training in the original feature domain (known object classes) and training in the target feature domain (unknown object classes), where the two sets of classes have no intersection. First, by constructing the optimization model of the following objective function, a mapping function from visual features to the manually defined attribute representation is learned during original-feature-domain training. The mathematical form of the objective function is as follows:
where X_s is the visual feature matrix of all training samples in the original feature domain (all image visual features in the method are extracted with a deep residual network, ResNet); A_s is the manually defined attribute representation matrix of all training samples; D_s is the mapping from visual features to the manually defined attribute representation to be solved by optimization; and d_i denotes the i-th column vector of D_s. For the least-mean-square optimization problem defined by the above objective function, a closed-form solution for the mapping matrix D_s can be obtained in the following mathematical form.
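The displayed formula is omitted from this text, but the description (a least-mean-square problem over the mapping matrix D_s with attribute codes A_s, admitting a closed-form solution) is consistent with the standard formulation min ||X_s − D_s A_s||_F². Under that assumption, the closed-form solution can be sketched as follows; the function name and toy data are illustrative, not taken from the patent:

```python
import numpy as np

def learn_mapping_dictionary(X_s, A_s, eps=1e-6):
    """Closed-form least-squares solution for the mapping matrix D_s,
    assuming the (omitted) objective is min ||X_s - D_s A_s||_F^2:
    columns of X_s are visual features, columns of A_s the manually
    defined attribute vectors of the same training samples."""
    m = A_s.shape[0]
    # normal equations: D_s = X_s A_s^T (A_s A_s^T + eps I)^(-1)
    return X_s @ A_s.T @ np.linalg.inv(A_s @ A_s.T + eps * np.eye(m))

# toy check: 3 manual attributes, 4-dim features, 6 samples
rng = np.random.default_rng(0)
A_s = rng.standard_normal((3, 6))
D_true = rng.standard_normal((4, 3))
X_s = D_true @ A_s            # features generated exactly by D_true
D_s = learn_mapping_dictionary(X_s, A_s)
```

With exact data the recovered mapping reproduces X_s from A_s; the small ridge term eps only guards against a singular A_s A_s^T.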
The manually defined attribute representation is mainly based on characteristics shared among object classes. To obtain a more discriminative semantic feature representation, the method further mines the potential visual attributes of all object classes, beyond the manually defined attributes, so as to improve the discriminability of image visual features after they are mapped into the semantic attribute space. Specifically, in the original feature domain, the potential visual attributes of the image data are mined by optimizing the following objective function.
where D_s^v is the set of potential visual attributes learned in the original feature domain; d_j is the j-th column vector of D_s^v, i.e., the j-th potential visual attribute; Y_s^v is the representation coefficient matrix of the visual feature matrix X_s over D_s^v and describes the correlation between visual features and potential visual attributes; and the parameter λ_1 balances the regularization constraint in the objective function. The sparse representation model defined by the above objective function can be solved iteratively for D_s^v and Y_s^v in an alternating optimization manner.
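Since the objective function itself is omitted, the following sketch assumes the standard sparse dictionary model min ||X_s − D Y||_F² + λ₁||Y||₁ with unit-norm-bounded atoms, which matches the description of an alternating, iterative solution; all names and parameter values are illustrative:

```python
import numpy as np

def soft_threshold(Z, t):
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def mine_latent_attributes(X, n_atoms, lam=0.1, n_outer=20, seed=0):
    """Alternating optimization for the assumed model
    min_{D,Y} ||X - D Y||_F^2 + lam ||Y||_1  s.t. ||d_j||_2 <= 1:
    ISTA steps update the codes Y; a ridge least-squares step followed
    by column renormalization updates the latent-attribute dictionary D."""
    rng = np.random.default_rng(seed)
    D = rng.standard_normal((X.shape[0], n_atoms))
    D /= np.linalg.norm(D, axis=0)
    Y = np.zeros((n_atoms, X.shape[1]))
    for _ in range(n_outer):
        L = np.linalg.norm(D, 2) ** 2 + 1e-8   # step-size constant from the spectral norm
        for _ in range(10):                    # a few ISTA steps on the codes Y
            Y = soft_threshold(Y - D.T @ (D @ Y - X) / L, lam / (2 * L))
        # dictionary update with a small ridge for numerical stability
        D = X @ Y.T @ np.linalg.inv(Y @ Y.T + 1e-6 * np.eye(n_atoms))
        D /= np.maximum(np.linalg.norm(D, axis=0), 1.0)  # enforce ||d_j||_2 <= 1
    return D, Y

X = np.random.default_rng(1).standard_normal((6, 30))
D_v, Y_v = mine_latent_attributes(X, n_atoms=8, lam=0.05)
```

The columns of D_v play the role of the mined potential visual attributes and Y_v the role of the representation coefficients described above.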
Using the mapping matrix D_s and the visual attribute set D_s^v extracted in the original feature domain, the method further establishes the association between visual features and the semantic attribute representation for the target feature domain while mining potential discriminative visual attributes. Specifically, the method constructs a cross-feature-domain pairwise visual dictionary learning model that jointly solves the feature mapping matrix and extracts the potential visual attributes. The specific mathematical form of the optimization model is as follows:
in the above formula, X t Training the feature matrix of the sample for the target domain, Y t Artificially defined attribute representation of training samples for a target domain, D t A mapping of visual features in the target domain to manually defined attribute representations is described,for a mined set of target domain potential visual attributes, ->Represents X t A representation coefficient on a visual property.
In the target-domain training model, the first two cost terms are data reconstruction residuals, which ensure that the learned mapping matrix D_t and potential visual attributes D_t^v can effectively reconstruct the input feature data X_t. The third cost term makes the manually defined attribute representation and the potential visual attribute representation corresponding to the same visual features similar; its purpose is to address the inconsistency of feature distributions between the original and target feature domains. The dictionary model also introduces cost terms that use D_s and D_s^v, learned in the original feature domain, as constraint information to guide the solution of the feature mapping matrix D_t and the extraction of the potential visual attributes D_t^v in the target feature domain, thereby addressing domain adaptability across feature domains. In addition, a regularization term in the model ensures the validity of the final solution of the objective function, and the parameters α, β, ρ, μ_1, μ_2, and μ_3 balance the different cost terms of the dictionary model. The dictionary learning objective function is a multivariable optimization problem, and the method performs iterative computation with an alternating optimization strategy: when one variable of the objective function is optimized, the other variables are fixed, so the original optimization problem is converted into several optimization sub-problems that are solved in turn.
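The exact form of the cross-domain cost terms is not shown in this text; a common choice consistent with the description is a Frobenius-norm penalty pulling D_t towards D_s. Under that assumption, the dictionary sub-problem of the alternating scheme has a closed-form update, sketched below with illustrative names and data:

```python
import numpy as np

def update_target_dictionary(X_t, A_t, D_s, mu=1.0, eps=1e-6):
    """One closed-form sub-problem of the alternating scheme, assuming the
    cross-domain cost term is mu * ||D - D_s||_F^2: minimizes
    ||X_t - D A_t||_F^2 + mu ||D - D_s||_F^2 over D, so the source-domain
    mapping D_s pulls the target-domain mapping towards it."""
    m = A_t.shape[0]
    # setting the gradient to zero: D (A_t A_t^T + mu I) = X_t A_t^T + mu D_s
    return (X_t @ A_t.T + mu * D_s) @ np.linalg.inv(A_t @ A_t.T + (mu + eps) * np.eye(m))

rng = np.random.default_rng(2)
A_t = rng.standard_normal((3, 10))
D_true = rng.standard_normal((5, 3))
X_t = D_true @ A_t                      # target features generated exactly by D_true
D_s = rng.standard_normal((5, 3))       # stand-in for the source-domain mapping
D_fit = update_target_dictionary(X_t, A_t, D_s, mu=1e-8)  # data term dominates
D_reg = update_target_dictionary(X_t, A_t, D_s, mu=1e8)   # constraint term dominates
```

The two extreme settings of mu show the trade-off the patent describes: a small mu fits the target data, a large mu keeps the target mapping close to the source mapping.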
In the testing stage, given an input image, its deep residual network (ResNet) features are first extracted, and the semantic attribute representation of the input image is then computed by optimizing the following objective function, whose mathematical form is:
in the formula, x is the depth characteristic of the input image, y is the characteristic x in a mapping matrix D t The above expression coefficient, i.e. the semantic attribute representation of the input visual feature x, the parameter λ 2 The method is used for balancing the residual error item of data reconstruction and coefficient sparsity constraint. In the method, for solving the objective function, a Feature-signature search (Feature-signature) algorithm is specifically adopted to calculate the variable y to be optimized. After the semantic attribute representation y of the input image is obtained, the semantic label category of the input image is further obtained through a Nearest Neighbor (NN) search algorithm, and finally the problem of identification of the zero sample object category is solved.
[1] Sparse representation model, see M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. Signal Process., vol. 54, pp. 4311-4322, 2006.
[2] Deep residual network (ResNet), see K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition, pages 770-778, 2016.
[3] Feature-sign search algorithm, see H. Lee, A. Battle, R. Raina, and A. Y. Ng, "Efficient sparse coding algorithms," in Advances in Neural Information Processing Systems, pages 801-808, 2007.
[4] Zero-sample method DAP, see C. H. Lampert, H. Nickisch, and S. Harmeling, "Attribute-based classification for zero-shot visual object categorization," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 36, pp. 453-465, 2014.
[5] Zero-sample method SAE, see E. Kodirov, T. Xiang, and S. Gong, "Semantic autoencoder for zero-shot learning," in IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[6] Zero-sample method ESZSL, see B. Romera-Paredes and P. H. S. Torr, "An embarrassingly simple approach to zero-shot learning," in International Conference on Machine Learning, pages 2152-2161, 2015.
Has the advantages that:
the effective zero sample learning method based on potential visual attribute mining provided by the invention has the characteristics of no manual participation, high classification accuracy and the like. The main innovation points of the invention are the proposed potential visual attribute mining method and the constructed cross-feature-domain pairwise visual dictionary learning model. The method provided by the invention is different from the current zero sample learning method which only depends on manually defined attributes by mining the potential visual attributes of the image data, and overcomes the problem of lack of semantic expression of the current image data features. The inventive method is practical and effective.
Drawings
FIG. 1 is a flow chart of a zero sample learning method based on potential visual attribute mining
FIG. 2 evaluation of accuracy of the method of the present invention and existing zero sample learning methods on the AWA2 object recognition database
FIG. 3 recognition accuracy of the method of the present invention on typical object class images
Detailed Description
The method is implemented on the Matlab R2016b experimental platform and, as shown in FIG. 1, mainly comprises four steps: learning the mapping matrix for the manually defined attribute representation in the original feature domain, mining the potential visual attributes of the original feature domain, learning the pairwise visual dictionary in the target feature domain, and predicting the semantic label of the input image. The specific steps are as follows:
step one, learning a mapping matrix represented by an original feature domain manually defined attribute:
According to the visual feature data X_s of the training samples in the original feature domain and the manually defined attribute representation A_s corresponding to the sample data, calculate the mapping matrix D_s between the image visual features of the original feature domain and the manually defined attribute representation. Specifically, D_s is obtained by optimizing the following objective function:
step two, mining the potential visual attributes of the original characteristic domain:
The potential visual attributes of the image data in the original feature domain are mined by optimizing the following sparse dictionary learning model.
Step three, learning a target feature domain pair visual dictionary:
Construct a cross-feature-domain pairwise visual dictionary learning model in the target feature domain, and use the mapping matrix D_s and the potential visual attribute set D_s^v learned in the original feature domain to guide the solution of the mapping function between image features and the semantic attribute representation while mining the potential visual attributes. The objective function of the pairwise visual dictionary model is as follows:
step four, input image semantic label prediction:
Input an image, extract its deep residual network (ResNet) feature x, and calculate the semantic attribute representation of the input image by optimizing the following objective function:
The solved semantic attribute representation y is then used to obtain the semantic label category of the input image through a nearest neighbor (NN) search algorithm, realizing zero-sample object category recognition.
The invention is characterized by step two and step three; any use of step two and step three falls within the protection scope of the invention.
Claims (5)
1. An effective zero sample learning method based on potential visual attribute mining is characterized by comprising the following steps:
(1) First, for all training sample images of the image data set in the original feature domain, extract deep residual network (ResNet) visual features; then input the visual features into the objective function of a least-mean-square problem and solve the mapping matrix for the manually defined attribute representation of the original feature domain;
(2) In order to mine the potential visual attributes of the image data, input the training sample image features extracted in step (1) into the original-feature-domain sparse dictionary model to obtain the set of potential visual attributes of the original feature domain; in the mining formula for the potential visual attributes of the image data, X_s is the visual feature matrix of the training images, D_s^v is the set of potential visual attributes learned in the original feature domain, Y_s^v is the representation coefficient of X_s over D_s^v, and the parameter λ_1 balances the regularization constraint in the objective function;
(3) Input the mapping matrix and the potential visual attribute set learned in steps (1) and (2) into the constructed pairwise visual dictionary model, and solve the mapping matrix and the potential visual attribute set of the target feature domain; the objective function of the pairwise visual dictionary model is as follows:
where X_t is the feature matrix of the target-domain training samples, Y_t is the manually defined attribute representation of the target-domain training samples, D_t describes the mapping from visual features in the target domain to the manually defined attribute representation, D_t^v is the mined set of target-domain potential visual attributes, Y_t^v is the representation coefficient of X_t over the visual attributes, and the parameters α, β, ρ, μ_1, μ_2, and μ_3 balance the regularization term and the different cost terms of the dictionary model;
(4) Extract deep residual network (ResNet) features from the test image, input them into the target-feature-domain sparse dictionary model, solve the semantic feature representation of the test image, and predict the class label of the input image through a nearest neighbor (NN) search algorithm.
2. The effective zero sample learning method based on potential visual attribute mining as claimed in claim 1, wherein obtaining the set of potential visual attributes of the original feature domain comprises extracting deep residual network (ResNet) visual features from the training image set and optimizing a sparse dictionary learning model to mine the set of potential visual attributes of the original feature domain.
3. The effective zero sample learning method based on potential visual attribute mining as claimed in claim 1, wherein the number of visual dictionary words in the original feature domain sparse dictionary model is 200.
4. The effective zero sample learning method based on potential visual attribute mining as claimed in claim 1, wherein the pairwise visual dictionary model introduces a consistency constraint between the manually defined attribute representation and the potential visual attribute representation, and uses the mapping matrix and the set of potential visual attributes of the original feature domain to guide the learning of the target feature domain.
5. The method of claim 4, wherein the consistency constraint enforces similarity between the manually defined attribute representation and the potential visual attribute representation of the input image features.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910778304.3A CN112418257B (en) | 2019-08-22 | 2019-08-22 | Effective zero sample learning method based on potential visual attribute mining |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910778304.3A CN112418257B (en) | 2019-08-22 | 2019-08-22 | Effective zero sample learning method based on potential visual attribute mining |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112418257A CN112418257A (en) | 2021-02-26 |
CN112418257B true CN112418257B (en) | 2023-04-18 |
Family
ID=74779771
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910778304.3A Active CN112418257B (en) | 2019-08-22 | 2019-08-22 | Effective zero sample learning method based on potential visual attribute mining |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112418257B (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113221814A (en) * | 2021-05-26 | 2021-08-06 | 华瑞新智科技(北京)有限公司 | Road traffic sign identification method, equipment and storage medium |
CN114037879A (en) * | 2021-10-22 | 2022-02-11 | 北京工业大学 | Dictionary learning method and device for zero sample recognition |
CN116363762B (en) * | 2022-12-23 | 2024-09-03 | 南京羽丰视讯科技有限公司 | Living body detection method, training method and device of deep learning model |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019055114A1 (en) * | 2017-09-12 | 2019-03-21 | Hrl Laboratories, Llc | Attribute aware zero shot machine vision system via joint sparse representations |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10074041B2 (en) * | 2015-04-17 | 2018-09-11 | Nec Corporation | Fine-grained image classification by exploring bipartite-graph labels |
CN109643384A (en) * | 2016-08-16 | 2019-04-16 | 诺基亚技术有限公司 | Method and apparatus for zero sample learning |
CN106485271B (en) * | 2016-09-30 | 2019-11-15 | 天津大学 | A kind of zero sample classification method based on multi-modal dictionary learning |
US10908616B2 (en) * | 2017-05-05 | 2021-02-02 | Hrl Laboratories, Llc | Attribute aware zero shot machine vision system via joint sparse representations |
EP3619644A4 (en) * | 2017-05-05 | 2021-01-20 | HRL Laboratories, LLC | Zero shot machine vision system via joint sparse representations |
CN107491788A (en) * | 2017-08-21 | 2017-12-19 | 天津大学 | A kind of zero sample classification method based on dictionary learning |
CN108376267B (en) * | 2018-03-26 | 2021-07-13 | 天津大学 | Zero sample classification method based on class transfer |
- 2019-08-22 CN CN201910778304.3A patent/CN112418257B/en active Active
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019055114A1 (en) * | 2017-09-12 | 2019-03-21 | Hrl Laboratories, Llc | Attribute aware zero shot machine vision system via joint sparse representations |
Also Published As
Publication number | Publication date |
---|---|
CN112418257A (en) | 2021-02-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107766933B (en) | Visualization method for explaining convolutional neural network | |
CN112418257B (en) | Effective zero sample learning method based on potential visual attribute mining | |
CN108595636A (en) | The image search method of cartographical sketching based on depth cross-module state correlation study | |
CN115934990B (en) | Remote sensing image recommendation method based on content understanding | |
CN111680176A (en) | Remote sensing image retrieval method and system based on attention and bidirectional feature fusion | |
CN105389326B (en) | Image labeling method based on weak matching probability typical relevancy models | |
CN110443293A (en) | Based on double zero sample image classification methods for differentiating and generating confrontation network text and reconstructing | |
CN106022254A (en) | Image recognition technology | |
CN113761259A (en) | Image processing method and device and computer equipment | |
CN113269224A (en) | Scene image classification method, system and storage medium | |
CN114926725A (en) | Online financial group partner fraud identification method based on image analysis | |
Adnan et al. | An improved automatic image annotation approach using convolutional neural network-Slantlet transform | |
CN115131313A (en) | Hyperspectral image change detection method and device based on Transformer | |
CN113988147A (en) | Multi-label classification method and device for remote sensing image scene based on graph network, and multi-label retrieval method and device | |
CN110705384B (en) | Vehicle re-identification method based on cross-domain migration enhanced representation | |
Kaur et al. | A methodology for the performance analysis of cluster based image segmentation | |
López-Cifuentes et al. | Attention-based knowledge distillation in scene recognition: the impact of a dct-driven loss | |
CN114187183A (en) | Fine-grained insect image classification method | |
CN112990340B (en) | Self-learning migration method based on feature sharing | |
CN114187546A (en) | Combined action recognition method and system | |
CN113779520A (en) | Cross-space target virtual identity correlation method based on multilayer attribute analysis | |
CN111291787B (en) | Image labeling method based on forward-multi-reverse collaborative sparse representation classifier | |
CN112364193A (en) | Image retrieval-oriented method for fusing multilayer characteristic deep neural network model | |
CN116012840B (en) | Three-dimensional point cloud semantic segmentation labeling method based on active learning and semi-supervision | |
Jafrasteh et al. | Generative adversarial networks as a novel approach for tectonic fault and fracture extraction in high resolution satellite and airborne optical images |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |