CN107220663B - Automatic image annotation method based on semantic scene classification - Google Patents
- Publication number
- CN107220663B CN107220663B CN201710346426.6A CN201710346426A CN107220663B CN 107220663 B CN107220663 B CN 107220663B CN 201710346426 A CN201710346426 A CN 201710346426A CN 107220663 B CN107220663 B CN 107220663B
- Authority
- CN
- China
- Prior art keywords
- scene
- sample
- label
- scenes
- samples
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Links
- 238000000034 method Methods 0.000 title claims abstract description 32
- 238000004422 calculation algorithm Methods 0.000 claims abstract description 35
- 238000012549 training Methods 0.000 claims abstract description 27
- 238000002372 labelling Methods 0.000 claims abstract description 22
- 239000011159 matrix material Substances 0.000 claims abstract description 21
- 238000013507 mapping Methods 0.000 claims abstract description 18
- 238000000354 decomposition reaction Methods 0.000 claims abstract description 5
- 238000012545 processing Methods 0.000 claims description 6
- 238000000605 extraction Methods 0.000 claims description 3
- 239000000126 substance Substances 0.000 claims 2
- 238000004364 calculation method Methods 0.000 claims 1
- 230000000007 visual effect Effects 0.000 claims 1
- 230000000694 effects Effects 0.000 abstract description 8
- 238000004883 computer application Methods 0.000 abstract description 2
- 238000001514 detection method Methods 0.000 description 7
- 238000012360 testing method Methods 0.000 description 6
- 238000013461 design Methods 0.000 description 3
- 230000006870 function Effects 0.000 description 2
- 238000003909 pattern recognition Methods 0.000 description 2
- 238000004458 analytical method Methods 0.000 description 1
- 238000013459 approach Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000007635 classification algorithm Methods 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 238000005314 correlation function Methods 0.000 description 1
- 238000002790 cross-validation Methods 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000010191 image analysis Methods 0.000 description 1
- 238000005457 optimization Methods 0.000 description 1
- 238000011160 research Methods 0.000 description 1
- 238000000926 separation method Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Analysis (AREA)
Abstract
The invention belongs to the fields of computer applications and computer vision, and relates to an automatic image annotation algorithm based on semantic scene classification. The method detects the semantic scene information of labels using a method based on non-negative matrix factorization, maps training-set samples to the corresponding scenes in a probabilistic manner, and uses the scene information of the samples to train a scene classifier based on an extreme learning machine and a differential evolution algorithm. Finally, the scene classifier rapidly maps a sample to be annotated to the sample subset related to its scene, and a KNN-based algorithm completes the annotation within that subset. The invention not only narrows the range in which the nearest samples are searched, improving algorithm efficiency, but also lets the KNN algorithm annotate within a semantically related sample set, reducing noise interference and improving the annotation effect. Because the number of scenes is far smaller than the number of labels, the method also alleviates the problem that model-learning-based methods are unsuitable for data sets with huge label vocabularies.
Description
Technical Field
The invention relates to the fields of computer applications and computer vision, and in particular to an automatic image annotation algorithm based on semantic scene classification.
Background
Managing and retrieving images through image labels is a common, simple and effective approach, but a large number of images with no labels or incomplete labels still exist on the Internet, so designing an effective automatic image annotation and classification algorithm is a key technology for solving this problem. In recent years, the automatic image annotation problem has been widely studied. The main research methods can be divided into two categories: methods based on model learning and methods based on database search.
Methods based on database search directly produce a candidate label sequence from the labels of annotated images in the database, and are simple and effective. The TagProp algorithm (Guillaumin M, Mensink T, Verbeek J, et al. TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation [C]// IEEE International Conference on Computer Vision. IEEE, 2010: 309-316.) designs a metric learning model to obtain a more discriminative feature representation, thereby improving the performance of the KNN method. The 2PKNN algorithm (Verma Y, Jawahar C V. Image annotation using metric learning in semantic neighbourhoods [M]// Computer Vision - ECCV 2012. Springer Berlin Heidelberg, 2012: 836-849.) considers the problem of data-set imbalance and completes annotation on processed, balanced data subsets with a KNN method, improving algorithm efficiency. The NMF-KNN algorithm (Kalayeh M M, Idrees H, Shah M. NMF-KNN: Image Annotation Using Weighted Multi-view Non-negative Matrix Factorization [C]// IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2014: 184-191.) generates a specific generative model for each image to be annotated; this improves the annotation effect, but the algorithm has high complexity and is not suitable for practical application. The SWIM algorithm (Liu H, Li X, Zhang S. Learning Instance Correlation Functions for Multilabel Classification [J]. IEEE Transactions on Cybernetics, 2016, 47(2): 499-510.) considers the mapping relationship between training data and test data and proposes a weighted KNN algorithm for image annotation.
Methods based on database search mainly suffer from two problems: first, they ignore label co-occurrence, which lowers accuracy; second, KNN-based algorithms are inefficient on large-scale databases.
In methods based on model learning, the automatic image annotation problem can be treated as a multi-class classification problem, or as a binary classification problem for each label. The SVIA algorithm (Sun L, Ge H, Yoshida S, et al. Support vector description of clusters for content-based image annotation [J]. Pattern Recognition, 2014, 47(3): 1361-1374.) learns a one-class SVM model for each tag and then, considering the statistical relationships between tags, re-scores the recommended tag sequence with Bayesian reasoning to complete the annotation task. The LDMKL and SDMKL algorithms (Jiu M, Sahbi H. Nonlinear Deep Kernel Learning for Image Annotation [C]// IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2016: 1551-1555.) design a nonlinear deep kernel learning model and adopt a one-vs-rest strategy to learn an independent classifier for each label. Another algorithm (Darwish S M. Combining firefly algorithm and Bayesian classifier: new direction for automatic multilabel image annotation [J]. IET Image Processing, 2016, 10(10): 763-772.) over-segments an image into multiple regions and uses a Bayesian classifier on features extracted from the image regions to realize image annotation. The LIFT algorithm (Zhang M L, Wu L. LIFT: Multi-label learning with label-specific features [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(1): 107-120.) first constructs label-specific features for each class label and then learns a classifier on those features for each label to accomplish the annotation task.
These methods convert the annotation problem into a classification problem by treating labels as classes; when the number of labels in the data set is large, this implies a huge classification output space, so such methods no longer apply. The LCMKL algorithm (Gu Y, Qian X, Li Q, et al. Image Annotation by Latent Community Detection and Multikernel Learning [J]. IEEE Transactions on Image Processing, 2015, 24(11): 3450-3463.) adopts a fast unfolding algorithm to hard-classify the labels and trains MKL classifiers for the different classes, alleviating the problem that model-learning-based algorithms are unsuitable for data sets with many labels; however, the hard classification neglects the case where a label belongs to multiple classes, so the label classification is unreasonable, the sample mapping is inaccurate, and the algorithm effect suffers.
Aiming at the facts that the mapping relationship between labels and semantic scenes is not considered in the image annotation problem, and that existing methods classify labels hard, the invention provides a label semantic scene division method based on non-negative matrix factorization, realizing a probabilistic mapping between labels and semantic scenes. A sample to be annotated is then mapped, through scene classification, to the sample subset related to its scene, and annotation is completed with a KNN (K-Nearest Neighbor) method. Because KNN completes the annotation within the sample set related to the sample's scene, the efficiency of the KNN algorithm is improved, noise interference is reduced, and the annotation effect is improved. Moreover, since the number of scenes is far smaller than the number of labels, the problem that model-learning-based methods are unsuitable for data sets with huge label vocabularies is alleviated.
Disclosure of Invention
The invention provides an automatic image annotation method based on semantic scene classification, aimed at the problems that the mapping relationship between labels and semantic scenes is not considered in the image annotation problem and that existing methods classify labels hard. First, scene detection is performed with an NMF-based (non-negative matrix factorization) method on the label information of the training set, obtaining the probability that each label belongs to each scene. The samples are then mapped probabilistically to the corresponding scenes according to their label information. Next, the obtained scenes are treated as different classes, the training-set samples in each scene are used as training data, and a scene classifier is trained. Finally, the samples in the test set are scene-classified by the trained classifier, and annotation is completed with a KNN method on the training subsets corresponding to the TOP-2 most relevant scenes.
The technical scheme of the invention is as follows:
the embodiment of the invention provides an automatic image annotation method and a framework based on semantic scene classification.
1. Feature extraction.
A variety of different features are extracted from the image, such as Gist (512D), DenseHue (100D), HarrisHue (100D), DenseSift (1000D), HarrisSift (1000D).
2. Constructing a label relation graph, detecting scenes and determining the number of the scenes.
A. Building a tag relationship graph
Establish the relation graph C between labels using formula (1):

$C_{ij} = N(c_i, c_j) / N(c_j)$ (1)

where $N(c_i, c_j)$ denotes the number of samples in the training set labeled with both $c_i$ and $c_j$, and $N(c_j)$ denotes the number of samples labeled with $c_j$. $C_{ij}$ therefore represents the proportion of the samples labeled with $c_j$ that are also labeled with $c_i$.
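As a concrete illustration, formula (1) can be computed in a few lines from a 0/1 label matrix; the function name and the guard for unused labels are assumptions of this sketch, not part of the patent:

```python
import numpy as np

def build_label_graph(Y):
    """Build the label relation graph C of formula (1).

    Y: (n1, m) 0/1 label matrix; Y[s, i] = 1 if sample s carries label i.
    Returns C with C[i, j] = N(c_i, c_j) / N(c_j): the fraction of the
    samples tagged with label j that also carry label i.
    """
    co = Y.T @ Y                       # co[i, j] = N(c_i, c_j); co[j, j] = N(c_j)
    counts = np.diag(co).astype(float).copy()
    counts[counts == 0] = 1.0          # guard against labels used by no sample
    return co / counts[np.newaxis, :]  # divide column j by N(c_j)
```

Note that C is in general asymmetric, since N(c_j) and N(c_i) differ.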
B. Scene detection
According to the relation graph C, establish the non-negative matrix factorization model of formula (2):

$\min_{W \ge 0, S \ge 0} \| C - W S W^T \|_F^2$ (2)

update the model with update rules (3) and (4), and after convergence normalize W using formula (5):

$W S W^T = (W D^{-1})(D S D^T)(W D^{-1})^T$ (5)

where K is the number of latent scenes, m is the number of labels, $W \in R^{m \times K}$ and $S \in R^{K \times K}$; formula (5) introduces a diagonal matrix D between W and S to normalize W, and $W_{ik}$ represents the probability that label i belongs to scene k.
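The patent's update rules (3) and (4) are rendered as images and not reproduced in the text, so the sketch below substitutes the standard multiplicative updates for the symmetric tri-factorization objective $\|C - WSW^T\|_F^2$, followed by the column normalization of formula (5); the function name, iteration count, and choice of D as the diagonal of column sums are assumptions:

```python
import numpy as np

def scene_detect_nmf(C, K, iters=200, eps=1e-9, seed=0):
    """Sketch of scene detection via symmetric tri-factor NMF, C ~ W S W^T.

    Uses standard multiplicative updates for min ||C - W S W^T||_F^2
    (an assumption; the patent's rules (3)-(4) are not shown), then
    normalizes W through a diagonal matrix D as in formula (5).
    """
    rng = np.random.default_rng(seed)
    m = C.shape[0]
    W = rng.random((m, K)) + eps
    S = rng.random((K, K)) + eps
    for _ in range(iters):
        WtW = W.T @ W
        S *= (W.T @ C @ W) / (WtW @ S @ WtW + eps)
        num = C @ W @ S.T + C.T @ W @ S
        den = W @ (S @ WtW @ S.T + S.T @ WtW @ S) + eps
        W *= num / den
    D = np.diag(W.sum(axis=0))      # formula (5): W S W^T = (W D^-1)(D S D^T)(W D^-1)^T
    W_norm = W @ np.linalg.inv(D)   # columns of W_norm sum to 1
    S_norm = D @ S @ D.T
    return W_norm, S_norm
```

Because D is diagonal, $(WD^{-1})(DSD^T)(WD^{-1})^T = WSW^T$ exactly, so the normalization leaves the reconstruction unchanged.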
C. Scene number determination
Run the method described in B with different settings of the scene number K. Convert the obtained W matrix to 0/1 form (in each row of W, set the maximum value to 1 and the others to 0), and calculate the community modularity according to formula (6):

$M = \frac{1}{2\bar{w}} \sum_{i,j} \left( w_{i,j} - \frac{w_i w_j}{2\bar{w}} \right) \phi(node_i, node_j)$ (6)

where $w_{i,j}$ is the connection weight between node i and node j, $w_i = \sum_j w_{i,j}$ is the sum of all weights connected to node i, and $2\bar{w} = \sum_{i,j} w_{i,j}$; $\phi(node_i, node_j)$ equals 1 when node i and node j belong to the same community, and 0 otherwise. The essence of scene detection in the invention can therefore be understood as community detection on the label connection matrix: the modularity value M is computed for each candidate K, and the K that maximizes M is selected as the number of scenes.
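The modularity criterion and the K-selection loop can be sketched as follows; the reconstruction uses the standard weighted Newman modularity (the rendered equation (6) is an image), and `choose_scene_count` and its `detect` callback are hypothetical helper names:

```python
import numpy as np

def modularity(Wadj, labels):
    """Weighted Newman modularity, reconstructed from the term
    definitions in the text (formula (6) itself is an image).

    Wadj: (m, m) symmetric weight matrix; labels: community index per node.
    """
    two_w = Wadj.sum()                         # 2*w_bar: total weight, each edge counted twice
    strength = Wadj.sum(axis=1)                # w_i: sum of weights attached to node i
    same = labels[:, None] == labels[None, :]  # phi(node_i, node_j)
    expected = np.outer(strength, strength) / two_w
    return ((Wadj - expected) * same).sum() / two_w

def choose_scene_count(C, candidates, detect):
    """Run scene detection for each K, binarize W row-wise (max -> 1),
    and keep the K with the largest modularity M."""
    best_K, best_M = None, -np.inf
    for K in candidates:
        W, _ = detect(C, K)
        labels = W.argmax(axis=1)   # 0/1 processing of W: one community per label
        M = modularity(C, labels)
        if M > best_M:
            best_K, best_M = K, M
    return best_K
```

On a graph of two disconnected dyads, splitting into the two natural communities gives the textbook value M = 0.5.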
3. Mapping samples to scenes
Given a training sample $\{x_i, y_i\}$, where $x_i$ is the feature vector of the sample and $y_i \in R^{1 \times m}$ is its label vector ($y_{ik} = 1$ if the sample is labeled with the k-th label), the sample is mapped to each scene according to its label information. The invention assumes that each label of an image acts independently on which scene the image belongs to and, based on this assumption, provides a strategy for mapping samples to scenes: the probability that sample i belongs to scene $S_k$ is calculated by formula (7):

$P_{ik} = \frac{y_i W_k}{\sum_{k'} y_i W_{k'}}$ (7)

where $W_k$ denotes the k-th column of W, calculated in section B of step 2.
Thus, for the training set {X, Y}, the scene information of all samples is obtained using formula (8):

P = V * (Y * W) (8)

where $P \in R^{n1 \times K}$ and $P_{ik}$ is the probability that sample i belongs to scene $S_k$; $V \in R^{n1 \times n1}$ is the diagonal matrix that normalizes the rows of (Y * W), i.e. $V_{ii} = 1/\sum_k (YW)_{ik}$; $Y \in R^{n1 \times m}$; n1 is the number of training-set samples and m is the number of labels. Processing the P matrix into 0/1 form yields $Z \in R^{K \times n1}$ by formula (9):

$Z_{ki} = 1$ if $k = \arg\max_{k'} P_{ik'}$, otherwise $Z_{ki} = 0$ (9)

where $P_i$ is row i of P. The training set thus changes from {X, Y} to {X, Y, Z}, where $Z_{ki} = 1$ means that the scene to which the i-th sample belongs is $S_k$.
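Formulas (8) and (9) amount to a row normalization followed by a per-sample argmax; a minimal sketch (the function name and the zero-row guard are assumptions):

```python
import numpy as np

def map_samples_to_scenes(Y, W):
    """Formulas (8)-(9): soft mapping P = V (Y W), then the 0/1 matrix Z.

    Y: (n1, m) label matrix; W: (m, K) label-to-scene probabilities.
    V is the diagonal matrix that normalizes each row of Y @ W.
    """
    YW = Y @ W
    row_sums = YW.sum(axis=1, keepdims=True)
    row_sums[row_sums == 0] = 1.0     # guard: unlabeled samples keep zero rows
    P = YW / row_sums                 # P[i, k] = probability sample i is in scene S_k
    Z = np.zeros((W.shape[1], Y.shape[0]))          # Z in R^{K x n1}
    Z[P.argmax(axis=1), np.arange(Y.shape[0])] = 1  # formula (9): row-wise argmax
    return P, Z
```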
4. Classifier training
For the multi-feature problem, the invention adopts a weighted voting method based on differential evolution (DE) and the extreme learning machine (ELM), as follows. The classification results of the several features are combined by weighted voting into the final classification result, i.e. formula (10):

$C = \sum_{v=1}^{V} \theta_v C_v$ (10)

where $C_v$ is the classification result of the ELM classifier corresponding to the v-th view feature of the test set, i.e. $C_v = g_{elm}(X_v)$; $\theta_v$ is the weight of the classification result of the v-th feature, with $\theta \in R^{V \times 1}$. The weights $\theta$ are determined by 5-fold cross-validation: the objective function of formula (11) is constructed and optimized with the DE algorithm to solve for $\theta$, where Z is the scene category information obtained in step 3.
5. Labeling the unlabeled image.
Extract the same feature information from the unlabeled image and input it into the classifier trained in step 4 to obtain a classification result. Then run a KNN-based algorithm on the training-set samples in the two most relevant scenes to obtain the predicted labels. The following is the pseudo code of the labeling algorithm used in this step.
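Since the pseudo code itself appears only as a figure and is not reproduced in the text, the following is a sketch of step 5 under stated assumptions: the TOP-2 scene restriction follows the description, while the plain neighbor-count label vote and all parameter names are illustrative:

```python
import numpy as np

def knn_annotate(x, scene_scores, scene_of_sample, X_train, Y_train,
                 k=10, n_labels=5):
    """Sketch of step 5: restrict KNN to the training samples of the
    TOP-2 scenes predicted for x, then rank labels by neighbor votes.

    scene_scores: (K,) scene scores for x from the trained classifier.
    scene_of_sample: (n1,) scene index of each training sample.
    """
    top2 = np.argsort(scene_scores)[-2:]                  # two most relevant scenes
    idx = np.flatnonzero(np.isin(scene_of_sample, top2))  # scene-related sample subset
    d = np.linalg.norm(X_train[idx] - x, axis=1)
    nn = idx[np.argsort(d)[:k]]                           # k nearest within the subset
    votes = Y_train[nn].sum(axis=0)                       # label frequency among neighbors
    return np.argsort(votes)[::-1][:n_labels]             # top-ranked predicted labels
```

Restricting the distance computation to the scene subset is precisely where the efficiency gain over whole-database KNN comes from.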
The invention has the beneficial effects that:
according to the method, the labeling is completed by the KNN in the sample set related to the sample scene, so that the KNN algorithm efficiency is improved, the noise interference is reduced, and the labeling effect is improved. Moreover, the number of scenes is far smaller than that of the labels, so that the problem that a model learning-based method is not suitable for a data set with huge labels is solved.
Drawings
FIG. 1 is a flow chart of the algorithm;
FIG. 2 is a graph of scene detection loss function variation;
FIG. 3 is the modularity variation curve on the benchmark dataset IAPR TC-12;
FIG. 4 shows the effect curves on the benchmark dataset IAPR TC-12 for different numbers of nearest-neighbor samples: (a) accuracy, (b) recall, (c) F1 value, (d) average accuracy;
FIG. 5 shows the effect curves on the benchmark dataset IAPR TC-12 for different numbers of hidden-layer nodes: (a) accuracy, (b) recall, (c) F1 value, (d) average accuracy.
Detailed Description
The specific embodiments discussed are merely illustrative of implementations of the invention and do not limit the scope of the invention. Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
The embodiment of the invention on the benchmark dataset IAPR TC-12 is as follows:
the symbols represent: and a training set { X, Y }, a testing set { X '}, wherein X represents a sample feature matrix, Y represents training set label information, and X' represents a testing set feature matrix.
(1) Feature extraction
The five features provided with the benchmark dataset IAPR TC-12, namely Gist (512D), DenseHue (100D), HarrisHue (100D), DenseSift (1000D) and HarrisSift (1000D), are used as the features {X} of this embodiment.
(2) Scene detection example
Construct the label relation graph C from the label matrix Y according to formula (1), establish the non-negative factorization model of formula (2) from the relation matrix C, update the model with formulas (3) and (4), and normalize the obtained W with formula (5). FIG. 2 shows the loss-value curve of the algorithm on the benchmark dataset IAPR TC-12; FIG. 3 shows the modularity values for K = 3, ..., 15. Based on FIG. 3, K = 9 is selected as the number of scenes for this benchmark, and the W obtained from formula (2) with K = 9 is used as the basis for mapping the samples.
(3) Sample mapping to scene
Using the W obtained in the previous step, map the samples into the corresponding scenes according to formulas (8) and (9), so that the training set becomes {X, Y, Z}, where $Z_{ki} = 1$ means that the scene to which the i-th sample belongs is $S_k$.
(4) Classifier training
Using the {X, Z} obtained in the previous step, train the classifier according to formulas (10) and (11).
Labeling the unlabeled image.
For a test sample $x \in X'$, first find the two most relevant scenes using the classifier trained in step (4), then complete annotation with the KNN algorithm within the sample subsets of those two scenes. FIG. 4 shows the curves of accuracy, recall, F1 value and average accuracy on the benchmark when the number of nearest samples takes the values 10, 20, ..., 150. FIG. 5 shows the effect curves on the benchmark when the number of hidden-layer nodes of the extreme learning machine takes the values 100, 200, ..., 1500.
Claims (1)
1. An automatic image annotation method based on semantic scene classification is characterized by comprising the following steps:
step 1, feature extraction;
extracting visual features of different types from the image;
step 2, constructing a label relation graph, detecting scenes and determining the number of the scenes;
1) construct the label relation graph $C = (C_{ij})$ from the label information in the training set, where $C_{ij}$ denotes the proportion of the samples labeled with label j that are also labeled with label i;
2) according to the label relation graph $C = (C_{ij})$, establish the non-negative matrix factorization model of formula (2), randomly initialize W and S, update the model with update rules (3) and (4), and after convergence normalize W using formula (5);

$W S W^T = (W D^{-1})(D S D^T)(W D^{-1})^T$ (5)

where K is the number of latent scenes, m is the number of labels, and $W \in R^{m \times K}$; formula (5) introduces a diagonal matrix D between W and S to normalize W, and $W_{ik}$ represents the probability that label i belongs to scene k;
3) take different values of K and run the method described in 2); convert the resulting W matrix to 0/1 form (set the maximum value in each row of W to 1 and the others to 0) and calculate the modularity value M using formula (6), where $w_{i,j}$ is the connection weight between node i and node j, $w_i = \sum_j w_{i,j}$ is the sum of all weights connected to node i, and $\phi(node_i, node_j)$ equals 1 when node i and node j belong to the same community, and 0 otherwise; finally, select the K value with the maximum M value as the number of scenes;
step 3, mapping the sample to a scene;
map the samples to the corresponding scenes according to the matrix W corresponding to the scene number K determined in step 2 and the label matrix Y; the specific steps are as follows:
assume that each label of an image acts independently on which scene the image belongs to; based on this assumption, map samples to scenes by calculating with formula (7) the probability that sample i belongs to scene $S_k$, where $y_i \in R^{1 \times m}$ is the label vector of the sample ($y_{ik} = 1$ if the sample is labeled with the k-th label) and $W_k$ is the k-th column of W, obtained from formula (2); thus, for the training set {X, Y}, the scene information of all samples is obtained using formula (8):
P=V*(Y*W) (8)
where $P \in R^{n1 \times K}$ and $P_{ik}$ denotes the probability that sample i belongs to the k-th scene; $V \in R^{n1 \times n1}$ is the diagonal matrix that normalizes the rows of (Y * W); $Y \in R^{n1 \times m}$; n1 is the number of training-set samples and m is the number of labels; processing the P matrix into 0/1 form yields $Z \in R^{K \times n1}$ by formula (9), where $P_i$ is row i of P; the training set changes from {X, Y} to {X, Y, Z}, where $Z_{ki} = 1$ means that the scene to which the i-th sample belongs is $S_k$;
Step 4, training a classifier;
regard the scenes to which the samples obtained in step 3 belong as different categories, take the training samples in each scene as training data, and train a classifier based on the differential evolution algorithm and the extreme learning machine;
step 5, labeling the unlabeled images;
extract the same feature information from the unlabeled image and input it into the classifier trained in step 4 to obtain a classification result; according to the classification result, run a KNN-based algorithm on the training-set samples in the two most relevant scenes to obtain the predicted labels.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710346426.6A CN107220663B (en) | 2017-05-17 | 2017-05-17 | Automatic image annotation method based on semantic scene classification |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710346426.6A CN107220663B (en) | 2017-05-17 | 2017-05-17 | Automatic image annotation method based on semantic scene classification |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107220663A CN107220663A (en) | 2017-09-29 |
CN107220663B true CN107220663B (en) | 2020-05-19 |
Family
ID=59944882
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710346426.6A Expired - Fee Related CN107220663B (en) | 2017-05-17 | 2017-05-17 | Automatic image annotation method based on semantic scene classification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107220663B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108805172A (en) * | 2018-05-08 | 2018-11-13 | 重庆瑞景信息科技有限公司 | A kind of blind evaluation method of image efficiency of object-oriented |
CN108932950B (en) * | 2018-05-18 | 2021-07-09 | 华南师范大学 | Sound scene identification method based on label amplification and multi-spectral diagram fusion |
CN108830466A (en) * | 2018-05-31 | 2018-11-16 | 长春博立电子科技有限公司 | A kind of image content semanteme marking system and method based on cloud platform |
CN109063163B (en) * | 2018-08-14 | 2022-12-02 | 腾讯科技(深圳)有限公司 | Music recommendation method, device, terminal equipment and medium |
CN110321952B (en) * | 2019-07-02 | 2024-02-09 | 腾讯医疗健康(深圳)有限公司 | Training method of image classification model and related equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104732248A (en) * | 2015-03-24 | 2015-06-24 | 电子科技大学 | Human body target detection method based on Omega shape features |
CN105787045A (en) * | 2016-02-26 | 2016-07-20 | 清华大学 | Precision enhancing method for visual media semantic indexing |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7809192B2 (en) * | 2005-05-09 | 2010-10-05 | Like.Com | System and method for recognizing objects from images and identifying relevancy amongst images and information |
US7783135B2 (en) * | 2005-05-09 | 2010-08-24 | Like.Com | System and method for providing objectified image renderings using recognition information from images |
-
2017
- 2017-05-17 CN CN201710346426.6A patent/CN107220663B/en not_active Expired - Fee Related
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104732248A (en) * | 2015-03-24 | 2015-06-24 | 电子科技大学 | Human body target detection method based on Omega shape features |
CN105787045A (en) * | 2016-02-26 | 2016-07-20 | 清华大学 | Precision enhancing method for visual media semantic indexing |
Also Published As
Publication number | Publication date |
---|---|
CN107220663A (en) | 2017-09-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107220663B (en) | Automatic image annotation method based on semantic scene classification | |
CN110413924B (en) | Webpage classification method for semi-supervised multi-view learning | |
CN1197025C (en) | Enhancing knowledge discovery from multiple data sets using multiple support vector machines | |
CN110942091B (en) | Semi-supervised few-sample image classification method for searching reliable abnormal data center | |
CN104573669A (en) | Image object detection method | |
CN114048568B (en) | Rotary machine fault diagnosis method based on multisource migration fusion shrinkage framework | |
CN110647907B (en) | Multi-label image classification algorithm using multi-layer classification and dictionary learning | |
CN110728694B (en) | Long-time visual target tracking method based on continuous learning | |
CN113706547B (en) | Unsupervised domain adaptive semantic segmentation method based on category dissimilarity guidance | |
CN113609569B (en) | Distinguishing type generalized zero sample learning fault diagnosis method | |
CN112229632B (en) | Bearing fault diagnosis method based on sensitive feature transfer learning | |
CN109933619A (en) | A kind of semisupervised classification prediction technique | |
CN116051479A (en) | Textile defect identification method integrating cross-domain migration and anomaly detection | |
CN114897085A (en) | Clustering method based on closed subgraph link prediction and computer equipment | |
Yu et al. | A universal transfer network for machinery fault diagnosis | |
Ghanmi et al. | Table detection in handwritten chemistry documents using conditional random fields | |
CN117297606A (en) | Emotion recognition method and device, electronic equipment and storage medium | |
CN115392474B (en) | Local perception graph representation learning method based on iterative optimization | |
Jakubik et al. | Instance selection mechanisms for human-in-the-loop systems in few-shot learning | |
JP7214822B1 (en) | CAM-based weakly supervised learning object detection apparatus and method | |
CN113158878B (en) | Heterogeneous migration fault diagnosis method, system and model based on subspace | |
Wang et al. | Human reading knowledge inspired text line extraction | |
CN115310491A (en) | Class-imbalance magnetic resonance whole brain data classification method based on deep learning | |
CN114022698A (en) | Multi-tag behavior identification method and device based on binary tree structure | |
Su et al. | Deep supervised hashing with hard example pairs optimization for image retrieval |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee |
Granted publication date: 20200519 |