CN104036021A - Method for semantically annotating images on basis of hybrid generative and discriminative learning models - Google Patents
- Publication number
- CN104036021A CN104036021A CN201410295467.3A CN201410295467A CN104036021A CN 104036021 A CN104036021 A CN 104036021A CN 201410295467 A CN201410295467 A CN 201410295467A CN 104036021 A CN104036021 A CN 104036021A
- Authority
- CN
- China
- Prior art keywords
- semantic
- image
- value
- vector
- test image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/50—Information retrieval; Database structures therefor; File system structures therefor of still image data
- G06F16/58—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/5866—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, manually generated location and time information
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2132—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on discrimination criteria, e.g. discriminant analysis
Abstract
The invention discloses a method for semantically annotating images based on hybrid generative and discriminative learning models. In the generative learning stage, the method models the images generatively with continuous PLSA (probabilistic latent semantic analysis), obtaining the model parameters and the topic distribution of each image, which serves as the intermediate representation vector of the image. In the discriminative learning stage, it constructs an ensemble of classifier chains that learns discriminatively from these intermediate representation vectors, integrating contextual information among the annotation keywords as the chains are built. In the annotation stage, the visual features of a given unknown image are extracted automatically, its topic vector representation is obtained with the parameter estimation algorithm of continuous PLSA, the topic vector is classified with the trained ensemble of classifier chains, and the image is annotated with the several semantic keywords of highest confidence. The annotation and retrieval performance of the method is superior to that of most current typical automatic image annotation methods.
Description
Technical field
The present invention relates to the field of image retrieval, and specifically to a method for semantic image annotation based on hybrid generative and discriminative learning models.
Background technology
According to the machine learning methods they employ, existing automatic image annotation methods fall broadly into two categories: methods based on generative models and methods based on discriminative models.
Methods based on generative models first learn the joint probability of image features and keywords, then use Bayes' rule to compute the posterior probability of each keyword given the image features, and annotate the image according to these posteriors. Such methods have a scalable training process and place low demands on the quality of the manual annotations of the training set.
Methods based on discriminative models assume that the mapping from image features to keywords is some parameterized function, learn the parameters of this function directly from the training data, and obtain a classifier for each semantic concept. These methods treat each semantic concept as an independent class; they generally achieve higher annotation precision, but cannot easily exploit domain-specific prior knowledge.
The probabilistic graphical models of the two method families are shown in Fig. 1. Comparing them reveals three main differences: (1) discriminative methods treat images as training data and semantic concepts as classes, aiming to sort images into the semantic classes, whereas generative methods treat both images and text as training data, aiming to learn the association between images and text; (2) discriminative methods train one classifier per semantic concept, whereas generative methods learn a single correlation model and apply it to all semantic concepts; (3) the independence assumptions differ: discriminative methods assume the semantic classes are mutually independent, whereas generative methods assume that visual elements and text elements are conditionally independent given the latent variables.
In summary, generative and discriminative models each have their own advantages and shortcomings.
Summary of the invention
To address the "semantic gap" in image retrieval and the respective shortcomings of generative and discriminative models, the present invention provides a method for semantic image annotation based on hybrid generative and discriminative learning models. On the basis of continuous probabilistic latent semantic analysis and multi-label learning, it proposes the hybrid generative/discriminative automatic image annotation model HGDM (hybrid generative/discriminative model), and further realizes keyword-based semantic image retrieval.
To solve the above problems, the present invention is achieved by the following technical solution:
The method for semantic image annotation based on hybrid generative and discriminative learning models comprises the following steps:
(1) Training on the training images:
(1.1) Model the visual features of the training images with continuous probabilistic latent semantic analysis (PLSA), obtaining the Gaussian distribution parameters μ_k and Σ_k of each topic z_k, together with the topic vector P(z_k | d_i) of each training image;
(1.2) Using the topic vector P(z_k | d_i) of each training image and its original semantic annotations, construct classifier chains with a multi-label learning method;
(2) Annotating a test image:
(2.1) Using the Gaussian distribution parameters μ_k and Σ_k obtained in step (1.1) and the visual features of the test image, compute the topic vector P(z_k | d_new) of each test image with the expectation maximization (EM) method;
(2.2) Using the classifier chains obtained in step (1.2), classify the topic vector P(z_k | d_new) to obtain the semantic classes of the test image;
(2.3) Take the X semantic classes with the highest confidence as the semantic annotation of the test image, where the parameter X is preset manually.
Step (1.2) constructs the classifier chains. The training process of a classifier chain is as follows: following a specified label order, each iteration learns the binary classifier associated with one semantic keyword label, and each iteration appends the label information of the binary classifiers already learned, thereby constructing a chain of binary classifiers. Each binary classifier C_j in the chain is responsible for learning and predicting the semantic keyword label l_j, for j = 1, 2, ..., |L|, where |L| is the number of semantic keywords.
Step (2.2) is the semantic classification process. Classification with a classifier chain proceeds as follows: starting from binary classifier C_1, predictions are propagated along the chain of binary classifiers built in the training process. Classifier C_1 determines the classification result Pr(l_1 | x) for label l_1; this result Pr(l_1 | x) is appended in binary form to the topic vector of the test image, and so on, so that each subsequent binary classifier C_j determines the classification result Pr(l_j | x, l_1, l_2, ..., l_(j-1)) for label l_j, where x is the topic vector. Here j = 1, 2, ..., |L|, and |L| is the number of semantic keywords.
Steps (1.1) and (2.1) further include a process of extracting visual features from the training and test images:
First, each image is divided into m × n regular squares;
Then, an (a+b)-dimensional feature vector is extracted for each square, comprising an a-dimensional color feature and a b-dimensional texture feature; the color feature is a color autocorrelogram computed over quantized colors and city-block distances, and the texture feature consists of Gabor energy coefficients computed over several scales and orientations;
Finally, the visual feature of each image is the set of its m × n feature vectors of dimension (a+b);
The parameters m, n, a and b are preset manually.
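As an illustration only, the grid feature extraction above can be sketched as follows. The color autocorrelogram and Gabor energies are replaced by deliberately simple stand-ins (grey-level histograms and orientation-binned gradient energies), so the function and its defaults are assumptions, not the patent's implementation:

```python
import numpy as np

def extract_grid_features(image, m=4, n=4, a=6, b=6):
    """Split an image into m x n regular squares and compute an
    (a+b)-dim feature vector per square: 'a' color dims and 'b'
    texture dims. Stand-ins only: grey-level histogram for color,
    orientation-binned gradient energy for texture."""
    h, w, _ = image.shape
    bh, bw = h // m, w // n
    features = []
    for i in range(m):
        for j in range(n):
            block = image[i*bh:(i+1)*bh, j*bw:(j+1)*bw]
            grey = block.mean(axis=2)
            # color stand-in: a-bin grey-level histogram
            color, _ = np.histogram(grey, bins=a, range=(0, 255), density=True)
            # texture stand-in: gradient energy binned into b orientations
            gy, gx = np.gradient(grey)
            angle = np.arctan2(gy, gx)
            mag = np.hypot(gx, gy)
            texture = np.array(
                [mag[(angle >= lo) & (angle < lo + 2*np.pi/b)].sum()
                 for lo in np.linspace(-np.pi, np.pi, b, endpoint=False)])
            texture /= texture.sum() + 1e-12
            features.append(np.concatenate([color, texture]))
    return np.array(features)          # shape: (m*n, a+b), the "feature bag"
```

Each image is thus reduced to an (m·n) × (a+b) matrix of region features, the "bag of features" handed to the topic model.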
Compared with the prior art, the present invention integrates generative and discriminative models in the learning process: a generative model learns the visual features of the input images, and a discriminative model handles the semantic learning process. This yields the following features:
(1) In the generative learning stage, continuous PLSA models the image visual features directly, with no feature quantization step, so no important visual information is lost.
(2) Continuous PLSA transforms each image from a set of features into a K-dimensional topic vector, which can also be viewed as dimensionality reduction. This topic vector representation also integrates the latent semantic information of the visual content, which is significant for semantic image retrieval.
(3) The classifier is built with a multi-label learning method, so the associations among annotation keywords are integrated when images are classified. This handles the weak-labeling problem well and scales with the size of the training set.
(4) A discriminative ensemble of classifier chains performs the semantic classification, with each binary classifier built on a support vector machine (SVM), so both runtime efficiency and classification precision are high.
Brief description of the drawings
Fig. 1 shows the probabilistic graphical model representations of the two classes of automatic image annotation methods: (a) methods based on discriminative models; (b) methods based on generative models.
Fig. 2 shows the automatic image annotation framework of the hybrid generative and discriminative model.
Embodiment
An automatic image annotation method hybridizing generative and discriminative models. In the generative learning stage, continuous PLSA models the images generatively, making full use of the prior knowledge in the training set and yielding the model parameters and the topic distribution of each image. With this topic distribution as the intermediate representation vector of each image, the automatic image annotation problem is converted into a classification problem amenable to multi-label learning, so that annotation precision higher than that of a generative model alone can be obtained. In the discriminative learning stage, an ensemble of classifier chains learns discriminatively from the intermediate representation vectors, and the contextual information among annotation keywords is integrated as the chains are built; the associations among image annotations are thus taken into account during classification, yielding higher annotation precision and better retrieval performance. In the annotation stage, for a given unknown image, its topic vector representation is obtained by automatic visual feature extraction and the parameter estimation algorithm of continuous PLSA; the trained ensemble of classifier chains then classifies this topic vector; finally, the semantic keywords with the highest confidence are taken as the semantic annotation of the image.
The learning and annotation framework of the hybrid generative and discriminative model (HGDM) is shown in Fig. 2.
The training process on the training images is divided into two steps. First, the visual features of the training images are modeled with continuous PLSA, obtaining the Gaussian distribution parameters μ_k and Σ_k of each topic z_k and the topic distribution P(z_k | d_i) of each training image; this is a generative learning process. The Gaussian parameters μ_k and Σ_k obtained here are the parameters of continuous PLSA; by the independence assumption of continuous PLSA, these parameters remain valid for images outside the training set, whereas the topic distribution P(z_k | d_i) pertains only to each training image itself and provides no prior information for test images. However, this topic distribution lets each training image be represented as a K-dimensional topic vector (K being the number of latent topics), and the space formed by these vectors is a simplex. Second, a classifier is built from the topic vector representation of each training image together with its original annotations, with each class corresponding to a semantic class of the text vocabulary; this is a discriminative learning process. Since every image is now represented by a single topic vector but corresponds to multiple keyword labels, the situation matches multi-label learning, so a multi-label learning method is adopted to construct the multi-class classifier while also integrating the associations among keywords.
The annotation process for a test image is likewise divided into two steps: (1) first, using the model parameters μ_k and Σ_k obtained in the training stage together with the visual features of the test image, the expectation maximization (EM) algorithm computes the topic vector P(z_k | d_new) of each test image; (2) then, the trained classifier classifies this topic vector, and the semantic classes with the highest confidence are taken as the semantic annotation of the test image.
The visual feature extraction method first divides every image in the data set into regular squares (the square size is fixed at 16 × 16 via the validation set), then extracts a 36-dimensional feature vector for each square, comprising a 24-dimensional color feature and a 12-dimensional texture feature. The color feature is a color autocorrelogram computed over 8 quantized colors and 3 city-block distances, and the texture feature consists of Gabor energy coefficients computed over 3 scales and 4 orientations. Each square is thus represented by a 36-dimensional feature vector, and each image by a "bag of features", i.e. a set of 36-dimensional visual feature vectors, which provides a consistent interface for further modeling with topic models.
In the generative learning stage, the setting of the number of topics of continuous PLSA is important, because it determines the dimension of the intermediate representation of the image: too many topics reduce the efficiency of the system, too few lose image information. Because fitting continuous PLSA is rather time-consuming, five topic numbers (90, 120, 150, 180 and 210) were tried; the experimental results show that system performance is best with 180 topics, so the final number of topics is fixed at 180.
In the discriminative learning stage, HGDM performs multi-label classification with the ensemble-of-classifier-chains method from multi-label learning, with each binary classifier implemented as a support vector machine (SVM). This method takes the correlations among labels into account while keeping an acceptable computational complexity.
Like the binary relevance (BR) method, a classifier chain (CC) comprises |L| binary classifiers, each handling the binary relevance problem of one label. Unlike BR, however, these binary classifiers are linked into a chain, and the feature space of each node includes the class labels of the preceding nodes.
The training process of a classifier chain is shown in Table 1, where a training sample is written (x, S): S is the set of semantic keywords annotating the training image, L is the set of all semantic keywords, the elements of S are represented as a binary vector over the keyword labels l_j (l_1, l_2, ..., l_|L|), and x is the topic vector. Following a specified label order, each iteration of the algorithm learns the binary classifier associated with one label; more importantly, each iteration appends to the feature space the label information of the binary classifiers already learned, so the feature information is continually enriched. The result is a chain of binary classifiers in which each classifier C_j is responsible for learning and predicting label l_j, one binary classifier per semantic keyword, j = 1, 2, ..., |L|.
The classification process of a classifier chain is shown in Table 2. Starting from binary classifier C_1, predictions are propagated along the chain: C_1 determines the classification result Pr(l_1 | x) for label l_1, this result is appended in binary form to the features of the test sample, and each subsequent classifier then determines the result Pr(l_j | x, l_1, l_2, ..., l_(j-1)) for label l_j.
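The training (Table 1) and classification (Table 2) procedures above can be sketched as follows. The nearest-centroid base classifier is a deliberately minimal stand-in for the SVM that HGDM uses at each chain node, and all names here are illustrative:

```python
import numpy as np

class CentroidBinary:
    """Minimal base binary classifier (nearest class centroid);
    HGDM uses an SVM at each chain node instead."""
    def fit(self, X, y):
        self.c0 = X[y == 0].mean(axis=0)
        self.c1 = X[y == 1].mean(axis=0)
        return self

    def predict(self, X):
        d0 = ((X - self.c0) ** 2).sum(axis=1)
        d1 = ((X - self.c1) ** 2).sum(axis=1)
        return (d1 < d0).astype(float)

def train_chain(X, Y, order):
    """Table 1: in the given label order, learn one binary classifier
    per label, appending each true-label column to the feature space
    before the next classifier is trained."""
    chain, Xa = [], X
    for j in order:
        chain.append((j, CentroidBinary().fit(Xa, Y[:, j])))
        Xa = np.hstack([Xa, Y[:, j:j + 1]])
    return chain

def predict_chain(chain, X):
    """Table 2: starting from C_1, propagate each binary prediction by
    appending it to the features seen by the next classifier."""
    Xa, out = X, np.zeros((X.shape[0], len(chain)))
    for j, clf in chain:
        y = clf.predict(Xa)
        out[:, j] = y
        Xa = np.hstack([Xa, y[:, None]])
    return out
```

Because each node sees the labels predicted earlier in the chain, correlated keywords reinforce each other, which is exactly the label-dependence information BR discards.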
Use the method for chain can between sorter, transmit label information, consider the related information between mark simultaneously, thereby can overcome the mark question of independence in BR method.And classifier chains still keeps the advantage of BR method, comprise that storage demand is low and operational efficiency is high.
Although on average each instance gains |L|/2 extra feature dimensions, |L| is usually a bounded value in practice, so the resulting extra computational cost is almost negligible. The computational complexity of a classifier chain is very close to that of BR, depending on the number of labels and the complexity of the base binary classifier. The complexity of BR is O(|L| × f(|X|, |D|)), where f(|X|, |D|) is the complexity of the base binary classifier; the complexity of a classifier chain is O(|L| × f(|X| + |L|, |D|)), i.e. with |L| additional feature dimensions. Since HGDM uses SVMs as base binary classifiers, the complexity of the chain reduces to O(|L| × |X| × |D| + |L| × |L| × |D|). As long as |L| < |X|, the first term dominates, so the complexity of the chain is O(|L| × |X| × |D|), identical to that of BR; only when |L| > |X| does the chain become more expensive than BR.
In addition, although the chaining process means that a classifier chain cannot be parallelized, it can be serialized: at any time only one binary classifier needs to be kept in memory, which is an obvious advantage over competing methods.
The order of a classifier chain clearly affects its precision. Although heuristic algorithms exist for choosing the chain order, we instead adopt an ensemble framework to address this problem. The ensemble approach improves overall precision, avoids overfitting, and also allows parallelization. "Ensemble" here means an ensemble of multi-label methods, namely an ensemble of classifier chains.
The ensemble of classifier chains trains M chains C_1, C_2, ..., C_M, each with a random chain order on a random subset of the training set. Every model C_k is therefore different and yields a different multi-label classification result. These results are summed per label, so each label receives some number of votes; selecting the labels whose vote count exceeds a threshold yields a multi-label set, which is taken as the final classification result.
Let the prediction of the k-th individual model be the vector y_k = (l_1, l_2, ..., l_|L|) ∈ {0,1}^|L|. Summing over all models gives the vector W = (λ_1, λ_2, ..., λ_|L|) ∈ R^|L|, where each λ_j is the sum of the j-th components of the y_k; each λ_j thus represents the number of votes received by label l_j ∈ L. Normalizing W to W_norm gives each label a confidence in [0, 1]; after thresholding, the labels can be ranked according to this distribution. As in other automatic image annotation models, HGDM takes the 5 keyword labels with the highest confidence as the semantic annotation of an image.
On the Corel5k image database, the method builds an ensemble of 90 classifier chains, each trained on a randomly chosen subset of 500 images; in the experiments on the IAPR-TC12 and MIRFLICKR25000 data sets, an ensemble of 150 classifier chains is used, each trained on a randomly chosen subset of 1000 images. In addition, the binary classifier at each node of a chain is implemented with the LIBSVM package, using the RBF kernel K(x, x') = exp(−γ||x − x'||²); the corresponding parameters are determined by grid search as (C, γ) = (2⁷, 2¹), where C is the error penalty factor and γ is the kernel parameter.
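As a rough sketch of such a grid search over (C, γ), with an RBF kernel ridge classifier standing in for the LIBSVM SVM (the solver, not the search procedure, is the assumption here):

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """K(x, x') = exp(-gamma * ||x - x'||^2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def fit_krr(X, y, C, gamma):
    """Kernel ridge stand-in for a binary SVM with labels in {-1, +1};
    larger C means weaker regularization, loosely like SVM's C."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + np.eye(len(X)) / C, y)

def grid_search(X, y, Xv, yv, Cs, gammas):
    """Pick the (C, gamma) pair maximizing validation accuracy,
    mirroring the grid search that settles on (2**7, 2**1)."""
    best, best_acc = None, -1.0
    for C in Cs:
        for g in gammas:
            alpha = fit_krr(X, y, C, g)
            pred = np.sign(rbf_kernel(Xv, X, g) @ alpha)
            acc = (pred == yv).mean()
            if acc > best_acc:
                best, best_acc = (C, g), acc
    return best, best_acc
```

In practice one would search a logarithmic grid such as C ∈ {2⁻⁵, ..., 2¹⁵} and γ ∈ {2⁻¹⁵, ..., 2³}, holding out a validation split as the patent does.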
Through careful design of the learning framework, the proposed method for semantic image annotation based on hybrid generative and discriminative learning models effectively combines generative and discriminative learning and inherits their respective advantages, achieving better performance. The experimental results show that the method both makes full use of the training data, like a generative model, and attains high classification precision, like a discriminative model; its annotation and retrieval performance surpasses that of most current typical automatic image annotation methods.
Claims (7)
1. A method for semantic image annotation based on hybrid generative and discriminative learning models, characterized by comprising the steps of:
(1) training on the training images:
(1.1) modeling the visual features of the training images with continuous probabilistic latent semantic analysis, obtaining the Gaussian distribution parameters μ_k and Σ_k of each topic z_k and the topic vector P(z_k | d_i) of each training image;
(1.2) constructing classifier chains with a multi-label learning method from the topic vector P(z_k | d_i) of each training image and its original semantic annotations;
(2) annotating a test image:
(2.1) computing the topic vector P(z_k | d_new) of each test image with the expectation maximization (EM) method, using the Gaussian distribution parameters μ_k and Σ_k obtained in step (1.1) and the visual features of the test image;
(2.2) classifying the topic vector P(z_k | d_new) with the classifier chains obtained in step (1.2) to obtain the semantic classes of the test image;
(2.3) taking the X semantic classes with the highest confidence as the semantic annotation of the test image, where the parameter X is preset manually.
2. The method for semantic image annotation based on hybrid generative and discriminative learning models according to claim 1, characterized in that step (1.2) is specifically: following a specified label order, each iteration learns the binary classifier associated with one semantic keyword label, and each iteration appends the label information of the binary classifiers already learned, thereby constructing a chain of binary classifiers; each binary classifier C_j in the chain is responsible for learning and predicting the semantic keyword label l_j, j = 1, 2, ..., |L|, where |L| is the number of semantic keywords.
3. The method for semantic image annotation based on hybrid generative and discriminative learning models according to claim 2, characterized in that step (2.2) is specifically: starting from binary classifier C_1 of the chain of binary classifiers constructed in step (1.2), predictions are propagated along the chain; classifier C_1 determines the classification result Pr(l_1 | x) for the semantic keyword label l_1; this result Pr(l_1 | x) is appended in binary form to the topic vector of the test image, and so on, so that each subsequent binary classifier C_j determines the classification result Pr(l_j | x, l_1, l_2, ..., l_(j-1)) for label l_j, where x is the topic vector; j = 1, 2, ..., |L|, and |L| is the number of semantic keywords.
4. The method for semantic image annotation based on hybrid generative and discriminative learning models according to any one of claims 1 to 3, characterized in that steps (1.1) and (2.1) further comprise a process of extracting visual features from the training and test images:
first, each image is divided into m × n regular squares;
then, an (a+b)-dimensional feature vector is extracted for each square, comprising an a-dimensional color feature and a b-dimensional texture feature, wherein the color feature is a color autocorrelogram computed over quantized colors and city-block distances, and the texture feature consists of Gabor energy coefficients computed over several scales and orientations;
finally, the visual feature of each image is the set of its m × n feature vectors of dimension (a+b);
the parameters m, n, a and b are preset manually.
5. The method for semantic image annotation based on hybrid generative and discriminative learning models according to claim 4, characterized in that the parameters m and n are both set to 16, parameter a is set to 24, and parameter b is set to 12; that is, each image is divided into 16 × 16 regular squares, and a 36-dimensional feature vector, comprising a 24-dimensional color feature and a 12-dimensional texture feature, is extracted for each square.
6. The method for semantic image annotation based on hybrid generative and discriminative learning models according to claim 1, characterized in that in step (1.1) the number of topics of the continuous probabilistic latent semantic analysis is set to 180.
7. The method for semantic image annotation based on hybrid generative and discriminative learning models according to claim 1, characterized in that in step (2.3) the parameter X is set to 5, i.e. the 5 semantic classes with the highest confidence are taken as the semantic annotation of the test image.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410295467.3A CN104036021A (en) | 2014-06-26 | 2014-06-26 | Method for semantically annotating images on basis of hybrid generative and discriminative learning models |
Publications (1)
Publication Number | Publication Date |
---|---|
CN104036021A true CN104036021A (en) | 2014-09-10 |
Family
ID=51466791
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410295467.3A Pending CN104036021A (en) | 2014-06-26 | 2014-06-26 | Method for semantically annotating images on basis of hybrid generative and discriminative learning models |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104036021A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105760365A (en) * | 2016-03-14 | 2016-07-13 | 云南大学 | Probability latent parameter estimation model of image semantic data based on Bayesian algorithm |
WO2017166137A1 (en) * | 2016-03-30 | 2017-10-05 | 中国科学院自动化研究所 | Method for multi-task deep learning-based aesthetic quality assessment on natural image |
CN107644235A (en) * | 2017-10-24 | 2018-01-30 | 广西师范大学 | Image automatic annotation method based on semi-supervised learning |
CN107851174A (en) * | 2015-07-08 | 2018-03-27 | 北京市商汤科技开发有限公司 | The apparatus and method of linguistic indexing of pictures |
WO2019055114A1 (en) * | 2017-09-12 | 2019-03-21 | Hrl Laboratories, Llc | Attribute aware zero shot machine vision system via joint sparse representations |
US10908616B2 (en) | 2017-05-05 | 2021-02-02 | Hrl Laboratories, Llc | Attribute aware zero shot machine vision system via joint sparse representations |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102542067A (en) * | 2012-01-06 | 2012-07-04 | 上海交通大学 | Automatic image semantic annotation method based on scale learning and correlated label dissemination |
CN103336969A (en) * | 2013-05-31 | 2013-10-02 | 中国科学院自动化研究所 | Image meaning parsing method based on soft glance learning |
- 2014-06-26: application CN201410295467.3A filed in China; published as CN104036021A, status pending
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102542067A (en) * | 2012-01-06 | 2012-07-04 | 上海交通大学 | Automatic image semantic annotation method based on scale learning and correlated label dissemination |
CN103336969A (en) * | 2013-05-31 | 2013-10-02 | Institute of Automation, Chinese Academy of Sciences | Image semantic parsing method based on soft glance learning |
Non-Patent Citations (1)
Title |
---|
ZHIXIN LI et al.: "Learning semantic concepts from image database with hybrid generative/discriminative approach", Engineering Applications of Artificial Intelligence * |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107851174A (en) * | 2015-07-08 | 2018-03-27 | Beijing SenseTime Technology Development Co., Ltd. | Apparatus and method for semantic annotation of images |
CN107851174B (en) * | 2015-07-08 | 2021-06-01 | Beijing SenseTime Technology Development Co., Ltd. | Image semantic annotation device and method, and generation method and system of image semantic annotation model |
CN105760365A (en) * | 2016-03-14 | 2016-07-13 | Yunnan University | Probabilistic latent parameter estimation model for image semantic data based on Bayesian algorithm |
WO2017166137A1 (en) * | 2016-03-30 | 2017-10-05 | Institute of Automation, Chinese Academy of Sciences | Method for multi-task deep learning-based aesthetic quality assessment of natural images |
US10685434B2 (en) | 2016-03-30 | 2020-06-16 | Institute Of Automation, Chinese Academy Of Sciences | Method for assessing aesthetic quality of natural image based on multi-task deep learning |
US10908616B2 (en) | 2017-05-05 | 2021-02-02 | Hrl Laboratories, Llc | Attribute aware zero shot machine vision system via joint sparse representations |
WO2019055114A1 (en) * | 2017-09-12 | 2019-03-21 | Hrl Laboratories, Llc | Attribute aware zero shot machine vision system via joint sparse representations |
CN107644235A (en) * | 2017-10-24 | 2018-01-30 | Guangxi Normal University | Automatic image annotation method based on semi-supervised learning |
Similar Documents
Publication | Title |
---|---|
Yu et al. | Hierarchical deep click feature prediction for fine-grained image recognition |
Sener et al. | Learning transferrable representations for unsupervised domain adaptation |
US9190026B2 (en) | Systems and methods for feature fusion |
Cao et al. | Spatially coherent latent topic model for concurrent segmentation and classification of objects and scenes |
Fang et al. | Unbiased metric learning: On the utilization of multiple datasets and web images for softening bias |
Cao et al. | Spatially coherent latent topic model for concurrent object segmentation and classification |
Grauman et al. | Learning a tree of metrics with disjoint visual features |
Moon et al. | Multimodal transfer deep learning with applications in audio-visual recognition |
CN104036021A (en) | Method for semantically annotating images on basis of hybrid generative and discriminative learning models |
CN105808752B (en) | Automatic image annotation method based on CCA and 2PKNN |
CN105205096A (en) | Cross-modal data retrieval method for text and image modalities |
CN106202256A (en) | Web image search method based on semantic propagation and hybrid multi-instance learning |
Lim et al. | Context by region ancestry |
CN105184298A (en) | Image classification method based on fast locality-constrained low-rank coding |
US20150131899A1 (en) | Devices, systems, and methods for learning a discriminant image representation |
Abdul-Rashid et al. | SHREC'18 track: 2D image-based 3D scene retrieval |
CN104281572A (en) | Target matching method and system based on mutual information |
Fidler et al. | A coarse-to-fine taxonomy of constellations for fast multi-class object detection |
Wang et al. | Improved object categorization and detection using comparative object similarity |
Xu et al. | Transductive visual-semantic embedding for zero-shot learning |
Zhou et al. | Classify multi-label images via improved CNN model with adversarial network |
CN103942214B (en) | Natural image classification method and device based on multi-modal matrix completion |
Chen et al. | RRGCCAN: Re-ranking via graph convolution channel attention network for person re-identification |
Ye et al. | Practice makes perfect: An adaptive active learning framework for image classification |
CN105117735A (en) | Image detection method in big data environment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20140910 |