CN108984726A - Method for annotating images with titles based on an extended sLDA model - Google Patents

Method for annotating images with titles based on an extended sLDA model

Info

Publication number
CN108984726A
CN108984726A
Authority
CN
China
Prior art keywords
parameter
image
slda
formula
model
Prior art date
Legal status
Granted
Application number
CN201810759844.2A
Other languages
Chinese (zh)
Other versions
CN108984726B (en)
Inventor
秦丹阳
冯攀
纪萍
马静雅
张岩
杨松祥
Current Assignee
Shenzhen Litong Information Technology Co ltd
Original Assignee
Heilongjiang University
Priority date
Filing date
Publication date
Application filed by Heilongjiang University
Priority to CN201810759844.2A
Publication of CN108984726A
Application granted
Publication of CN108984726B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions, with fixed number of clusters, e.g. K-means clustering

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a method for annotating images with titles based on an extended sLDA model. It is proposed to overcome the shortcomings of existing image annotation methods, which run into scalability problems, can only handle a small annotation vocabulary, and lack generality and ease of use. The method comprises: for an input image, extracting the local features of the image and obtaining the N visual words of the image using the K-means algorithm; expressing the posterior distribution of the hidden variables of a given document using the LDA model; introducing a response variable and defining the response variable's distribution as a multivariate Bernoulli distribution; approximating the formula using the convexity-based variational inference algorithm for LDA; finding the values of the variational parameters; estimating the model parameters; and predicting the distribution of the response variable. The present invention is suitable for image title annotation systems.

Description

Method for annotating images with titles based on an extended sLDA model
Technical field
The present invention relates to the field of image annotation methods, and in particular to a method for annotating images with titles based on an extended sLDA model.
Background technique
Over the past few decades, the problems of image and video retrieval have been at the forefront of computer vision research. Even so, with the huge number of pictures and videos now available online, the demand for efficient algorithms to search and navigate large-scale collections continues to grow. Current state-of-the-art image search engines rely heavily on annotated text or titles to identify and retrieve images. Although this approach allows high-level semantic queries, the title information that is vital to the success of text-based search techniques is usually obtained manually, and this manual process cannot scale with the ever-growing size of today's multimedia corpora. It is therefore necessary to automate the annotation process. Because of its potential impact on a wide range of applications involving digital media archives, attention to designing and developing automated tools for annotating images and videos has grown steadily in recent years.
In the absence of titles, the task of an annotation algorithm is to predict the missing title by learning the patterns of association between images and text. Past work in this field can be roughly divided into two groups. In the first group, the image annotation problem is cast as a supervised learning problem in which annotations are treated as concept classes. For each word in the vocabulary, a class-conditional density is learned from labeled images. During annotation, the posterior distribution of the class labels is computed, and the concepts with the highest probability are used as the predicted title. In practice, this approach runs into scalability problems and can only handle a small annotation vocabulary, because a class-conditional density must be learned for every word.
The second group treats annotations and image data on a more equal footing, modeling the joint statistical correlation between the two data types. These models use a latent-variable framework: by assuming that each document has a set of hidden factors that govern the association between image features and the corresponding caption words, they learn the joint probability distribution of text and image features.
Summary of the invention
The purpose of the present invention is to overcome the shortcomings of existing image annotation methods, which run into scalability problems, can only handle a small annotation vocabulary, and lack generality and ease of use, and to propose a method for annotating images with titles based on an extended sLDA model that can handle the multidimensional binary response variable of the annotation data, comprising:
Step 1: For an input image, extract the local features of the image and use the K-means algorithm to obtain the N visual words w_n of the image, where w_n ∈ {1, 2, ..., N}.
Step 2: Use the LDA model to express the posterior distribution of the hidden variables of a given document, where α and β are model parameters, and z and θ are the topic variables and topic proportions, respectively.
Step 3: Introduce the response variable y and the response-variable parameters η and δ into step 2, and define the distribution of the response variable as a multivariate Bernoulli distribution, i.e., expressed as formula (3):
Step 4: Following the convexity-based variational inference algorithm for LDA, approximate formula (5) by the variational distribution q(θ, z | γ, φ) = q(θ | γ) ∏_{n=1..N} q(z_n | φ_n), where the Dirichlet parameter γ and the multinomial parameters (φ_1, φ_2, ..., φ_N) are free variational parameters; z_n is the topic variable of the n-th word; and the difference between the expectations of log p(θ, z, w | α, β, η, δ) and log q(θ, z | γ, φ) under q is denoted L.
Step 5: Find the variational parameters γ and φ that maximize the lower bound of L.
Step 6: Estimate the model parameters ψ = {α, β, η, δ}.
Step 7: Predict the distribution p(y | w) of the response variable y from the model parameters ψ and the variational parameters γ and φ.
Further, step 3 is specifically:
Generate the response variable y using z̄, η, and δ, where z̄ = (1/N) Σ_{n=1..N} z_n is the empirical topic frequency of the document. If the distribution of the response variable y satisfies a generalized linear model:
p(y | z_{1:N}, η, δ) = h(y, δ) exp{((η^⊤z̄)y − A(η^⊤z̄)) / δ}
where A(·) is the log-normalizer, then formula (3) can be expressed as formula (5).
Further, step 4 is specifically:
Approximate formula (5) by the variational distribution q(θ, z | γ, φ) = q(θ | γ) ∏_{n=1..N} q(z_n | φ_n). Bounding the log-likelihood with Jensen's inequality gives formula (8); letting L(γ, φ; α, β) denote the right-hand side of formula (8), formula (8) is expressed as
log p(w | α, β) = L(γ, φ; α, β) + D(q(θ, z | γ, φ) || p(θ, z | w, α, β)) (9)
L is then written as formula (10) by using the factorization of p and q.
Further, step 5 is specifically:
Step 5.1: In formula (13), maximize the lower bound of L with respect to φ_ni, where φ_ni denotes the probability that the n-th visual word is generated by hidden topic i, so that Σ_i φ_ni = 1. Separate the terms containing φ_ni and add the appropriate Lagrange multiplier to form the Lagrangian, in which Ψ(x) is the digamma function.
Compute the derivative with respect to φ_ni, where β_iv denotes p(w_n^v = 1 | z_n^i = 1) for the appropriate v, v being the v-th word of the dictionary.
In the case where the response variable obeys a Bernoulli distribution, this further yields the update formula (16) for the parameter φ_n.
Step 5.2: Maximize the above with respect to γ_i, where γ_i denotes the i-th component of the posterior Dirichlet parameter. Collect the terms containing γ_i, differentiate with respect to γ_i, and set the derivative to zero, which gives
γ_i = α_i + Σ_{n=1..N} φ_ni (19)
Iterate equations (16) through (19) until the bound converges; this yields the variational parameters γ and φ that maximize the lower bound of L.
Further, step 6 is specifically:
Step 6.1: The formula for estimating the parameter β is:
β_ij ∝ Σ_{d=1..M} Σ_{n=1..N_d} φ_dni w_dn^j
Step 6.2: The procedure for estimating the parameter α is: differentiate formula (22) to obtain formula (23), then find the value of α by applying Newton's method to formula (23).
Step 6.3: The procedure for estimating the parameters η and σ², where μ(·) = E_GLM[Y | ·], is: differentiate with respect to σ² and evaluate at the estimate of η.
The parameter estimates are finally obtained by this computation; combining the parameters α_i, β_ij, η_i, and δ_i yields the model parameters ψ = {α, β, η, δ}.
Further, step 7 is specifically:
Take a new document w without a title as input; the task is to infer the most probable caption words. Use φ_n and q(θ) to approximately solve the conditional probability p(y | w); the resulting p(y | w) is used to infer the most probable caption words for the new document w.
The invention has the following beneficial effects:
1. The present invention adjusts the structure of Corr-LDA by deleting the variable x, so that the image topics can be used directly to predict caption topics, without integrating over the posterior probability of the captions (a step that Corr-LDA requires). It also extends sLDA so that the model can handle a multivariate binary response variable, remedying the shortcoming that sLDA can handle only a single response variable. Image annotation is therefore more detailed, and image retrieval correspondingly more convenient and accurate.
2. When the number of topics and the vocabulary size are large, the prediction accuracy of the present invention is clearly higher than that of the Corr-LDA model, by 0.04 on average.
Description of the drawings
Fig. 1 is the graphical model structure of sLDA-bin;
Fig. 2 is the graphical model structure of Corr-LDA;
Fig. 3 shows the error curves between the predicted and observed responses for sLDA-bin and LDA;
Fig. 4 shows the caption prediction curves of Corr-LDA and sLDA-bin when K = 30;
Fig. 5 shows the caption prediction curves of Corr-LDA and sLDA-bin when N = 256;
Fig. 6 shows the caption prediction curves of Corr-LDA and sLDA-bin when N = 512;
Fig. 7 shows the accuracy curves of Corr-LDA and sLDA-bin when annotating a selection of objects.
Specific embodiments
The method of the present invention for annotating images with titles based on an extended sLDA model, referred to for short as sLDA-bin, comprises:
Step 1: For an input image, extract the local features of the image and use the K-means algorithm to obtain the N visual words w_n of the image, where w_n ∈ {1, 2, ..., N}.
Step 2: Use the LDA model to express the posterior distribution of the hidden variables of a given document:
p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)
where α and β are model parameters, and z and θ are the topic variables and topic proportions, respectively.
Step 3: Introduce the response variable y and the response-variable parameters η and δ into step 2, and define the distribution of the response variable as a multivariate Bernoulli distribution, i.e., expressed as formula (3):
Step 4: Following the convexity-based variational inference algorithm for LDA, approximate formula (5) by the variational distribution q(θ, z | γ, φ) = q(θ | γ) ∏_{n=1..N} q(z_n | φ_n), where the Dirichlet parameter γ and the multinomial parameters (φ_1, φ_2, ..., φ_N) are free variational parameters; z_n is the topic variable of the n-th word. Denote by L the difference between the expectations of log p(θ, z, w | α, β, η, δ) and log q(θ, z | γ, φ) under q, and determine the expression for the lower bound of L.
Step 5: Find the variational parameters γ and φ that maximize the lower bound of L.
Step 6: Estimate the model parameters ψ = {α, β, η, δ}.
Step 7: Predict the distribution p(y | w) of the response variable y from the model parameters ψ and the variational parameters γ and φ.
The principle and process of this embodiment are described in detail below. Note that a variable with subscript n has the same meaning as the same variable without it; the subscript merely emphasizes that the variable is the one corresponding to the n-th of the N words. For example, z and z_n have the same meaning; z_n emphasizes that it is the topic variable of the n-th word, while z does not emphasize this, but they mean the same thing.
Step 1: Data representation. Extract N local features of the image, then cluster the N features with K-means. Given k initial mean points m_1, ..., m_k, alternate between the following two steps:
Assignment: Assign each observation point to a cluster so that the within-group sum of squares is minimized. Since this sum of squares is just the squared Euclidean distance, this intuitively assigns each observation to its nearest mean point:
S_i = {x_p : ||x_p − m_i||² ≤ ||x_p − m_j||² for all j}
where each x_p is assigned to exactly one cluster S_i, even though in theory it could be assigned to two or more.
Update: For each cluster obtained in the previous step, take the centroid of the observations in the cluster as the new mean point:
m_i = (1/|S_i|) Σ_{x_p ∈ S_i} x_p
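For illustration, a minimal sketch of this step follows (a sketch only: the method does not fix a particular local-feature extractor, and the use of scikit-learn's KMeans here is an assumption for concreteness):

import numpy as np
from sklearn.cluster import KMeans

# Fit the visual dictionary on descriptors pooled from the training images.
def fit_codebook(descriptor_pool, n_words=256, seed=0):
    # KMeans.fit alternates exactly the assignment and update steps above.
    return KMeans(n_clusters=n_words, n_init=10, random_state=seed).fit(descriptor_pool)

# Quantize one image's local descriptors into its visual-word sequence w_1..w_N.
def to_visual_words(codebook, image_descriptors):
    return codebook.predict(image_descriptors)  # nearest-centroid index per descriptor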
After the image has been converted into a document composed of visual words, document annotation is carried out with the LDA model. LDA is a typical bag-of-words model: it treats a document as a set formed by a group of words, with no ordering or sequential relationship between words. A document may contain multiple topics, and each word in the document is generated by one of those topics.
Step 2: In the LDA model, a document is generated as follows:
Sample the topic distribution θ_i of document i from the Dirichlet distribution α; sample the topic z_i,j of the j-th word of document i from the multinomial distribution θ_i; sample the word distribution φ_{z_i,j} of topic z_i,j from the Dirichlet distribution β; and sample the word w_i,j from the multinomial distribution φ_{z_i,j}.
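A minimal sketch of this generative process (NumPy; here K, V, and n_words are illustrative sizes, and the topic-word distributions are sampled inside the function only to keep the sketch self-contained — in full LDA they are shared across the corpus):

import numpy as np

rng = np.random.default_rng(0)

def generate_document(alpha, beta_prior, K, V, n_words):
    # Sample one bag-of-words document from the LDA generative process.
    phi = rng.dirichlet(np.full(V, beta_prior), size=K)   # topic-word distributions
    theta = rng.dirichlet(np.full(K, alpha))              # document's topic proportions
    z = rng.choice(K, size=n_words, p=theta)              # topic of each word
    w = np.array([rng.choice(V, p=phi[zi]) for zi in z])  # word drawn from its topic
    return w, z, theta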
To use LDA, the key inference problem that needs to be solved is computing the posterior distribution of the hidden variables given a document:
p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)
Step 3: Generate the response variable y using z̄, η, and δ, where z̄ = (1/N) Σ_{n=1..N} z_n. This variable could be the number of stars given to a movie, the number of downloads of an online article, or the category of a document. We model the documents and responses jointly, in order to find the latent topics that best predict the response variables of future unlabeled documents.
sLDA accommodates various types of response using the same probability machinery as the generalized linear model: unconstrained real values, real values constrained to be positive (such as failure times), ordered or unordered class labels, non-negative integers (such as count data), and other types. The distribution of the response variable is a generalized linear model (GLM):
p(y | z_{1:N}, η, δ) = h(y, δ) exp{((η^⊤z̄)y − A(η^⊤z̄)) / δ}
The GLM framework adds flexibility: as long as the distribution of the response variable can be written in the form above, it can be modeled. Different distributions correspond to different h(y, δ) and log-normalizers A(·).
Formula (3) then becomes formula (5), the posterior of the hidden variables given both the words and the response, where z̄ = (1/N) Σ_n z_n.
However, the coupling between θ and β over the latent topics makes this quantity intractable to compute directly, so the convexity-based variational inference algorithm for LDA is used.
To handle the multivariate binary response variable of the annotation data, the distribution of the response variable is modeled as a multivariate Bernoulli distribution whose probabilities are defined through the logistic link function, where y_i ∈ {0, 1}, σ(x) = 1/(1 + e^{−x}) is the logistic function, and {η_i, δ_i} are the regression coefficients of caption word i.
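A minimal sketch of this response model under the assumptions just stated (logistic link applied to the empirical topic frequencies z̄; treating δ_i as a per-caption-word intercept is our reading of the regression coefficients {η_i, δ_i}):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def caption_log_likelihood(y, z_bar, eta, delta):
    # log p(y | z_bar) under the multivariate Bernoulli response model.
    # y:     (C,) binary caption-word indicators, y_i in {0, 1}
    # z_bar: (K,) empirical topic frequencies of the document
    # eta:   (C, K) per-caption-word regression coefficients
    # delta: (C,) per-caption-word intercepts (assumed interpretation)
    p = np.clip(sigmoid(eta @ z_bar + delta), 1e-12, 1 - 1e-12)
    return np.sum(y * np.log(p) + (1 - y) * np.log1p(-p))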
Step 4: Variational computation. The basic idea of convexity-based variational inference is to use Jensen's inequality to obtain an adjustable lower bound on the log-likelihood. This family is characterized by the following variational distribution:
q(θ, z | γ, φ) = q(θ | γ) ∏_{n=1..N} q(z_n | φ_n)
where the Dirichlet parameter γ and the multinomial parameters (φ_1, φ_2, ..., φ_N) are all free variational parameters.
After specifying this simplified family of probability distributions, the next step is to set up an optimization problem that determines the values of the variational parameters γ and φ. The requirement of finding the tightest lower bound on the log-likelihood translates into the following optimization problem:
(γ*, φ*) = argmin_{γ,φ} D(q(θ, z | γ, φ) || p(θ, z | w, α, β))
That is, the goal is achieved by minimizing the relative entropy between the variational distribution and the true posterior distribution p(θ, z | w, α, β).
First, bound the log-likelihood of a document using Jensen's inequality. For simplicity, the parameters γ and φ are omitted:
log p(w | α, β) ≥ E_q[log p(θ, z, w | α, β)] − E_q[log q(θ, z)]
Jensen's inequality thus provides a lower bound on the log-likelihood for any variational distribution q(θ, z | γ, φ). The difference between the two sides of the above formula is exactly the relative entropy between the variational distribution and the true posterior. The first term on the right is the expectation of the log joint probability of the hidden and observed variables; the second term is the entropy of the variational distribution, H(q) = −E_q[log q(θ, z)]. Letting L(γ, φ; α, β) denote the right-hand side, the above formula becomes
log p(w | α, β) = L(γ, φ; α, β) + D(q(θ, z | γ, φ) || p(θ, z | w, α, β)) (9)
This shows that maximizing the lower bound L(γ, φ; α, β) over γ and φ is equivalent to minimizing the relative entropy between the variational posterior and the true posterior.
Expand the lower bound by using the factorization of p and q:
L = E_q[log p(θ | α)] + E_q[log p(z | θ)] + E_q[log p(w | z, β)] + E_q[log p(y | z, η, δ)] + H(q)
The above is then expanded into an equation in the model parameters (α, β) and the variational parameters (γ, φ). The first three terms are the expectations under q of the log probabilities of the topic proportions, the topic assignments, and the words; H(q) = −E_q[log q(θ, z)] is the entropy of the variational distribution.
Note that the fourth term of the sLDA variational objective differs from LDA: it is the expected log probability of the response variable given the latent topic assignments.
It can be seen that computing the lower bound involves two expectations. The first is E_q[z̄] = φ̄ := (1/N) Σ_{n=1..N} φ_n; the second is the expected log-normalizer E_q[A(η^⊤z̄)]. The first can be computed directly; the second can be computed exactly in some models but usually requires approximation, in a manner that depends on the type of the response.
Here, the logistic function complicates the computation of the expectation E[log p(y | z, η, δ)]. Using convex duality, the logistic function is expressed as a pointwise supremum of quadratic functions; more specifically, each log-logistic term is bounded below by a quadratic function of η^⊤z̄ with an auxiliary variational parameter.
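For reference, the standard convex-duality bound of Jaakkola and Jordan has exactly this pointwise-supremum form; we reproduce it here as a presumed reconstruction of the omitted formula:

log σ(x) ≥ log σ(ξ) + (x − ξ)/2 − λ(ξ)(x² − ξ²), with λ(ξ) = tanh(ξ/2)/(4ξ)

Equality holds at x = ±ξ, so maximizing over the auxiliary parameter ξ recovers log σ(x) as a supremum; substituting the bound makes E[log p(y | z, η, δ)] quadratic in z̄ and hence tractable under q.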
Step 5: Maximize the lower bound L over the variational parameters γ and φ.
Step 5.1: Maximize with respect to φ_ni, which denotes the probability that the n-th visual word is generated by hidden topic i, so that Σ_i φ_ni = 1. Separate the terms containing φ_ni and add the appropriate Lagrange multiplier to form the Lagrangian:
For simplicity, the arguments of L are dropped, and the subscript φ_ni indicates that only the terms of L that are functions of φ_ni are retained. The derivative with respect to φ_ni is obtained:
where β_iv denotes p(w_n^v = 1 | z_n^i = 1) for the appropriate v, v being the v-th word of the dictionary. As can be seen, the update for φ_ni depends on the expected response term: for some distributions, such as the Gaussian or the Poisson, an exact result can be obtained, while in other cases a gradient-based optimization method is needed.
In the case where the response variable obeys a Bernoulli distribution, the update formula (16) for the parameter φ_n is computed.
Step 5.2: Maximize the above with respect to γ_i, the i-th component of the posterior Dirichlet parameter. Collect the terms containing γ_i, differentiate with respect to γ_i, and set the derivative to zero:
γ_i = α_i + Σ_{n=1..N} φ_ni (19)
Since equation (19) depends on the variational multinomials φ, complete variational inference requires alternating between equations (16) and (19) until the bound converges.
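A minimal sketch of this alternation (plain-LDA updates only: the Bernoulli response correction to φ in equation (16) is omitted because its exact form is not reproduced above; the γ update is equation (19)):

import numpy as np
from scipy.special import digamma

def variational_e_step(word_ids, alpha, beta, n_iter=100, tol=1e-5):
    # Coordinate ascent on (gamma, phi) for one document.
    # word_ids: length-N array of visual-word indices w_1..w_N
    # alpha:    (K,) Dirichlet parameter
    # beta:     (K, V) topic-word probabilities
    K, N = alpha.shape[0], len(word_ids)
    phi = np.full((N, K), 1.0 / K)         # q(z_n = i) = phi[n, i]
    gamma = alpha + N / K                  # standard initialization
    for _ in range(n_iter):
        old_gamma = gamma.copy()
        # phi_ni proportional to beta_{i, w_n} * exp(digamma(gamma_i))
        log_phi = np.log(beta[:, word_ids].T + 1e-100) + digamma(gamma)
        phi = np.exp(log_phi - log_phi.max(axis=1, keepdims=True))
        phi /= phi.sum(axis=1, keepdims=True)
        gamma = alpha + phi.sum(axis=0)    # equation (19)
        if np.abs(gamma - old_gamma).mean() < tol:
            break
    return gamma, phi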
Step 6: Parameter estimation.
Step 6.1: Maximize the lower bound with respect to the conditional multinomial parameter β. Extract the terms containing β and add a Lagrange multiplier; differentiating with respect to β and setting the derivative to zero finally gives
β_ij ∝ Σ_{d=1..M} Σ_{n=1..N_d} φ_dni w_dn^j
where d indexes the M training documents and w_dn^j = 1 if the n-th word of document d is the j-th dictionary word.
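A sketch of this M-step across a corpus (standard smoothed form; docs and phis come from per-document E-steps like the sketch above):

import numpy as np

def m_step_beta(docs, phis, K, V, smooth=1e-9):
    # beta_ij proportional to sum_d sum_n phi_dni * w_dn^j, rows normalized.
    # docs: list of length-N_d arrays of word indices
    # phis: list of (N_d, K) variational multinomials from the E-step
    beta = np.full((K, V), smooth)
    for word_ids, phi in zip(docs, phis):
        np.add.at(beta.T, word_ids, phi)   # accumulate phi_n into column w_n
    return beta / beta.sum(axis=1, keepdims=True)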
Step 6.2: Maximize the lower bound with respect to the Dirichlet parameter α. Collect the terms containing α and differentiate with respect to α_i. The derivative depends on α_j for j ≠ i, so an iterative method must be used to find a suitable α. In particular, the Hessian matrix can be written in the form H = diag(h) + 1z1^⊤, so a linear-time Newton iteration can be invoked. The update of α can then be written
α_new = α_old − H(α_old)^{-1} g(α_old) (25)
where H(α) and g(α) are the Hessian matrix and the gradient at the point α, respectively.
Multiplying the inverse Hessian by the gradient gives the i-th component:
(H^{-1}g)_i = (g_i − c)/h_i, where c = (Σ_{j=1..K} g_j/h_j) / (z^{-1} + Σ_{j=1..K} h_j^{-1})
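A sketch of one such linear-time Newton step for α (using the special-structure inversion above; in practice the step is damped if it would drive any α_i non-positive):

import numpy as np
from scipy.special import digamma, polygamma

def alpha_newton_step(alpha, gammas):
    # One Newton update for the Dirichlet parameter alpha in linear time.
    # gammas: (M, K) variational Dirichlet parameters of the M documents
    M, K = gammas.shape
    g = M * (digamma(alpha.sum()) - digamma(alpha)) \
        + (digamma(gammas) - digamma(gammas.sum(axis=1, keepdims=True))).sum(axis=0)
    h = -M * polygamma(1, alpha)           # H = diag(h) + 1 z 1^T
    z = M * polygamma(1, alpha.sum())
    c = (g / h).sum() / (1.0 / z + (1.0 / h).sum())
    return alpha - (g - c) / h             # alpha_new = alpha_old - H^{-1} g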
Step 6.3: Estimate the GLM parameters η and σ²,
where μ(·) = E_GLM[Y | ·].
Differentiate with respect to σ², evaluating at the estimate of η:
Equation (29) can be computed because the rightmost summation has already been evaluated, exactly or approximately, while optimizing the coefficients η. From h(y, δ) and its partial derivative with respect to δ, the estimate of δ can be obtained, either in closed form or by one-dimensional numerical optimization. By this computation, the parameter estimates of the model are finally obtained.
Step 7: Prediction. Given a new document w with no title, the task is to infer the most probable title, i.e., the response variable y. To complete this task, we run the variational computation on the document until convergence, and use φ_n together with q(θ) to approximately solve the conditional probability p(y | w). With the titles held out, we infer φ_n from the test image, and we compare the performance of sLDA-bin and Corr-LDA. To measure the caption quality of the models, we compute the caption prediction probability defined below.
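A minimal sketch of this prediction step under the assumptions already made above (Bernoulli response with logistic link, approximating z̄ by its variational mean φ̄; the exact approximation in the omitted formula may differ):

import numpy as np

def predict_caption_probs(word_ids, alpha, beta, eta, delta):
    # Approximate p(y_i = 1 | w) for each caption word of a new image.
    gamma, phi = variational_e_step(word_ids, alpha, beta)  # sketch above
    z_bar_mean = phi.mean(axis=0)          # E_q[z_bar] = phi_bar
    scores = eta @ z_bar_mean + delta      # per-caption-word linear predictor
    return 1.0 / (1.0 + np.exp(-scores))   # logistic link

# The most probable caption words are those with the highest probabilities, e.g.:
# top5 = np.argsort(-predict_caption_probs(w, alpha, beta, eta, delta))[:5]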
The advantages of the present invention over the prior art can be described in detail with reference to Fig. 1 through Fig. 7:
Fig. 1 shows the graphical model structure of sLDA-bin. Owing to the outstanding performance of the sLDA model in caption prediction, we adopt the structure of sLDA. As can be seen, sLDA-bin adjusts the structure of Corr-LDA by deleting a variable, so that the image topics are used directly to predict captions, without integrating over the posterior probability of the captions (a step that Corr-LDA requires). sLDA is also extended so that the model can handle a multivariate binary response variable, remedying the shortcoming that sLDA can handle only a single response variable. Image annotation is therefore more detailed, and image retrieval correspondingly more convenient and accurate.
Fig. 2 shows the structure of Corr-LDA, a currently popular LDA model. Unlike sLDA-bin, Corr-LDA restricts each caption word to be associated with a specific image region: each caption word is generated by first selecting an image region, which is reflected in the additional variable x in Fig. 2.
Prediction performance analysis for different numbers of topics:
Fig. 3 shows a comparison between the sLDA-bin model and the unsupervised LDA model on movie-review prediction. We assess prediction quality using five-fold cross-validation, measuring the correlation between the predictions and the response variable: the stronger the correlation, the closer the predictions are to the observed responses and the higher the prediction quality. As can be seen from the figure, when the number of topics is low, the prediction quality of unsupervised LDA is higher than that of sLDA-bin, but as the number of topics increases, the prediction quality of sLDA-bin rises rapidly and surpasses the unsupervised LDA model. This is because sLDA-bin uses the structure of sLDA, in which the topic variables of the model are directly related to the response variable; increasing the number of topics therefore gives us more information about the response variable, which is of real help in predicting it correctly.
Figs. 5 and 6 show, for vocabulary sizes N = 256 and N = 512 respectively, the effect of the number of topics K on the prediction probabilities of the two models. We randomly selected five settings between K = 5 and K = 60 for comparison. First, whether N = 256 or N = 512, the prediction curve of sLDA-bin is consistently above that of Corr-LDA; we attribute this good predictive ability to the direct association the model makes between image topics and the corresponding captions. By eliminating the hidden variable, the image topics are used directly to predict captions, without integrating over the posterior probability of the caption topics (a step Corr-LDA requires). With the vocabulary size fixed, sLDA-bin is more sensitive to the number of topics: as the number of topics grows, its prediction quality improves markedly, but if the number of topics is too large, prediction deteriorates. This is because the response variable of sLDA-bin is directly associated with the topics, so too many topics lead to overfitting and, counterproductively, poor prediction. The performance of sLDA-bin therefore depends on a suitable combination of the number of topics and the vocabulary size.
Prediction performance analysis for different vocabulary sizes:
As shown in Fig. 4, with the number of topics fixed at K = 30 and the vocabulary size N set to 128, 256, and 512, sLDA-bin gives better prediction than the Corr-LDA model in every case. We therefore conclude that as the number of visual words in the visual dictionary increases, the prediction probability also improves, which can likewise be seen by comparing Fig. 5 and Fig. 6. With a larger vocabulary, the model captures more detailed information about the images and thus understands their content better, producing better image-caption matching.
Image classification performance evaluation:
Fig. 7 shows the annotation accuracy of the sLDA-bin model and the Corr-LDA model on a selection of common objects. For every item, the annotation accuracy of sLDA-bin is higher than that of Corr-LDA; overall, the accuracy of sLDA-bin exceeds that of Corr-LDA by about 0.04 on average. The reason is that Corr-LDA restricts each caption word to be associated with one specific image region. In reality, some annotation words describe the whole scene, and using such a restrictive association model is grossly inaccurate. By performing regression on the topic proportions, sLDA-bin allows each caption word to be influenced by the topics of all image regions as well as by specific regions, according to the corresponding regression coefficients. The correlation model of the present invention is therefore more general and more accurately reflects how real annotations are generated.
The present invention may also have various other embodiments. Without departing from the spirit and essence of the present invention, those skilled in the art can make various corresponding changes and modifications in accordance with the present invention, and all such corresponding changes and modifications shall fall within the protection scope of the appended claims of the present invention.

Claims (6)

1. A method for annotating images with titles based on an extended sLDA model, characterized by comprising:
Step 1: For an input image, extracting the local features of the image and using the K-means algorithm to obtain the N visual words w_n of the image, where n ∈ {1, 2, ..., N} and N is a positive integer;
Step 2: Using the LDA model to express the posterior distribution of the hidden variables of a given document:
p(θ, z | w, α, β) = p(θ, z, w | α, β) / p(w | α, β)
where α and β are model parameters, and z and θ are the topic variables and topic proportions, respectively;
Step 3: Introducing the response variable y and the response-variable parameters η and δ into step 2, and defining the distribution of the response variable as a multivariate Bernoulli distribution, i.e., expressed as formula (3);
Step 4: Following the convexity-based variational inference algorithm for LDA, approximating formula (5) by the variational distribution q(θ, z | γ, φ) = q(θ | γ) ∏_{n=1..N} q(z_n | φ_n), where the Dirichlet parameter γ and the multinomial parameters (φ_1, φ_2, ..., φ_N) are free variational parameters, z_n is the topic variable of the n-th word, and the difference between the expectations of log p(θ, z, w | α, β, η, δ) and log q(θ, z | γ, φ) under q is denoted L;
Step 5: Finding the variational parameters γ and φ that maximize the lower bound of L;
Step 6: Estimating the model parameters ψ = {α, β, η, δ};
Step 7: Predicting the distribution p(y | w) of the response variable y from the model parameters ψ and the variational parameters γ and φ.
2. The method for annotating images with titles based on an extended sLDA model according to claim 1, characterized in that step 3 is specifically:
generating the response variable y using z̄, η, and δ, where z̄ = (1/N) Σ_{n=1..N} z_n; if the distribution of the response variable y satisfies a generalized linear model:
p(y | z_{1:N}, η, δ) = h(y, δ) exp{((η^⊤z̄)y − A(η^⊤z̄)) / δ}
where A(·) is the log-normalizer, then formula (3) can be expressed as formula (5).
3. The method for annotating images with titles based on an extended sLDA model according to claim 2, characterized in that step 4 is specifically:
approximating formula (5) by the variational distribution q(θ, z | γ, φ) = q(θ | γ) ∏_{n=1..N} q(z_n | φ_n); letting L(γ, φ; α, β) denote the right-hand side of formula (8), formula (8) is expressed as
log p(w | α, β) = L(γ, φ; α, β) + D(q(θ, z | γ, φ) || p(θ, z | w, α, β)) (9)
and L is written as formula (10) by using the factorization of p and q.
4. The method for annotating images with titles based on an extended sLDA model according to claim 3, characterized in that step 5 is specifically:
Step 5.1: maximizing the lower bound of L with respect to φ_ni, where φ_ni denotes the probability that the n-th visual word is generated by hidden topic i, so that Σ_i φ_ni = 1, by separating the terms containing φ_ni and adding the appropriate Lagrange multiplier to form the Lagrangian, in which Ψ(x) is the digamma function;
computing the derivative with respect to φ_ni, where β_iv denotes p(w_n^v = 1 | z_n^i = 1) for the appropriate v, v being the v-th word of the dictionary;
further obtaining, in the case where the response variable obeys a Bernoulli distribution, the update formula (16) for the parameter φ_n;
Step 5.2: maximizing the lower bound of L with respect to γ_i, where γ_i denotes the i-th component of the posterior Dirichlet parameter, by collecting the terms containing γ_i, differentiating with respect to γ_i, and setting the derivative to zero, which gives
γ_i = α_i + Σ_{n=1..N} φ_ni (19)
and iterating equations (16) through (19) until the bound converges, thereby obtaining the variational parameters γ and φ that maximize the lower bound of L.
5. The method for annotating images with titles based on an extended sLDA model according to claim 4, characterized in that step 6 is specifically:
Step 6.1: the formula for estimating the parameter β is:
β_ij ∝ Σ_{d=1..M} Σ_{n=1..N_d} φ_dni w_dn^j
Step 6.2: the procedure for estimating the parameter α is: for formula (22), differentiating to obtain formula (23), and finding the value of α by applying Newton's method to formula (23), where M denotes the number of documents in the training set and the subscript d denotes the d-th document;
Step 6.3: the procedure for estimating the parameters η and σ², where μ(·) = E_GLM[Y | ·], is: differentiating with respect to σ² and evaluating at the estimate of η;
finally obtaining the parameter estimates by this computation, and combining the parameters α_i, β_ij, η_i, and δ_i to obtain the model parameters ψ = {α, β, η, δ}.
6. The method for annotating images with titles based on an extended sLDA model according to claim 5, characterized in that step 7 is specifically:
taking a new document w without a title as input, and using φ_n and q(θ) to approximately solve the conditional probability p(y | w), which is used to infer the most probable caption words for the new document w.
CN201810759844.2A 2018-07-11 2018-07-11 Method for performing title annotation on image based on expanded sLDA model Active CN108984726B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810759844.2A CN108984726B (en) 2018-07-11 2018-07-11 Method for performing title annotation on image based on expanded sLDA model


Publications (2)

Publication Number Publication Date
CN108984726A true CN108984726A (en) 2018-12-11
CN108984726B CN108984726B (en) 2022-10-04

Family

ID=64537058

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810759844.2A Active CN108984726B (en) 2018-07-11 2018-07-11 Method for performing title annotation on image based on expanded sLDA model

Country Status (1)

Country Link
CN (1) CN108984726B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487185A (en) * 2020-11-27 2021-03-12 国家电网有限公司客户服务中心 Data classification method in power customer field

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102112987A (en) * 2008-05-30 2011-06-29 微软公司 Statistical approach to large-scale image annotation
CN102819746A (en) * 2012-07-10 2012-12-12 电子科技大学 Method for automatically marking category of remote sensing image based on author-genre theme model
CN103810500A (en) * 2014-02-25 2014-05-21 北京工业大学 Place image recognition method based on supervised learning probability topic model
CN103942274A (en) * 2014-03-27 2014-07-23 东莞中山大学研究院 Labeling system and method for biological medical treatment image on basis of LDA
CN106980867A (en) * 2016-01-15 2017-07-25 奥多比公司 Semantic concept in embedded space is modeled as distribution

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
D. Putthividhya et al.: "Supervised topic model for automatic image annotation", 2010 IEEE International Conference on Acoustics, Speech and Signal Processing *
Li Xiaoxu: "Research on image classification and annotation based on probabilistic topic models", China Doctoral Dissertations Full-text Database, Information Science and Technology *
Zhao Yedong: "Image-based traffic scene understanding", China Master's Theses Full-text Database, Information Science and Technology *


Also Published As

Publication number Publication date
CN108984726B (en) 2022-10-04


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20221205

Address after: 212000, First Floor, No. 103, Zhongshan West Road, Runzhou District, Zhenjiang, Jiangsu

Patentee after: Zhenjiang Kangda Lianhe Intelligent Technology Co.,Ltd.

Address before: 150080 No. 74, Xuefu Road, Nangang District, Heilongjiang, Harbin

Patentee before: Heilongjiang University

TR01 Transfer of patent right

Effective date of registration: 20230113

Address after: 150000 No. 74, Xuefu Road, Nangang District, Heilongjiang, Harbin

Patentee after: Heilongjiang University

Address before: 212000, First Floor, No. 103, Zhongshan West Road, Runzhou District, Zhenjiang, Jiangsu

Patentee before: Zhenjiang Kangda Lianhe Intelligent Technology Co.,Ltd.

TR01 Transfer of patent right

Effective date of registration: 20231117

Address after: 509 Kangrui Times Square, Keyuan Business Building, 39 Huarong Road, Gaofeng Community, Dalang Street, Longhua District, Shenzhen, Guangdong Province, 518000

Patentee after: Shenzhen Litong Information Technology Co.,Ltd.

Address before: 150000 No. 74, Xuefu Road, Nangang District, Heilongjiang, Harbin

Patentee before: Heilongjiang University