CN103714178B - Automatic image marking method based on word correlation - Google Patents

Automatic image marking method based on word correlation

Info

Publication number
CN103714178B
Authority
CN
China
Prior art keywords
word
image
mark
training set
training
Prior art date
Legal status (assumed, not a legal conclusion)
Active
Application number
CN201410008553.1A
Other languages
Chinese (zh)
Other versions
CN103714178A (en)
Inventor
An Zhen (安震)
Current Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Original Assignee
Beijing Jingdong Century Trading Co Ltd
Beijing Jingdong Shangke Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Jingdong Century Trading Co Ltd and Beijing Jingdong Shangke Information Technology Co Ltd
Priority to CN201410008553.1A
Publication of CN103714178A
Application granted
Publication of CN103714178B
Active legal status: Current
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50: Information retrieval of still image data
    • G06F16/58: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155: Generating training patterns characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling

Abstract

The invention discloses an automatic image annotation method based on word correlation. A training set T comprises l images; each image of T is annotated with n annotation words; T has a corresponding set of visual lemmas; the image to be annotated is I. The method includes the following steps: compute a semantic vector for each annotation word w so that w is represented in vector form w = <v1, v2, …, vm>, where ci is a context association word and there are m context association words in total; compute the semantic similarity between annotation words, where ||·|| denotes the vector norm; compute p(A), where A is an annotation word group {w1, w2, … wn} and n is the number of words in the group; compute the conditional probability p(I|wi); and compute the annotation word group A for the image I via A = arg max_A p(I|A) p(A).

Description

An automatic image annotation method based on word correlation
Technical field
The present invention relates to the field of image processing, and in particular to an automatic image annotation method based on word correlation.
Background technology
With the rapid development of multimedia and Internet technology, daily life and work depend more and more on multimedia information such as images. Semantics-based image retrieval can accurately express the user's retrieval intent and is convenient to use; this retrieval mode has therefore not only become an important form of image retrieval, but also a hot technology pursued by researchers.
Automatic image annotation is an important and challenging task in semantic image retrieval. The technology emerged to automatically obtain the semantic information contained in the visual content of an image: it attempts to build a bridge between low-level visual features and high-level semantics, thereby supporting retrieval at the semantic level. Research on automatic annotation algorithms based on image semantics has accordingly become a very active research branch and key technology in the field of image retrieval, with good application prospects and research value.
Automatic image annotation lets a computer automatically add semantic keywords reflecting image content to unannotated images. Using an annotated image collection or other available information, it automatically learns a relational model between the semantic concept space and the visual feature space, and uses this model to annotate images of unknown semantics. By establishing a mapping between the high-level semantic information and the low-level features of an image, it alleviates the semantic-gap problem to a certain extent.
The cross-media relevance model (CMRM) is currently the most widely applied annotation algorithm among generative-model image annotation methods and has been widely studied. The basic idea of this annotation model is to use probabilistic statistics to establish the probabilistic correlation between the image visual feature space and the semantic concept space: the joint probability distribution between the two is learned statistically, the group of semantic annotation words that maximizes the joint probability with the image content is found, and this group of words is taken as the final annotation of the test image.
However, the cross-media relevance model is a kind of probabilistic model, and such models are biased toward high-frequency annotation words. Moreover, in its annotation procedure different candidate annotation words are assumed to be mutually independent, so the correlation between annotation words is not fully exploited. In fact, within one image, multiple associations such as co-occurrence, hierarchy, or spatial relations exist between annotation words.
For example, consider an image containing semantic objects such as "sun", "sky", "cloud", "mountain", and "tree". From the visual content it can be seen that the "sun" and "sky" objects have a certain spatial correlation: "sun" cannot exist in isolation from the semantic object "sky". Similarly, for the two semantic objects "mountain" and "tree", the "tree" object exists against "mountain" as its visual background, and the two are closely interwoven in the visual content; it is impossible to annotate under the absolute assumption that these two annotation words are independent. The cross-media relevance model's assumption that different candidate annotation words are mutually independent during annotation therefore has a certain defect: ignoring inter-word correlation may produce semantically inconsistent annotation words in the results.
Content of the invention
In view of this, the present invention provides an automatic image annotation method based on word correlation, to overcome the defect of the cross-media relevance model annotation algorithm that different candidate annotation words are assumed to be mutually independent during annotation, and to solve the problem of semantic inconsistency among annotation words caused by ignoring inter-word correlation. The technical scheme proposed by the invention is:
An automatic image annotation method based on word correlation, wherein a training set T comprises l images, the l images constituting an image collection P = [p1 p2 … pl]; each image of T is annotated with n annotation words, and all annotation words in T constitute an annotation word set W = [w1 w2 … ws]; each image of T has corresponding visual lemmas, and all visual lemmas in T constitute a visual lemma set B = [b1 b2 … by]; the image to be annotated is I. The method includes:
A. Compute the semantic vector of each annotation word w in T according to the formula vi = p(ci|w)/p(ci), representing w in vector form w = <v1, v2, …, vm>, where ci is a context association word and there are m context association words in total; p(ci) is the overall distribution probability of ci, and p(ci|w) is the ratio of the number of co-occurrences of ci and w in T to the total number of occurrences of w in T, i.e. p(ci|w) = count(ci, w)/count(w); the context association words are the annotation words in T.
B. Compute the semantic similarity between annotation words according to the formula sim(wi, wj) = (wi · wj)/(||wi|| · ||wj||), where ||·|| denotes the vector norm and wi · wj the dot product.
C. Compute p(A) according to the formula p(A) ∝ (1/(n − 1)) Σ_{wi∈A} Σ_{wj∈A, j≠i} sim(wi, wj), where A is an annotation word group {w1, w2, … wn} and n is the number of annotation words in the group.
D. Compute the conditional probability p(I|wi) according to the formula p(I|wi) = p(wi, b1, …, bn)/p(wi), where p(wi) is the ratio of the number of occurrences of wi in T to the total occurrences of all annotation words in T, i.e. p(wi) = |wi|/Σ_{wk∈T} |wk|.
The computation of p(wi, b1, …, bn) is p(wi, b1, b2, …, bn) = Σ_{J∈T} p(J) p(wi|J) Π_{k=1}^{n} p(bk|J), where p(J) is the probability of randomly drawing a training image J from the image collection P, p(wi|J) is the posterior probability that word wi occurs in training image J, and p(bk|J) is the posterior probability that visual lemma bk occurs in J.
E. Compute p(I|A) according to p(I|A) ≈ Π_{wi∈A} p(I|wi).
F. Compute the annotation word group A of the image I to be annotated by the formula A = arg max_A p(I|A) p(A).
In the above scheme, the computations of p(wi|J) and p(bk|J) in step D are respectively:

p(wi|J) = (1 − αJ) · #(wi, J)/|J| + αJ · #(wi, T)/|T|   (1)

p(bk|J) = (1 − βJ) · #(bk, J)/|J| + βJ · #(bk, T)/|T|   (2)

where αJ and βJ are smoothing parameters set empirically;
#(wi, J) indicates whether annotation word wi occurs in training image J: #(wi, J) = 1 if so, otherwise #(wi, J) = 0;
#(wi, T) indicates whether wi occurs in training set T: #(wi, T) = 1 if so, otherwise #(wi, T) = 0;
#(bk, J) indicates whether visual lemma bk occurs in training image J: #(bk, J) = 1 if so, otherwise #(bk, J) = 0;
|J| is the total number of annotation words and visual lemmas in training image J; |T| is the total number of annotation words and visual lemmas in training set T.
In summary, the technical scheme proposed by the present invention converts the joint probability computation over annotation words and image in the cross-media relevance model into two parts: the probability of the image given the annotation words, and the prior probability of the annotation word group. This greatly reduces the influence of high-frequency candidate annotation words on the probabilistic model, lets non-high-frequency candidates play a larger role, and improves their recall and precision. At the same time, a semantic-similarity language model is incorporated into the relevance model to estimate the prior probability of a group of annotation words, making it more likely to produce a group of annotation words with higher semantic relatedness, thereby improving the overall annotation effect.
Brief description of the drawings
Fig. 1 is the flow chart of the embodiment of the present invention.
Specific embodiment
To make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
The technical scheme of the present invention is:
A. Compute the semantic vector of each annotation word w in T according to the formula vi = p(ci|w)/p(ci), representing w in vector form w = <v1, v2, …, vm>, where ci is a context association word and there are m context association words in total; p(ci) is the overall distribution probability of ci, and p(ci|w) is the ratio of the number of co-occurrences of ci and w in T to the total number of occurrences of w in T, i.e. p(ci|w) = count(ci, w)/count(w); the context association words are the annotation words in T.
B. Compute the semantic similarity between annotation words according to the formula sim(wi, wj) = (wi · wj)/(||wi|| · ||wj||), where ||·|| denotes the vector norm and wi · wj the dot product.
C. Compute p(A) according to the formula p(A) ∝ (1/(n − 1)) Σ_{wi∈A} Σ_{wj∈A, j≠i} sim(wi, wj), where A is an annotation word group {w1, w2, … wn} and n is the number of annotation words in the group.
D. Compute the conditional probability p(I|wi) according to the formula p(I|wi) = p(wi, b1, …, bn)/p(wi), where p(wi) is the ratio of the number of occurrences of wi in T to the total occurrences of all annotation words in T, i.e. p(wi) = |wi|/Σ_{wk∈T} |wk|.
The computation of p(wi, b1, …, bn) is p(wi, b1, b2, …, bn) = Σ_{J∈T} p(J) p(wi|J) Π_{k=1}^{n} p(bk|J), where p(J) is the probability of randomly drawing a training image J from the image collection P, p(wi|J) is the posterior probability that word wi occurs in training image J, and p(bk|J) is the posterior probability that visual lemma bk occurs in J.
E. Compute p(I|A) according to p(I|A) ≈ Π_{wi∈A} p(I|wi).
F. Compute the annotation word group A of the image I to be annotated by the formula A = arg max_A p(I|A) p(A).
At present, the image annotation problem can be defined as follows: given a training set T comprising an image collection P and an annotation word set W, where every image pi has been annotated and the annotation words of all images constitute W, how should a group of annotation words A be chosen from W to annotate a new image I?
The image annotation method of the present invention adopts a probabilistic model. Its goal is to find the annotation word group A that maximizes the conditional probability p(A|I), i.e.:

A = arg max_A p(A|I)   (3)
where A is an annotation word group {w1, w2, … wn}, and the image I is represented by a group of visual features {b1, b2, …, bm}, obtained by preprocessing I (operations such as image segmentation, feature extraction, and feature-value normalization) and by image-block region classification. p(A|I) can be rewritten in the following form:
p(A|I) = p(A, I)/p(I)   (4)
Since the prior probability of an image is usually considered to be uniformly distributed, p(I) can be regarded as a constant, and
p(A, I) = p(I|A) p(A)   (5)
Simplifying formula (3) with formulas (4) and (5) gives:
A = arg max_A p(I|A) p(A)   (6)
The two probabilities p(I|A) and p(A) are combined, and the maximum is solved to find the optimal annotation word group A. p(I|A) can be obtained from the original image annotation model, and p(A) from the language model. Different weights are given to the two probabilities to represent the influence of the original image model and the language model on the final annotation effect:
A = arg max_A p(I|A)^{λ1} p(A)^{λ2}   (7)
Taking logarithms converts this into the following form:
A=arg maxa1log p(i/a)+λ2log p(a)) (8)
As long as p(A) and p(I|A) are computed, the annotation word group A can be obtained. λ1 and λ2 are determined during machine learning and model building on the training image collection, and are two constants during automatic annotation of a test image.
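Since the logarithm is monotonic, formulas (7) and (8) select the same maximiser. A quick numeric check of this equivalence, with made-up probabilities and illustrative weights (none of these values come from the patent):

```python
import math

# Hypothetical (p(I|A), p(A)) pairs for three candidate annotation word groups.
candidates = [(0.02, 0.30), (0.05, 0.10), (0.01, 0.60)]
lam1, lam2 = 0.7, 0.3  # illustrative weights

# Formula (7): argmax of the weighted product
best_prod = max(range(len(candidates)),
                key=lambda i: candidates[i][0] ** lam1 * candidates[i][1] ** lam2)
# Formula (8): argmax of the weighted log sum
best_log = max(range(len(candidates)),
               key=lambda i: lam1 * math.log(candidates[i][0])
                             + lam2 * math.log(candidates[i][1]))

assert best_prod == best_log  # log is monotonic, so the argmax is unchanged
print(best_prod)  # -> 1
```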
The technical solution of the present invention is illustrated below, taking a training set T comprising l images and an image I to be annotated as an example. The l images of T constitute the image collection P = [p1 p2 … pl]; each image of T is annotated with n annotation words, and all annotation words in T constitute the annotation word set W = [w1 w2 … ws]; each image of T has corresponding visual lemmas, and all visual lemmas in T constitute the visual lemma set B = [b1 b2 … by].
Fig. 1 is the flow chart of this embodiment. As shown in Fig. 1, it comprises the following steps:
Step 101: Perform image preprocessing and region classification on the image I to be annotated.
In this step, the image I to be annotated is preprocessed (image segmentation, feature extraction, feature-value normalization, etc.), and then image-block region classification is performed: each image-block region is classified with a clustering algorithm, and the visual content of the image is represented by a combination of visual lemmas: I = {i1 i2 … if}. The method for obtaining visual lemmas is prior art and is not described in detail here.
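The region-to-visual-lemma step above can be sketched as a nearest-centroid assignment against a codebook learned offline (e.g. by k-means clustering, as the text suggests). The function name, the codebook, and the feature values below are invented for illustration; they are not part of the patent:

```python
import math

def assign_visual_lemmas(region_features, codebook):
    """Map each segmented region's feature vector to the index of the
    nearest codebook centroid, i.e. its visual lemma (Euclidean distance)."""
    lemmas = []
    for feat in region_features:
        dists = [math.dist(feat, centroid) for centroid in codebook]
        lemmas.append(dists.index(min(dists)))
    return lemmas

# Toy example: 2-D region features and a 3-entry codebook (hypothetical values)
codebook = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
regions = [(0.1, 0.2), (0.9, 1.1), (0.1, 0.9)]
print(assign_visual_lemmas(regions, codebook))  # -> [0, 1, 2]
```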
Step 102: Compute p(A) with the semantic-similarity language model.
To introduce correlation information between annotation words into their similarity, the present invention uses a semantic vector model to represent each annotation word w. Let the context association word set be C = [c1 c2 … cm], where each element ci is a context association word and there are m in total; all annotation words of the annotation word set W in T can be chosen as context association words, i.e. C = W. Each annotation word w is represented by a vector of its associated context words, i.e. w = <v1, v2, …, vm>, where each semantic component vi is defined as the ratio of the conditional probability of context association word ci given w to the probability of ci:
vi = p(ci|w)/p(ci)   (9)
where p(ci) is the overall distribution probability of context association word ci, taken to be uniform. The conditional probability p(ci|w) is the ratio of the number of co-occurrences of ci and w in the annotations of the image collection P in T to the total number of occurrences of w in those annotations:
p(ci|w) = count(ci, w)/count(w)   (10)
p(ci|w) expresses the strength of co-occurrence of word w with each context association word; dividing by the overall probability of each context association word prevents the semantic vector w = <v1, v2, …, vm> from being dominated by high-frequency context words, since high-frequency associated words also tend to have large conditional probabilities. As shown in Table 1, "sky", "sun", "clouds", and "town" represent a group of context association words, while "tree", "building", and "river" are a group of annotation words; the semantic vectors of the annotation words are shown in Table 1.
Table 1

            sky     sun     clouds  town
tree        2.56    0.91    0.74    0.63
building    5.01    0.57    2.41    21.19
river       2.57    2.57    1.12    5.72
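Formulas (9) and (10) can be sketched as follows. The toy annotation sets are invented, p(ci) is taken as uniform as in the text, and the resulting numbers do not reproduce Table 1:

```python
# Toy annotations: each training image's annotation word set (hypothetical data).
annotations = [
    {"tree", "sky", "sun"},
    {"tree", "sky", "clouds"},
    {"building", "sky", "town"},
    {"river", "sun", "clouds"},
]
context_words = ["sky", "sun", "clouds", "town"]  # the set C of formula (9)
p_c = 1.0 / len(context_words)                    # p(ci): uniform, as in the text

def count_w(w):
    """count(w): number of annotated images whose annotation contains w."""
    return sum(1 for a in annotations if w in a)

def count_cw(c, w):
    """count(ci, w): co-occurrences of ci and w within the same annotation."""
    return sum(1 for a in annotations if c in a and w in a)

def semantic_vector(w):
    """Formulas (9)-(10): v_i = (count(ci, w)/count(w)) / p(ci)."""
    return [count_cw(c, w) / count_w(w) / p_c for c in context_words]

print(semantic_vector("tree"))  # -> [4.0, 2.0, 2.0, 0.0]
```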
The semantic similarity between annotation words is then computed, as shown in formula (11):
sim(wi, wj) = (wi · wj)/(||wi|| · ||wj||)   (11)
where ||·|| denotes the vector norm. The dot product wi · wj is computed as shown in formula (12):

wi · wj = Σ_{k=1}^{m} v_{wi,k} · v_{wj,k} = Σ_{k=1}^{m} (p(ck|wi)/p(ck)) · (p(ck|wj)/p(ck))   (12)
where ck is a context association word. The semantic similarities between annotation words are shown in Table 2. The similarity value ranges from 0 to 1; the higher the value, the higher the similarity between two annotation words and the greater the probability that they appear in the same image.
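A sketch of formulas (11) and (12), using the "tree" and "river" rows of Table 1 as the semantic vectors (the implementation itself is a generic cosine similarity, not code from the patent):

```python
import math

def cosine_sim(wi, wj):
    """Formula (11): sim(wi, wj) = (wi . wj) / (||wi|| * ||wj||)."""
    dot = sum(a * b for a, b in zip(wi, wj))          # formula (12)
    norm_i = math.sqrt(sum(a * a for a in wi))
    norm_j = math.sqrt(sum(b * b for b in wj))
    return dot / (norm_i * norm_j)

# Semantic vectors for "tree" and "river" taken from Table 1 above
tree = [2.56, 0.91, 0.74, 0.63]
river = [2.57, 2.57, 1.12, 5.72]
print(round(cosine_sim(tree, river), 3))
```

Because the semantic components are non-negative, the similarity always falls in [0, 1], matching the range described above.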
Table 2
Assuming that within the same annotation the annotation words are semantically related to the context association words, the probability p(A) of a group of annotation words A = {w1, w2, …, wn} can be obtained by computing the similarity between each annotation word and the other annotation words:
p(A) ∝ (1/(n − 1)) Σ_{wi∈A} Σ_{wj∈A, j≠i} sim(wi, wj)   (13)
Substituting formulas (10), (11), and (12) into formula (13) yields the annotation word group probability p(A):
p(A) ∝ (1/(n − 1)) Σ_{wi∈A} Σ_{wj∈A, j≠i} [Σ_{k=1}^{m} (count(ck, wi)/(count(wi) · p(ck))) · (count(ck, wj)/(count(wj) · p(ck)))] / (||wi|| · ||wj||)   (14)
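Formula (13) can be sketched as the sum of pairwise cosine similarities within the group, scaled by 1/(n − 1) and with the proportionality constant ignored. The three semantic vectors are the rows of Table 1; the grouping itself is hypothetical:

```python
import math

def cosine_sim(wi, wj):
    """Formula (11)/(12): cosine similarity of two semantic vectors."""
    dot = sum(a * b for a, b in zip(wi, wj))
    return dot / (math.sqrt(sum(a * a for a in wi))
                  * math.sqrt(sum(b * b for b in wj)))

def group_prior(vectors):
    """Formula (13): p(A) up to a proportionality constant."""
    n = len(vectors)
    total = sum(cosine_sim(vectors[i], vectors[j])
                for i in range(n) for j in range(n) if j != i)
    return total / (n - 1)

# Semantic vectors from Table 1 (a hypothetical annotation word group)
group = [[2.56, 0.91, 0.74, 0.63],   # tree
         [5.01, 0.57, 2.41, 21.19],  # building
         [2.57, 2.57, 1.12, 5.72]]   # river
print(group_prior(group))
```

Since each pairwise similarity lies in [0, 1], the score of a group of n words lies in [0, n]; a semantically coherent group scores higher than an incoherent one.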
Step 103: Compute p(I|A) with the cross-media relevance model.
In this step, the conditional probability p(I|wi) is first computed according to the formula p(I|wi) = p(wi, b1, …, bn)/p(wi), where:
The computation of p(wi) is as follows: the prior probability p(wi) of word wi is represented by the ratio of the number of times wi occurs in training set T to the total occurrences of all annotation words:

p(wi) = |wi| / Σ_{wk∈T} |wk|   (15)
The computation of p(wi, b1, …, bn) is:

p(wi, b1, b2, …, bn) = Σ_{J∈T} p(J) p(wi|J) Π_{k=1}^{n} p(bk|J)   (16)
where p(J) is the probability of randomly drawing a training image J from the image collection P, generally assumed to be uniformly distributed; p(wi|J) is the posterior probability that word wi occurs in training image J; and p(bk|J) is the posterior probability that visual lemma bk occurs in J. Each probability is estimated as follows:
p(wi|J) = (1 − αJ) · #(wi, J)/|J| + αJ · #(wi, T)/|T|   (17)

p(bk|J) = (1 − βJ) · #(bk, J)/|J| + βJ · #(bk, T)/|T|   (18)

where αJ and βJ are smoothing parameters set empirically; #(wi, J) indicates whether annotation word wi occurs in training image J: #(wi, J) = 1 if so, otherwise #(wi, J) = 0; #(wi, T) indicates whether wi occurs in training set T: #(wi, T) = 1 if so, otherwise #(wi, T) = 0; #(bk, J) indicates whether visual lemma bk occurs in training image J: #(bk, J) = 1 if so, otherwise #(bk, J) = 0; |J| is the total number of annotation words and visual lemmas in training image J; |T| is the total number of annotation words and visual lemmas in training set T.
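Formulas (17) and (18) share one smoothing form, interpolating between the image-level and training-set-level occurrence indicators. A minimal sketch with invented counts; the value αJ = 0.1 is illustrative, not a parameter from the patent:

```python
def smoothed_prob(occurs_in_image, occurs_in_trainset,
                  size_image, size_trainset, alpha):
    """Formulas (17)/(18): interpolate the 0/1 occurrence indicator for a
    single training image with the 0/1 indicator for the whole training set."""
    return ((1 - alpha) * occurs_in_image / size_image
            + alpha * occurs_in_trainset / size_trainset)

# Hypothetical numbers: word present in image J (|J| = 12) and in T (|T| = 500)
p_w_given_j = smoothed_prob(1, 1, 12, 500, alpha=0.1)
# Word absent from J but present somewhere in T: only the smoothing term survives
p_unseen = smoothed_prob(0, 1, 12, 500, alpha=0.1)
print(p_w_given_j, p_unseen)
```

The smoothing term keeps the product in formula (16) from collapsing to zero when a word or visual lemma is missing from one training image.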
p(I|A) can then be approximately evaluated as p(I|A) ≈ Π_{wi∈A} p(I|wi).
Step 104: Compute the annotation word group of the image I to be annotated.
With p(A) and p(I|A) solved separately above, the annotation word group A for an image I can be computed according to A = arg max_A (λ1 log p(I|A) + λ2 log p(A)).
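Step 104 can be sketched end to end as an argmax over candidate annotation word groups. All probabilities and weights below are invented placeholders standing in for the values steps 102 and 103 would produce:

```python
import math

# Hypothetical precomputed scores for three candidate annotation word groups A:
# p(I|A) would come from the relevance model (formula 16 products),
# p(A) from the semantic-similarity language model (formula 14).
candidates = {
    ("sky", "sun", "clouds"): (2.0e-6, 0.55),
    ("sky", "sun", "town"):   (3.0e-6, 0.20),
    ("tree", "river", "sky"): (1.0e-6, 0.60),
}
lam1, lam2 = 0.6, 0.4  # illustrative weights, fixed at training time

def score(p_i_given_a, p_a):
    """Formula (8): lambda1 * log p(I|A) + lambda2 * log p(A)."""
    return lam1 * math.log(p_i_given_a) + lam2 * math.log(p_a)

best = max(candidates, key=lambda a: score(*candidates[a]))
print(best)  # -> ('sky', 'sun', 'clouds')
```

Note how the language-model term p(A) lets a semantically coherent group win even when its relevance-model score is not the largest.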
The foregoing are only preferred embodiments of the present invention and are not intended to limit it; any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall be included within its scope of protection.

Claims (2)

1. An automatic image annotation method based on word correlation, characterised in that a training set T comprises l images, the l images constituting an image collection P = [p1 p2 … pl]; each image of the training set T is annotated with n annotation words, and all annotation words in T constitute an annotation word set W = [w1 w2 … ws]; each image of T has corresponding visual lemmas, and all visual lemmas in T constitute a visual lemma set B = [b1 b2 … by]; the image to be annotated is I; the method includes:
A. Compute the semantic vector of each annotation word w in T according to the formula vi = p(ci|w)/p(ci), representing w in vector form w = <v1, v2, …, vm>, where ci is a context association word and there are m context association words in total; p(ci) is the overall distribution probability of ci, and p(ci|w) is the ratio of the number of co-occurrences of ci and w in T to the total number of occurrences of w in T, i.e. p(ci|w) = count(ci, w)/count(w); the context association words are the annotation words in T.
B. Compute the semantic similarity between annotation words according to the formula sim(wi, wj) = (wi · wj)/(||wi|| · ||wj||), where ||·|| denotes the vector norm and wi · wj the dot product.
C. Compute p(A) according to the formula p(A) ∝ (1/(n − 1)) Σ_{wi∈A} Σ_{wj∈A, j≠i} sim(wi, wj), where A is an annotation word group {w1, w2, … wn} and n is the number of annotation words in the group.
D. Compute the conditional probability p(I|wi) according to the formula p(I|wi) = p(wi, b1, …, bn)/p(wi), where p(wi) is the ratio of the number of occurrences of wi in T to the total occurrences of all annotation words in T, i.e. p(wi) = |wi|/Σ_{wk∈T} |wk|.
The computation of p(wi, b1, …, bn) is p(wi, b1, b2, …, bn) = Σ_{J∈T} p(J) p(wi|J) Π_{k=1}^{n} p(bk|J), where p(J) is the probability of randomly drawing a training image J from the image collection P, p(wi|J) is the posterior probability that word wi occurs in training image J, and p(bk|J) is the posterior probability that visual lemma bk occurs in J.
E. Compute p(I|A) according to p(I|A) ≈ Π_{wi∈A} p(I|wi).
F. Compute the annotation word group A of the image I to be annotated by the formula A = arg max_A p(I|A) p(A).
2. The method according to claim 1, characterised in that the computations of p(wi|J) and p(bk|J) in step D are respectively:

p(wi|J) = (1 − αJ) · #(wi, J)/|J| + αJ · #(wi, T)/|T|   (1)

p(bk|J) = (1 − βJ) · #(bk, J)/|J| + βJ · #(bk, T)/|T|   (2)

where αJ and βJ are smoothing parameters set empirically;
#(wi, J) indicates whether annotation word wi occurs in training image J: #(wi, J) = 1 if so, otherwise #(wi, J) = 0;
#(wi, T) indicates whether wi occurs in training set T: #(wi, T) = 1 if so, otherwise #(wi, T) = 0;
#(bk, J) indicates whether visual lemma bk occurs in training image J: #(bk, J) = 1 if so, otherwise #(bk, J) = 0;
|J| is the total number of annotation words and visual lemmas in training image J; |T| is the total number of annotation words and visual lemmas in training set T.
CN201410008553.1A 2014-01-08 2014-01-08 Automatic image marking method based on word correlation Active CN103714178B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410008553.1A CN103714178B (en) 2014-01-08 2014-01-08 Automatic image marking method based on word correlation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410008553.1A CN103714178B (en) 2014-01-08 2014-01-08 Automatic image marking method based on word correlation

Publications (2)

Publication Number Publication Date
CN103714178A CN103714178A (en) 2014-04-09
CN103714178B true CN103714178B (en) 2017-01-25

Family

ID=50407153

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410008553.1A Active CN103714178B (en) 2014-01-08 2014-01-08 Automatic image marking method based on word correlation

Country Status (1)

Country Link
CN (1) CN103714178B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104794183A (en) * 2015-04-10 2015-07-22 浙江大学 Picture labeling method based on multiple views and multiple labels
WO2017137439A1 (en) * 2016-02-08 2017-08-17 Koninklijke Philips N.V. Device for and method of determining clusters
CN108268875B (en) * 2016-12-30 2020-12-08 广东精点数据科技股份有限公司 Image semantic automatic labeling method and device based on data smoothing
CN110162644B (en) 2018-10-10 2022-12-20 腾讯科技(深圳)有限公司 Image set establishing method, device and storage medium

Citations (5)

Publication number Priority date Publication date Assignee Title
CN1920820A (en) * 2006-09-14 2007-02-28 浙江大学 Image meaning automatic marking method based on marking significance sequence
CN101620615A (en) * 2009-08-04 2010-01-06 西南交通大学 Automatic image annotation and translation method based on decision tree learning
CN101685464A (en) * 2009-06-18 2010-03-31 浙江大学 Method for automatically labeling images based on community potential subject excavation
CN102298606A (en) * 2011-06-01 2011-12-28 清华大学 Random walking image automatic annotation method and device based on label graph model
CN102542067A (en) * 2012-01-06 2012-07-04 上海交通大学 Automatic image semantic annotation method based on scale learning and correlated label dissemination

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
CN101587478B (en) * 2008-05-20 2013-07-24 株式会社理光 Methods and devices for training, automatically labeling and searching images

Patent Citations (5)

Publication number Priority date Publication date Assignee Title
CN1920820A (en) * 2006-09-14 2007-02-28 浙江大学 Image meaning automatic marking method based on marking significance sequence
CN101685464A (en) * 2009-06-18 2010-03-31 浙江大学 Method for automatically labeling images based on community potential subject excavation
CN101620615A (en) * 2009-08-04 2010-01-06 西南交通大学 Automatic image annotation and translation method based on decision tree learning
CN102298606A (en) * 2011-06-01 2011-12-28 清华大学 Random walking image automatic annotation method and device based on label graph model
CN102542067A (en) * 2012-01-06 2012-07-04 上海交通大学 Automatic image semantic annotation method based on scale learning and correlated label dissemination

Non-Patent Citations (3)

Title
"Hidden-Concept Driven Multilabel Image Annotation and Label Ranking"; Bing-Kun Bao et al.; IEEE Transactions on Multimedia; Feb. 2012; vol. 14, no. 1; pp. 199-210 *
"Latent Semantic Analysis-based Image Auto Annotation"; Mahdia Bakalem et al.; IEEE Conf. on Machine and Web Intelligence; 2010; pp. 460-463 *
"词间相关性的CMRM图像标注方法" [CMRM image annotation method based on inter-word correlation]; Liu Yongmei et al.; 《智能系统学报》 [CAAI Transactions on Intelligent Systems]; Aug. 2011; vol. 6, no. 4; pp. 350-354 *

Also Published As

Publication number Publication date
CN103714178A (en) 2014-04-09

Similar Documents

Publication Publication Date Title
CN103617157B (en) Based on semantic Text similarity computing method
Pang et al. Text matching as image recognition
WO2018010365A1 (en) Cross-media search method
CN105005553B (en) Short text Sentiment orientation analysis method based on sentiment dictionary
CN104391942B (en) Short essay eigen extended method based on semantic collection of illustrative plates
CN106294593B (en) In conjunction with the Relation extraction method of subordinate clause grade remote supervisory and semi-supervised integrated study
CN102902821B (en) The image high-level semantics mark of much-talked-about topic Network Based, search method and device
CN103218444B (en) Based on semantic method of Tibetan language webpage text classification
CN104199857B (en) A kind of tax document hierarchy classification method based on multi-tag classification
CN103678564B (en) Internet product research system based on data mining
CN104794169B (en) A kind of subject terminology extraction method and system based on sequence labelling model
CN104881458B (en) A kind of mask method and device of Web page subject
CN104778161A (en) Keyword extracting method based on Word2Vec and Query log
CN103823859B (en) Name recognition algorithm based on combination of decision-making tree rules and multiple statistic models
CN106202256A (en) Propagate based on semanteme and mix the Web graph of multi-instance learning as search method
CN105320960A (en) Voting based classification method for cross-language subjective and objective sentiments
CN109492105B (en) Text emotion classification method based on multi-feature ensemble learning
CN106095829A (en) Cross-media retrieval method based on degree of depth study with the study of concordance expression of space
CN102637192A (en) Method for answering with natural language
Afzal et al. Mayonlp at semeval-2016 task 1: Semantic textual similarity based on lexical semantic net and deep learning semantic model
CN103714178B (en) Automatic image marking method based on word correlation
CN106294863A (en) A kind of abstract method for mass text fast understanding
CN105843799B (en) A kind of academic paper label recommendation method based on multi-source heterogeneous information graph model
CN102495865A (en) Image annotation method combined with image internal space relation and visual symbiosis relation
CN108363725A (en) A kind of method of the extraction of user comment viewpoint and the generation of viewpoint label

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant