CN102222239A - Labelling image scene clustering method based on vision and labelling character related information - Google Patents

Labelling image scene clustering method based on vision and labelling character related information Download PDF

Info

Publication number
CN102222239A
Authority
CN
China
Prior art keywords
image
vision
emd
scene
labelling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2011101487603A
Other languages
Chinese (zh)
Other versions
CN102222239B (en)
Inventor
刘咏梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University filed Critical Harbin Engineering University
Priority to CN201110148760.3A priority Critical patent/CN102222239B/en
Publication of CN102222239A publication Critical patent/CN102222239A/en
Application granted granted Critical
Publication of CN102222239B publication Critical patent/CN102222239B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Analysis (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The invention provides an annotated-image scene clustering method based on correlated visual and annotation-word information. The method comprises the following steps: segmenting the training images and the test images respectively with the NCut (Normalized Cut) image segmentation algorithm; constructing a visual nearest-neighbor graph G=(V, E) over all training images {J1, ..., Jl} ∈ Ctrain used for learning, wherein each image in the training set has a group of initial normalized annotation-word weight vectors; propagating the annotation words of each training image among its visual nearest neighbors, each receiving image accepting them according to the normalized EMD (Earth Mover's Distance) between the images; for each training image, renormalizing the accumulated annotation-word weights; after the visual features of an image have been converted into a group of weighted annotation words, performing scene semantic clustering with a PLSA (Probabilistic Latent Semantic Analysis) model; learning the visual space of each scene semantic with a Gaussian mixture model; and performing scene classification with the visual features. The invention improves the coupling precision between the visual features of an image and its annotation words, and can be directly used for automatic semantic annotation of images.

Description

Annotated-image scene clustering method based on correlated visual and annotation-word information
Technical field
The present invention relates to an image processing method, and in particular to a method for automatically classifying the scene of an image to be analyzed.
Background technology
In image understanding tasks such as automatic semantic annotation of images, classifying unannotated images by their visual features requires that the semantic scene classes remain consistent in their visual distribution. On the one hand, the semantic content an image can express is very rich; placed in different contexts, the same image may present different aspects of information. On the other hand, owing to their limited descriptive power, visual features suffer from notable semantic ambiguity, and visually similar images cannot guarantee consistent semantic content.
As a simple and efficient way of describing the high-level semantic content of an image, annotation words provide a large number of reliable learning samples for discovering the correlation between image annotation words and visual content. However, the inherent ambiguity of annotation words (polysemy, synonymy) also limits the image clustering performance achievable from annotation-word information alone.
Summary of the invention
The object of the present invention is to provide an annotated-image scene clustering method based on correlated visual and annotation-word information that improves the coupling precision between the visual features of an image and its annotation words, and can be directly used for automatic semantic annotation of images.
The object of the present invention is achieved as follows:
Step 1: segment the training images and the test images respectively with the NCut (Normalized Cut) image segmentation algorithm to obtain visual descriptions of the image regions;
Step 2: construct the visual nearest-neighbor graph G=(V, E) of all training images {J1, ..., Jl} ∈ Ctrain used for learning; each vertex of the vertex set V corresponds to one image, and the edge set E represents the visual distances between images; the visual distance between two images adopts the Earth Mover's Distance (EMD), a similarity measure based on integrated multi-region matching, and the weight of the edge connecting two vertices is the EMD visual distance between the corresponding images;
Step 3: in the training image set, each image has a group of initial normalized annotation-word weight vectors;
Step 4: propagate the annotation words of each training image among its visual nearest neighbors; the receiving images accept them according to the normalized EMD distance between the images; the EMD distance is normalized as

Emd_nor = e^(-Emd/δ)

wherein Emd represents the EMD distance between two images, Emd_nor denotes the normalized EMD distance, and δ is an empirical parameter, taken as the variance of the EMD distances over the training image set;
Step 5: for each training image, renormalize the accumulated annotation-word weights; the initial annotation-word weight vector is normalized by counting the frequency with which each annotation word occurs in the image;
Step 6: after the visual features of an image have been converted into a group of weighted annotation words, perform scene semantic clustering with the PLSA model;
Step 7: learn the visual space of each scene semantic with a Gaussian mixture model (GMM);
Step 8: for a test image, perform scene classification with its visual features, and directly use the annotation words obtained for that scene semantic.
To relieve the overly complicated relationship between vision and annotation words, and to reduce annotation-word ambiguity, the present invention attempts to build a description from image pixels up to local regions representing surface materials, and then to transition from the scene semantic classes of images to the annotation-word distribution representing their high-level semantic content, forming a multi-level connection between low-level visual features and annotation words. In establishing this multi-level connection, the present invention makes full use of both the annotation words and the visual features of the training images, preserves the high-level semantic consistency of the image clusters, and solves the problem of weighting visual versus semantic information in the clustering process in a natural and effective manner.
Given the good performance of Probabilistic Latent Semantic Analysis (PLSA) within the bag-of-words model for automatic extraction of semantic classes, we adopt the PLSA model to extract the scene semantics of images. The present method, however, differs from applying PLSA clustering to the learning image set using visual features alone: because the objects of study are annotated images, which carry annotation-word information in addition to visual features, and because annotation words are particularly important in semantic clustering, the present invention combines both visual and annotation-word information in the PLSA scene clustering, increasing the soundness of the PLSA model for scene clustering.
The present invention exploits the correlation between visual features and annotation words by converting the visual features of an image into a group of weighted annotation words, where each weight represents the degree of correlation between the visual features and the annotation word. An innovative reliability-propagation scheme is adopted: the annotation words of each image are propagated to its visually adjacent images, the amount of information propagated is determined by the visual difference between adjacent images, and the images accepting the annotation words receive them according to the correlation between the annotation words. Annotation words thus accumulate within visually similar images, converting the visual features into a group of weights representing the degree of correlation between the image and each annotation word. The benefits of this approach are that it settles the weighting of annotation words against visual information in the clustering process, alleviates the sparsity of the annotation-word distribution, and allows the PLSA model to extract the scene semantic classes of the images in a more natural and reasonable manner.
Description of drawings
Fig. 1 is the flow chart of the annotated-image scene clustering method based on correlated visual and annotation-word information.
Embodiment
The present invention is described in more detail below with reference to the accompanying drawing:
Step 1: segment the training images (the annotated images used for learning) and the test images respectively with the NCut (Normalized Cut) image segmentation algorithm to obtain visual descriptions of the image regions.
Step 2: construct the visual nearest-neighbor graph G=(V, E) of all training images {J1, ..., Jl} ∈ Ctrain. Each vertex of the vertex set V corresponds to one image, and the edge set E represents the visual distances between images. For the visual distance between two images we adopt the Earth Mover's Distance (EMD), a similarity measure based on integrated multi-region matching; the weight of the edge connecting two vertices is the EMD visual distance between the corresponding images.
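Step 2 can be sketched in code. The patent's EMD operates on multi-region image descriptions; as a simplified, hypothetical stand-in, the sketch below computes a 1-D EMD between normalized feature histograms (the L1 distance between their cumulative distributions) and builds a brute-force visual nearest-neighbor graph from the pairwise distances:

```python
# Simplified sketch of step 2: 1-D EMD between normalized histograms and a
# k-nearest-neighbor graph. The histograms and k are illustrative only; the
# patent uses a multi-region EMD over segmented image descriptions.
from itertools import accumulate

def emd_1d(h1, h2):
    """EMD between two normalized 1-D histograms with the same bin count:
    the L1 distance between their cumulative distributions."""
    c1, c2 = list(accumulate(h1)), list(accumulate(h2))
    return sum(abs(a - b) for a, b in zip(c1, c2))

def knn_graph(histograms, k=2):
    """Edges (i, j, emd) linking each image to its k visually nearest neighbors."""
    edges = []
    for i, hi in enumerate(histograms):
        dists = sorted(
            (emd_1d(hi, hj), j) for j, hj in enumerate(histograms) if j != i
        )
        edges.extend((i, j, d) for d, j in dists[:k])
    return edges

# Three toy "images": the first two are visually close, the third is distant.
hists = [[0.5, 0.5, 0.0], [0.4, 0.6, 0.0], [0.0, 0.0, 1.0]]
edges = knn_graph(hists, k=1)
```

The edge weights produced here are the raw EMD values; step 4 below maps them to propagation confidences.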
Step 3: in the training image set, each image has a group of initial normalized annotation-word weight vectors. The initial annotation-word weight vector is normalized by counting the frequency with which each annotation word occurs in the image.
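Under the reading above, the initial weight vector of step 3 is just the normalized occurrence frequency of each annotation word attached to the image. A minimal sketch (tag lists are illustrative):

```python
# Sketch of step 3: initial normalized annotation-word weight vector,
# computed as the relative frequency of each word among the image's tags.
from collections import Counter

def initial_tag_weights(tags):
    """Map each annotation word of one image to its normalized frequency."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

weights = initial_tag_weights(["sky", "sea", "sky", "beach"])
```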
Step 4: propagate the annotation words of each training image among its visual nearest neighbors; the receiving images accept them according to the normalized EMD distance between the images. The EMD distance is normalized as in formula (1):

Emd_nor = e^(-Emd/δ)    (1)

wherein Emd represents the EMD distance between two images, Emd_nor denotes the normalized EMD distance, and δ is an empirical parameter, taken as the variance of the EMD distances over the training image set. During the propagation of the annotation words representing semantic classes, each image does not passively receive the words; rather, it uses the visual distance between the receiving and propagating images as the confidence of the propagation, an active reception scheme.
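Formula (1) maps an EMD value to a confidence in (0, 1]: identical images get weight 1 and the weight decays exponentially with visual distance. A direct sketch, with δ taken as the variance of the training-set EMD values as the text specifies (the sample EMD values are illustrative):

```python
# Sketch of formula (1): Emd_nor = exp(-Emd / delta), with delta set to the
# variance of the EMD distances over the training image set.
import math
from statistics import pvariance

def emd_nor(emd, delta):
    """Normalized EMD confidence: 1 for identical images, decaying toward 0."""
    return math.exp(-emd / delta)

train_emds = [0.1, 0.4, 0.9, 1.4, 1.5]  # illustrative pairwise EMD values
delta = pvariance(train_emds)
w = emd_nor(0.4, delta)
```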
Step 5: for each training image, renormalize the accumulated annotation-word weights, i.e. divide by the sum of the normalized EMD distances between this image and every image (including the image itself), the normalized EMD distance of an image to itself being specified as 1.
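Steps 4 and 5 together can be sketched as follows, under an assumed reading of the text: each image accumulates its neighbors' tag weights scaled by the normalized EMD confidence, then divides by the sum of those confidences, with the image's own confidence fixed at 1:

```python
# Sketch of steps 4-5 (assumed interpretation): weighted accumulation of
# propagated annotation words followed by renormalization by the total
# confidence mass, including the image's self-confidence of 1.
def propagate_tags(own_tags, neighbor_tags, neighbor_conf):
    """own_tags: {word: weight} of this image;
    neighbor_tags: list of {word: weight}, one dict per visual neighbor;
    neighbor_conf: normalized EMD confidence exp(-Emd/delta) per neighbor."""
    acc = dict(own_tags)  # self-confidence is specified as 1
    for tags, conf in zip(neighbor_tags, neighbor_conf):
        for word, weight in tags.items():
            acc[word] = acc.get(word, 0.0) + conf * weight
    total_conf = 1.0 + sum(neighbor_conf)  # includes the image itself
    return {word: w / total_conf for word, w in acc.items()}

out = propagate_tags({"sky": 1.0}, [{"sky": 0.5, "sea": 0.5}], [0.5])
```

Note how a word absent from the image ("sea") gains a nonzero weight through a visually similar neighbor, which is exactly how the method alleviates annotation sparsity.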
Step 6: after the visual features of an image have been converted into a group of weighted annotation words, perform scene semantic clustering with the PLSA model. For the problem of choosing the number of clusters in this model, this patent proposes a solution: first choose a relatively large number of clusters, so that the frequency of the annotation words appearing in each cluster highlights the semantic emphasis of that scene class; then judge the degree of semantic similarity from the annotation-word information in the clustering results and merge clusters that are semantically and visually consistent, which effectively solves the problem of choosing the number of clusters.
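The PLSA step can be sketched with the standard EM updates for P(z|d) and P(w|z); this is not the patent's code, just a minimal pure-Python PLSA run on a tiny weighted document-word matrix (here, images as documents and weighted annotation words as the vocabulary):

```python
# Minimal PLSA sketch for step 6. E-step: P(z|d,w) ∝ P(z|d)P(w|z);
# M-step: reweight P(z|d) and P(w|z) by the expected counts.
import random

def _normalize(v):
    s = sum(v)
    return [x / s for x in v]

def plsa(counts, n_topics, n_iter=50, seed=0):
    """counts: per-document word weights (rows of equal length).
    Returns (p_z_d, p_w_z): topic mixture per document, word dist per topic."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])
    p_z_d = [_normalize([rng.random() for _ in range(n_topics)]) for _ in range(n_docs)]
    p_w_z = [_normalize([rng.random() for _ in range(n_words)]) for _ in range(n_topics)]
    for _ in range(n_iter):
        new_zd = [[1e-12] * n_topics for _ in range(n_docs)]
        new_wz = [[1e-12] * n_words for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_words):
                if counts[d][w] == 0:
                    continue
                post = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]  # E-step
                s = sum(post) or 1e-12
                for z in range(n_topics):
                    r = counts[d][w] * post[z] / s
                    new_zd[d][z] += r  # M-step accumulators
                    new_wz[z][w] += r
        p_z_d = [_normalize(row) for row in new_zd]
        p_w_z = [_normalize(row) for row in new_wz]
    return p_z_d, p_w_z

# Four toy "images" over four annotation words, forming two scene groups.
docs = [[5, 5, 0, 0], [4, 6, 0, 0], [0, 0, 5, 5], [0, 0, 6, 4]]
p_z_d, p_w_z = plsa(docs, n_topics=2)
```

The patent's cluster-merging refinement (choosing a large topic count, then merging semantically and visually consistent clusters) would operate on the resulting p_w_z distributions and is not shown here.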
Step 7: learn the visual space of each scene semantic with a Gaussian mixture model (GMM).
Step 8: for a test image, perform scene classification with its visual features; the annotation words obtained for that scene semantic can then be used directly.
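Steps 7 and 8 can be sketched as follows. The patent learns a full Gaussian mixture per scene class; for brevity this sketch fits the one-component special case (a single Gaussian per scene) on 1-D visual features, which are illustrative, and classifies a test image by maximum likelihood:

```python
# Simplified sketch of steps 7-8: fit one Gaussian per scene class on 1-D
# visual features (the one-component special case of a GMM) and classify a
# test feature by the highest log-likelihood.
import math
from statistics import mean, pvariance

def fit_gaussian(samples):
    """Maximum-likelihood mean and variance (variance floored for stability)."""
    return mean(samples), max(pvariance(samples), 1e-6)

def log_likelihood(x, params):
    mu, var = params
    return -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)

def classify(x, scene_models):
    """Return the scene whose model gives x the highest likelihood."""
    return max(scene_models, key=lambda s: log_likelihood(x, scene_models[s]))

scene_models = {
    "beach": fit_gaussian([0.10, 0.20, 0.15, 0.12]),
    "forest": fit_gaussian([0.80, 0.90, 0.85, 0.95]),
}
label = classify(0.17, scene_models)
```

In the full method, the predicted scene's annotation words (from step 6) are then attached to the test image directly, which is what makes the method usable for automatic semantic annotation.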

Claims (3)

1. An annotated-image scene clustering method based on correlated visual and annotation-word information, characterized in that:
Step 1: segment the training images and the test images respectively with the NCut image segmentation algorithm to obtain visual descriptions of the image regions;
Step 2: construct the visual nearest-neighbor graph G=(V, E) of all training images {J1, ..., Jl} ∈ Ctrain; each vertex of the vertex set V corresponds to one image, and the edge set E represents the visual distances between images; the visual distance between two images adopts the EMD similarity measure based on integrated multi-region matching, and the weight of the edge connecting two vertices is the EMD visual distance between the corresponding images;
Step 3: in the training image set, each image has a group of initial normalized annotation-word weight vectors; the initial annotation-word weight vector is normalized by counting the frequency with which each annotation word occurs in the image;
Step 4: propagate the annotation words of each training image among its visual nearest neighbors; the receiving images accept them according to the normalized EMD distance between the images; the EMD distance is normalized as

Emd_nor = e^(-Emd/δ)

wherein Emd represents the EMD distance between two images, Emd_nor denotes the normalized EMD distance, and δ is an empirical parameter, taken as the variance of the EMD distances over the training image set;
Step 5: for each training image, renormalize the accumulated annotation-word weights;
Step 6: after the visual features of an image have been converted into a group of weighted annotation words, perform scene semantic clustering with the PLSA model;
Step 7: learn the visual space of each scene semantic with a Gaussian mixture model;
Step 8: for a test image, perform scene classification with its visual features, and directly use the annotation words obtained for that scene semantic.
2. The annotated-image scene clustering method based on correlated visual and annotation-word information according to claim 1, characterized in that: renormalizing the accumulated annotation-word weights comprises dividing by the sum of the normalized EMD distances between this image and every image, including the image itself, the normalized EMD distance of an image to itself being specified as 1.
3. The annotated-image scene clustering method based on correlated visual and annotation-word information according to claim 1 or 2, characterized in that: performing scene semantic clustering with the PLSA model comprises first choosing a relatively large number of clusters, then judging the degree of semantic similarity from the annotation-word information in the clustering results, and merging clusters that are semantically and visually consistent.
CN201110148760.3A 2011-06-03 2011-06-03 Labelling image scene clustering method based on vision and labelling character related information Expired - Fee Related CN102222239B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201110148760.3A CN102222239B (en) 2011-06-03 2011-06-03 Labelling image scene clustering method based on vision and labelling character related information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201110148760.3A CN102222239B (en) 2011-06-03 2011-06-03 Labelling image scene clustering method based on vision and labelling character related information

Publications (2)

Publication Number Publication Date
CN102222239A true CN102222239A (en) 2011-10-19
CN102222239B CN102222239B (en) 2014-03-26

Family

ID=44778786

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110148760.3A Expired - Fee Related CN102222239B (en) 2011-06-03 2011-06-03 Labelling image scene clustering method based on vision and labelling character related information

Country Status (1)

Country Link
CN (1) CN102222239B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102867192A (en) * 2012-09-04 2013-01-09 北京航空航天大学 Scene semantic shift method based on supervised geodesic propagation
CN102968620A (en) * 2012-11-16 2013-03-13 华中科技大学 Scene recognition method based on layered Gaussian hybrid model
CN103324940A (en) * 2013-05-02 2013-09-25 广东工业大学 Skin pathological image feature recognition method based on multi-example multi-label study
CN103377381A (en) * 2012-04-26 2013-10-30 富士通株式会社 Method and device for identifying content attribute of image
CN106469437A (en) * 2015-08-18 2017-03-01 联想(北京)有限公司 Image processing method and image processing apparatus
CN111061890A (en) * 2019-12-09 2020-04-24 腾讯云计算(北京)有限责任公司 Method for verifying labeling information, method and device for determining category
CN111191027A (en) * 2019-12-14 2020-05-22 上海电力大学 Generalized zero sample identification method based on Gaussian mixture distribution (VAE)
CN114898426A (en) * 2022-04-20 2022-08-12 国网智能电网研究院有限公司 Synonym label aggregation method, device, equipment and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108182443B (en) * 2016-12-08 2020-08-07 广东精点数据科技股份有限公司 Automatic image labeling method and device based on decision tree

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080059872A1 (en) * 2006-09-05 2008-03-06 National Cheng Kung University Video annotation method by integrating visual features and frequent patterns
CN101295360A (en) * 2008-05-07 2008-10-29 清华大学 Semi-supervision image classification method based on weighted graph

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080059872A1 (en) * 2006-09-05 2008-03-06 National Cheng Kung University Video annotation method by integrating visual features and frequent patterns
CN101295360A (en) * 2008-05-07 2008-10-29 清华大学 Semi-supervision image classification method based on weighted graph

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
P. Duygulu, K. Barnard, J. F. G. de Freitas, D. A. Forsyth: "Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary", Proceedings of the 7th European Conference on Computer Vision, Part IV, 2002, pp. 97-112 *
Yu Linsen, Zhang Tianwen: "Image clustering algorithm based on correlated visual and annotation information", Acta Electronica Sinica (《电子学报》) *
Wei Xinlu: "Research and application of methods for semantic annotation of natural images", China Master's Theses Full-text Database, Information Science and Technology series *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103377381B (en) * 2012-04-26 2016-09-28 富士通株式会社 The method and apparatus identifying the contents attribute of image
CN103377381A (en) * 2012-04-26 2013-10-30 富士通株式会社 Method and device for identifying content attribute of image
CN102867192B (en) * 2012-09-04 2016-01-06 北京航空航天大学 A kind of Scene Semantics moving method propagated based on supervision geodesic line
CN102867192A (en) * 2012-09-04 2013-01-09 北京航空航天大学 Scene semantic shift method based on supervised geodesic propagation
CN102968620B (en) * 2012-11-16 2015-05-20 华中科技大学 Scene recognition method based on layered Gaussian hybrid model
CN102968620A (en) * 2012-11-16 2013-03-13 华中科技大学 Scene recognition method based on layered Gaussian hybrid model
CN103324940A (en) * 2013-05-02 2013-09-25 广东工业大学 Skin pathological image feature recognition method based on multi-example multi-label study
CN106469437A (en) * 2015-08-18 2017-03-01 联想(北京)有限公司 Image processing method and image processing apparatus
CN106469437B (en) * 2015-08-18 2020-08-25 联想(北京)有限公司 Image processing method and image processing apparatus
CN111061890A (en) * 2019-12-09 2020-04-24 腾讯云计算(北京)有限责任公司 Method for verifying labeling information, method and device for determining category
CN111061890B (en) * 2019-12-09 2023-04-07 腾讯云计算(北京)有限责任公司 Method for verifying labeling information, method and device for determining category
CN111191027A (en) * 2019-12-14 2020-05-22 上海电力大学 Generalized zero sample identification method based on Gaussian mixture distribution (VAE)
CN111191027B (en) * 2019-12-14 2023-05-30 上海电力大学 Generalized zero sample identification method based on Gaussian mixture distribution (VAE)
CN114898426A (en) * 2022-04-20 2022-08-12 国网智能电网研究院有限公司 Synonym label aggregation method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN102222239B (en) 2014-03-26

Similar Documents

Publication Publication Date Title
CN102222239B (en) Labelling image scene clustering method based on vision and labelling character related information
CN101777059B (en) Method for extracting landmark scene abstract
CN110674407B (en) Hybrid recommendation method based on graph convolution neural network
CN101894275B (en) Weakly supervised method for classifying SAR images
CN101894134B (en) Spatial layout-based fishing webpage detection and implementation method
CN103268635B (en) The segmentation of a kind of geometric grid model of place and semanteme marking method
CN102663382B (en) Video image character recognition method based on submesh characteristic adaptive weighting
CN109710701A (en) A kind of automated construction method for public safety field big data knowledge mapping
CN101923653B (en) Multilevel content description-based image classification method
CN103678680B (en) Image classification method based on area-of-interest multi dimensional space relational model
CN109919159A (en) A kind of semantic segmentation optimization method and device for edge image
CN106101222A (en) The method for pushing of information and device
CN102663431A (en) Image matching calculation method on basis of region weighting
CN101751438A (en) Theme webpage filter system for driving self-adaption semantics
CN111539452B (en) Image recognition method and device for multi-task attribute, electronic equipment and storage medium
CN106599051A (en) Method for automatically annotating image on the basis of generation of image annotation library
CN110334578A (en) Image level marks the Weakly supervised method for automatically extracting high score remote sensing image building
CN103413142A (en) Remote sensing image land utilization scene classification method based on two-dimension wavelet decomposition and visual sense bag-of-word model
CN101398846A (en) Image, semantic and concept detection method based on partial color space characteristic
CN102142089A (en) Semantic binary tree-based image annotation method
CN109376352A (en) A kind of patent text modeling method based on word2vec and semantic similarity
CN104142995A (en) Social event recognition method based on visual attributes
CN104751175A (en) Multi-label scene classification method of SAR (Synthetic Aperture Radar) image based on incremental support vector machine
CN103810303A (en) Image search method and system based on focus object recognition and theme semantics
CN102129568A (en) Method for detecting image-based spam email by utilizing improved gauss hybrid model classifier

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20140326

Termination date: 20200603

CF01 Termination of patent right due to non-payment of annual fee