CN103942274B

CN103942274B - LDA-based annotation system and method for biomedical images

Info

Publication number: CN103942274B
Application number: CN201410120529.7A
Authority: CN
Inventors: 徐颂华; 林谋广; 姜涛; 薛凯军; 肖剑
Original assignee: Sun Yat Sen University; Institute of Dongguan of Sun Yat Sen University
Current assignee: Sun Yat Sen University; Institute of Dongguan of Sun Yat Sen University
Priority date: 2014-03-27
Filing date: 2014-03-27
Publication date: 2017-11-14
Anticipated expiration: 2034-03-27
Also published as: CN103942274A

Abstract

The invention discloses an LDA-based annotation system for biomedical images, comprising an LDA training module, a topic word extraction module, a topic word refinement module, a context sentence indexing module, a context generation module, and an annotation generation module. The LDA training module trains the LDA model; the topic word extraction module performs LDA modeling on the caption of an image and extracts topic words; the topic word refinement module optimizes the topic word set; the context sentence indexing module retrieves the sets of sentences associated with the topic words; the context generation module selects the most closely related sentences to form the context of the image; and the annotation generation module models the context of the image, computes a weight for each word, and selects the top-ranked words as the annotation words of the biomedical image. The invention also discloses an LDA-based annotation method for biomedical images. The invention can generate multiple annotation words at once with high accuracy, and related images can then be found by keyword search, which is fast, convenient, and closer to users' text retrieval habits.

Description

LDA-based annotation system and method for biomedical images

Technical field

The present invention relates to the technical field of image processing, and in particular to an LDA-based annotation system and method for biomedical images.

Background art

With the development of digital imaging technology and the growing popularity of camera-equipped devices such as digital cameras, the number of images of all kinds is growing exponentially. At the same time, the rapid development of the Internet has made distributing and sharing images faster and easier. To organize, query, and browse image collections of this scale effectively, image retrieval technology emerged and has become a research focus in the field of computer vision.

Existing image retrieval methods fall into two main categories: content-based image retrieval (CBIR) and text-based image retrieval (TBIR). Content-based image retrieval requires the user to provide an image as the query. The system extracts low-level visual features of the image, such as color, texture, and shape, builds a visual index for the image, and then finds matches according to the visual similarity between the images in the database and the query. Because low-level visual features are inconsistent with high-level semantic concepts, the so-called "semantic gap", the performance of content-based image retrieval is unsatisfactory. Text-based image retrieval requires a text index to be built for the images in advance. At query time the user only needs to submit text, and the system finds and returns similar images according to text relevance, so that image retrieval is converted into retrieval by text keywords.

Compared with content-based image retrieval, text-based image retrieval only requires the user to submit text keywords. It is fast and convenient, is preferred by users, and has therefore become the main approach of mainstream commercial image search engines. However, this approach requires a text index to be built for the images, that is, semantic annotation of the images, which is one of the most challenging tasks in text-based image retrieval and has become its most important problem. The traditional approach is manual annotation, but it is time-consuming and labor-intensive and clearly cannot cope with large-scale web image collections. It is therefore very important to eliminate manual intervention and achieve fast, effective automatic semantic annotation of images.

To annotate images automatically, one existing method in the prior art is to classify the images and use the classification results as annotations. Specifically, each semantic keyword is treated as a class label, a number of classifiers are trained on that basis, and these classifiers are then used to classify unannotated images; the classes assigned to an image are its annotations. Many mature classification algorithms exist, such as support vector machines and hidden Markov models.

However, annotating images by classification depends on the accuracy of the classification algorithm, and although current classification algorithms are fairly accurate, they still make errors. In addition, most existing classifiers, such as support vector machines, are binary classifiers, so an image with multiple annotations requires several classifiers to be designed and the image to be classified several times, which is inefficient.

It is therefore necessary to provide an LDA-based annotation system and method for biomedical images to meet this need.

Summary of the invention

An object of the invention is to provide an LDA-based annotation system and method for biomedical images that is accurate, convenient, and fast.

To this end, the invention provides an LDA-based annotation system for biomedical images, comprising an LDA training module, a topic word extraction module, a topic word refinement module, a context sentence indexing module, a context generation module, and an annotation generation module. The LDA training module trains the LDA model. The topic word extraction module performs LDA modeling on the caption of each biomedical image and then extracts all topic words from the resulting model. The topic word refinement module optimizes the topic word set produced by the topic word extraction module. The context sentence indexing module retrieves, from the text file of the biomedical image, the sets of sentences associated with the topic words. The context generation module selects the most closely related sentence from the sentence set of each topic word and then gathers all of these sentences to form the context of the biomedical image. The annotation generation module uses the LDA model obtained by the LDA training module to model the context of the biomedical image, obtaining its topic distribution and word distribution; it then multiplies the probability of each word in the topic-word distribution by the probability of the corresponding topic, takes the result as the weight of that word, sorts all words by weight in descending order, and selects the top several words as the annotation words of the biomedical image.

Preferably, the data set of the LDA model consists of the captions of all biomedical images: the caption in the caption node is extracted from the text file corresponding to each biomedical image, and the captions of all images together constitute the training data set of the LDA model.
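The caption extraction described above can be sketched as follows. This is only an illustration: the detailed description states that the text files are in XML format and that captions sit in a caption node, but the exact schema is not given, so the tag name `caption`, the function name `extract_caption`, and the sample document are assumptions.

```python
import xml.etree.ElementTree as ET


def extract_caption(xml_text, tag="caption"):
    """Return the text of every caption node in one XML text file.

    The tag name is an assumption; it is matched case-insensitively
    because the source writes it both as "caption" and "CAPTION".
    """
    root = ET.fromstring(xml_text)
    parts = []
    for node in root.iter():
        if node.tag.lower() == tag:
            # itertext() also collects text nested in child elements.
            text = " ".join("".join(node.itertext()).split())
            if text:
                parts.append(text)
    return " ".join(parts)


def build_training_set(xml_documents):
    """One tokenized caption per image; together they form the LDA corpus."""
    return [extract_caption(doc).lower().split() for doc in xml_documents]
```

A real corpus would likely need namespace handling and a proper tokenizer; whitespace splitting is used here only to keep the sketch short.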

Preferably, the training module trains the LDA model using Gibbs sampling: it first samples the topic assigned to each word, and then derives the document-topic distribution and the topic-word distribution from these assignments.

Preferably, the topic word refinement module optimizes the topic word set as follows: in the result of LDA modeling of the caption of the biomedical image, if the probability of a topic word in the topic-word distribution is zero, the word is removed from the topic word set; if a topic word does not appear in the caption of the biomedical image, the word is removed from the topic word set; and if the topic word set contains duplicate words, the duplicates are removed and only one copy is kept.

Preferably, the context sentence indexing module uses the Lucene search tool, taking each word in the topic word set as a query, to retrieve all sentences that contain that topic word.

Preferably, the most closely related sentence is selected as follows: each sentence containing a given topic word is traversed; if the sentence also contains other topic words, its vote count increases accordingly, with each other topic word contributing one vote; the sentence with the most votes is then selected as the most closely related sentence for that topic word; and the most closely related sentences of all topic words are gathered to form the context.

The invention also provides an LDA-based annotation method for biomedical images, comprising the following steps. Step 1: select a portion of the biomedical images to form a training set, and extract the caption in the caption node from the text file of each biomedical image to form the training data set of the LDA model. Step 2: train the LDA model, first sampling the topic assigned to each word and then computing the document-topic distribution and the topic-word distribution. Step 3: for an unannotated image, model it with the trained LDA model and select all topic words to form the topic word set. Step 4: optimize the topic word set by removing duplicate words, words whose probability is zero, and words that do not appear in the caption, thereby obtaining the refined topic word set. Step 5: for each topic word, retrieve all sentences containing the word from the text file of the image to form a sentence set, recorded as the sentence set of that topic word. Step 6: select the most closely related sentence from the sentence set of each topic word to form the context of the image. Step 7: model the context with the trained LDA model, then multiply the probability of each word in the topic-word distribution by the probability of the corresponding topic and take the result as the weight of the word; sort all words in descending order and select the top several as the final annotations of the image.

Compared with the prior art, the invention makes full use of the captions and text files associated with the images in the data set to mine annotation words for the images. Accuracy is high, and multiple annotation words can be generated at once. Once biomedical images have been accurately annotated, related images can be found by keyword search, which is fast, convenient, and closer to users' text retrieval habits.

Brief description of the drawings

To explain the technical solutions in the embodiments of the invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are obviously only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a structural diagram of the LDA-based annotation system for biomedical images according to the invention;

Fig. 2 is a flow chart of the LDA-based annotation method for biomedical images according to the invention;

Fig. 3 is a flow chart of the LDA-based annotation method for biomedical images according to an embodiment of the invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.

As described above, the invention annotates biomedical images. In a biomedical image corpus, every image has a corresponding text file. Exploiting this property, the invention proposes an annotation method for biomedical images based on LDA (Latent Dirichlet Allocation): LDA is used to extract topic words from the caption of an image; a context is then extracted from the text file corresponding to the image according to these topic words; finally, LDA is used again to model the context, and the resulting topic words serve as the final annotations of the biomedical image.

Specifically, with reference to Fig. 1, the invention provides an LDA-based annotation system for biomedical images, comprising an LDA training module, a topic word extraction module, a topic word refinement module, a context sentence indexing module, a context generation module, and an annotation generation module.

The LDA training module trains the LDA model on a training data set to generate the document-topic distribution and the topic-word distribution. In the invention, the data set of the LDA model consists of the captions of all biomedical images. The content of the caption node, that is, the caption of the image, is extracted from the text file (in XML format) corresponding to each biomedical image, and the captions of all images together constitute the training data set of the LDA model. The number of topics and the Dirichlet prior parameters of the document-topic distribution and the topic-word distribution are set to empirical values. The LDA training module trains the LDA model using Gibbs sampling: it first samples the topic assigned to each word, and then derives the document-topic distribution and the topic-word distribution from these assignments.
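The training procedure above can be illustrated with a minimal collapsed Gibbs sampler. This is a sketch under stated assumptions, not the patented implementation: the function name `train_lda_gibbs`, the default values of `alpha` and `beta` (the Dirichlet priors, which the source only says are set empirically), and the iteration count are all illustrative.

```python
import random


def train_lda_gibbs(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized captions (lists of words)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V, K = len(vocab), num_topics

    doc_topic = [[0] * K for _ in docs]       # tokens of doc d assigned to topic k
    topic_word = [[0] * V for _ in range(K)]  # times word w is assigned to topic k
    topic_total = [0] * K                     # tokens assigned to topic k overall
    assignments = []                          # current topic of every token

    # Start from a random topic for each token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            k = rng.randrange(K)
            z_doc.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_doc)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid = word_id[w]
                k = assignments[d][i]
                # Take the token out of the counts before resampling it.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta) / (topic_total[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1

    # Derive both distributions from the sampled word-topic assignments.
    theta = [
        [(doc_topic[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        {vocab[w]: (topic_word[k][w] + beta) / (topic_total[k] + V * beta)
         for w in range(V)}
        for k in range(K)
    ]
    return theta, phi
```

Here `theta` plays the role of the document-topic distribution and `phi` the topic-word distribution. A production system would more plausibly rely on an established LDA library; the loop is spelled out only to show the "sample assignments first, then derive the distributions" order described in the text.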

The topic word extraction module performs LDA modeling on the caption of each biomedical image and then extracts all topic words from the resulting model (topic distribution and word distribution). For an unannotated image, the LDA model produced by the training module is used to model the caption of the image, and all words in the modeling result (topic distribution and word distribution) are extracted as topic words of the image and added to the topic word set.

The topic word refinement module optimizes the topic word set produced by the topic word extraction module to obtain the most concise and most effective topic word set. In the result of LDA modeling of the image caption, if the probability of a topic word in the topic-word distribution is zero, the word is removed from the topic word set; if a topic word does not appear in the caption of the image, the word is removed from the topic word set; and if the topic word set contains duplicate words, the duplicates are removed and only one copy is kept. These operations yield a more refined topic word set. In short, refinement removes duplicate topic words, topic words whose probability is zero in the LDA modeling result, and topic words that do not appear in the image caption.
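The three refinement rules can be expressed as a single filtering pass. The sketch below assumes the topic words arrive as a list and the modeling result as a word-to-probability mapping; the function name `refine_topic_words`, the lowercase normalization, and the sample data are illustrative assumptions.

```python
def refine_topic_words(topic_words, word_probability, caption_tokens):
    """Apply the three refinement rules and keep first-seen order.

    topic_words      -- candidate topic words, possibly with duplicates
    word_probability -- word -> probability in the topic-word distribution
    caption_tokens   -- tokens of the image caption
    """
    caption_vocab = {token.lower() for token in caption_tokens}
    refined, seen = [], set()
    for word in topic_words:
        key = word.lower()
        if key in seen:
            continue  # rule 3: keep only one copy of a repeated word
        if word_probability.get(key, 0.0) == 0.0:
            continue  # rule 1: drop words whose probability is zero
        if key not in caption_vocab:
            continue  # rule 2: drop words absent from the caption
        seen.add(key)
        refined.append(key)
    return refined


candidates = ["cell", "Cell", "tumor", "stain", "mri"]
probabilities = {"cell": 0.3, "tumor": 0.0, "stain": 0.2, "mri": 0.1}
caption = ["Stain", "of", "a", "cell"]
print(refine_topic_words(candidates, probabilities, caption))  # ['cell', 'stain']
```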

The context sentence indexing module retrieves, from the text file of the biomedical image, the sets of sentences associated with the topic words. It uses Lucene as the search tool and takes each word in the refined topic word set as a query to retrieve all sentences that contain that topic word. When indexing is complete, each topic word has an associated sentence set. It should be understood that, although this embodiment uses Lucene for text retrieval, other text retrieval tools can provide the same function in place of Lucene.
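Because the text notes that other retrieval tools can stand in for Lucene, the retrieval step can be illustrated without it. The sketch below is a plain in-memory substitute, not Lucene's API: the regular-expression sentence splitter, the tokenizer, and the function names are all assumptions made for illustration.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercased alphanumeric tokens of one sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def index_sentences(text, topic_words):
    """Map every topic word to the sentences of the text that contain it."""
    sentences = split_sentences(text)
    token_sets = [tokenize(s) for s in sentences]
    return {
        word: [s for s, tokens in zip(sentences, token_sets) if word.lower() in tokens]
        for word in topic_words
    }
```

Matching is on whole tokens, so `cell` does not match `cells`; a real search engine would typically add stemming and scoring, which this sketch leaves out.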

The context generation module selects the most closely related sentence from the sentence set of each topic word and then gathers all of these sentences to form the context of the biomedical image; that is, the set of all most closely related sentences is the context. Preferably, the most closely related sentence is selected as follows: each sentence containing a given topic word is traversed; if the sentence also contains other topic words, its vote count increases accordingly, with each other topic word contributing one vote; the sentence with the most votes is then selected as the most closely related sentence for that topic word; and the most closely related sentences of all topic words are gathered to form the context.
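The voting scheme can be sketched as follows. The source does not say how ties are broken or whether a sentence chosen for two topic words appears twice in the context; this sketch assumes the first sentence wins a tie and that duplicates are dropped, and the names `build_context` and `_tokens` are illustrative.

```python
import re


def _tokens(sentence):
    """Lowercased alphanumeric tokens of one sentence."""
    return set(re.findall(r"[a-z0-9]+", sentence.lower()))


def build_context(sentences_by_word):
    """Pick the highest-voted sentence for each topic word.

    sentences_by_word -- topic word -> list of sentences containing it
    """
    topic_words = list(sentences_by_word)
    context = []
    for word in topic_words:
        best, best_votes = None, -1
        for sentence in sentences_by_word[word]:
            tokens = _tokens(sentence)
            # Each *other* topic word found in the sentence adds one vote.
            votes = sum(
                1 for other in topic_words
                if other != word and other.lower() in tokens
            )
            if votes > best_votes:  # strict '>' keeps the first sentence on ties
                best, best_votes = sentence, votes
        if best is not None and best not in context:
            context.append(best)
    return context
```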

The annotation generation module uses the LDA model obtained by the LDA training module to model the context of the biomedical image, obtaining its topic distribution and word distribution. It then multiplies the probability of each word in the topic-word distribution by the probability of the corresponding topic, takes the result as the weight of that word, sorts all words by weight in descending order, and selects the top several words as the annotation words of the biomedical image.
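The weighting step amounts to multiplying P(word | topic) by P(topic) and ranking the products. The source does not specify what happens when a word appears under several topics; the sketch below assumes the largest product is kept (summing the products would be another reasonable reading). The function name, the `top_n` parameter, and the sample numbers are illustrative.

```python
def rank_annotation_words(topic_probs, topic_word_probs, top_n=5):
    """Rank words by P(word | topic) * P(topic) and return the top_n words.

    topic_probs      -- list with P(topic k) for the image context
    topic_word_probs -- list of dicts, word -> P(word | topic k)
    """
    weights = {}
    for k, p_topic in enumerate(topic_probs):
        for word, p_word in topic_word_probs[k].items():
            weight = p_topic * p_word
            # Assumption: keep the largest product when a word recurs.
            if weight > weights.get(word, 0.0):
                weights[word] = weight
    # Descending weight; alphabetical order makes ties deterministic.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:top_n]]


topics = [0.7, 0.3]
words = [
    {"cell": 0.5, "stain": 0.3, "tumor": 0.2},
    {"tumor": 0.6, "mri": 0.4},
]
print(rank_annotation_words(topics, words, top_n=3))  # ['cell', 'stain', 'tumor']
```

In the usage example, `tumor` scores 0.14 under the first topic and 0.18 under the second, so 0.18 is kept and it ranks third, ahead of `mri` at 0.12.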

With reference to Fig. 2, the invention correspondingly also provides an LDA-based annotation method for biomedical images, comprising the following steps:

Step S01: select a portion of the biomedical images to form a training set, and extract the caption in the caption node from the text file of each biomedical image to form the training data set of the LDA model;

Step S02: train the LDA model, first sampling the topic assigned to each word and then computing the document-topic distribution and the topic-word distribution;

Step S03: for an unannotated image, model it with the trained LDA model and select all topic words to form the topic word set;

Step S04: optimize the topic word set by removing duplicate words, words whose probability is zero, and words that do not appear in the caption, thereby obtaining the refined topic word set;

Step S05: for each topic word, retrieve all sentences containing the word from the text file of the image to form a sentence set, recorded as the sentence set of that topic word;

Step S06: select the most closely related sentence from the sentence set of each topic word to form the context of the image;

Step S07: model the context with the trained LDA model, then multiply the probability of each word in the topic-word distribution by the probability of the corresponding topic and take the result as the weight of the word; sort all words in descending order and select the top several as the final annotations of the image.

With reference to Fig. 3, the specific operating steps of the LDA-based biomedical image annotation method according to one embodiment of the invention are as follows:

Step 1: start.

Step 2: select a portion of the biomedical images to form a training set, and extract the caption in the CAPTION node from the text file of each image to form the training data set of the LDA model; at the same time, specify the number of topics, the prior parameter of the document-topic distribution, and the prior parameter of the topic-word distribution.

Step 3: train the LDA model using the Gibbs sampling algorithm, first sampling the topic assigned to each word and then computing the document-topic distribution and the topic-word distribution.

Step 4: for an unannotated image, model it with the trained LDA model and select all topic words to form the topic word set.

Step 5: optimize the topic word set by removing duplicate words, words whose probability is zero, and words that do not appear in the caption, thereby obtaining the refined topic word set.

Step 6: for a topic word, use Lucene to retrieve all sentences containing the word from the text file of the image to form a sentence set, recorded as the sentence set of that topic word.

Step 7: if every topic word has a corresponding sentence set, go to step 8; otherwise return to step 6.

Step 8: using the context generation algorithm, select the most closely related sentence from the sentence set of each topic word to form the context of the image.

Step 9: model the context with the LDA model trained in step 3, then multiply the probability of each word in the topic-word distribution by the probability of the corresponding topic and take the result as the weight of the word; sort all words in descending order and select the top several as the final annotations of the image.

Step 10: if all unannotated images have been annotated, go to step 11; otherwise return to step 4.

Step 11: end.

Compared with the prior art, the invention makes full use of the captions of biomedical images and the corresponding text information: topic words of an image are mined from its caption and traced back to the text in which the image appears to generate a context, from which the annotation words of the image are then extracted. This greatly improves annotation accuracy and generates all of the annotations associated with an image in a single pass. Once biomedical images have been accurately annotated, related images can be found by keyword search, which is fast, convenient, and closer to users' text retrieval habits.

The LDA-based annotation system and method for biomedical images provided by the embodiments of the invention have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. Those of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.

Claims

1. An LDA-based annotation system for biomedical images, characterized by comprising an LDA training module, a topic word extraction module, a topic word refinement module, a context sentence indexing module, a context generation module, and an annotation generation module, wherein: the LDA training module trains the LDA model; the topic word extraction module performs LDA modeling on the caption of each biomedical image and then extracts all topic words from the resulting model; the topic word refinement module optimizes the topic word set produced by the topic word extraction module; the context sentence indexing module retrieves, from the text file of the biomedical image, the sets of sentences associated with the topic words; the context generation module selects the most closely related sentence from the sentence set of each topic word and then gathers all of these sentences to form the context of the biomedical image; the annotation generation module uses the LDA model obtained by the LDA training module to model the context of the biomedical image, obtaining its topic distribution and word distribution, then multiplies the probability of each word in the topic-word distribution by the probability of the corresponding topic, takes the result as the weight of that word, sorts all words by weight in descending order, and selects the top several words as the annotation words of the biomedical image; and wherein the data set of the LDA model consists of the captions of all biomedical images, the caption in the caption node being extracted from the text file corresponding to each biomedical image and the captions of all images together constituting the training data set of the LDA model.

2. The LDA-based annotation system for biomedical images according to claim 1, characterized in that the training module trains the LDA model using Gibbs sampling, first sampling the topic assigned to each word and then deriving the document-topic distribution and the topic-word distribution from these assignments.

3. The LDA-based annotation system for biomedical images according to claim 1, characterized in that the topic word refinement module optimizes the topic word set as follows: in the result of LDA modeling of the caption of the biomedical image, if the probability of a topic word in the topic-word distribution is zero, the word is removed from the topic word set; if a topic word does not appear in the caption of the biomedical image, the word is removed from the topic word set; and if the topic word set contains duplicate words, the duplicates are removed and only one copy is kept.

4. The LDA-based annotation system for biomedical images according to claim 1, characterized in that the context sentence indexing module uses the Lucene search tool, taking each word in the topic word set as a query, to retrieve all sentences that contain that topic word.

5. The LDA-based annotation system for biomedical images according to claim 1, characterized in that the most closely related sentence is selected as follows: each sentence containing a given topic word is traversed; if the sentence also contains other topic words, its vote count increases accordingly, with each other topic word contributing one vote; the sentence with the most votes is then selected as the most closely related sentence for that topic word; and the most closely related sentences of all topic words are gathered to form the context.

6. An LDA-based annotation method for biomedical images, characterized by comprising the following steps:

Step 1: select a portion of the biomedical images to form a training set, and extract the caption in the caption node from the text file of each biomedical image to form the training data set of the LDA model;

Step 2: train the LDA model, first sampling the topic assigned to each word and then computing the document-topic distribution and the topic-word distribution;

Step 3: for an unannotated image, model it with the trained LDA model and select all topic words to form the topic word set;

Step 4: optimize the topic word set by removing duplicate words, words whose probability is zero, and words that do not appear in the caption, thereby obtaining the refined topic word set;

Step 5: for each topic word, retrieve all sentences containing the word from the text file of the image to form a sentence set, recorded as the sentence set of that topic word;

Step 6: select the most closely related sentence from the sentence set of each topic word to form the context of the image;

Step 7: model the context with the trained LDA model, then multiply the probability of each word in the topic-word distribution by the probability of the corresponding topic and take the result as the weight of the word; sort all words in descending order and select the top several as the final annotations of the image.