Background Art
With the development of digital image processing and the growing popularity of devices capable of taking pictures, such as digital cameras, the number of images has grown exponentially. At the same time, the rapid development of the Internet has made the dissemination and sharing of images faster and more convenient. To organize, query, and browse such large-scale image resources effectively, image retrieval technology has emerged and has become a research focus in the field of computer vision.
Existing image retrieval methods fall into two broad categories: content-based image retrieval (CBIR) and text-based image retrieval. CBIR requires the user to supply an image as the query. The system extracts low-level visual features of the image, such as color, texture, and shape, builds a visual index for the image, and then finds matches according to the visual similarity between the images in the database and the query. Because low-level visual features are inconsistent with high-level semantic concepts, the so-called "semantic gap", the performance of CBIR is unsatisfactory. Text-based image retrieval requires a text index to be built for the images in advance. At retrieval time the user only needs to submit text as the query, and the system finds and returns similar images according to text relevance, so retrieval of images is reduced to retrieval of text keywords.
Compared with CBIR, text-based image retrieval only requires the user to submit text keywords. It is convenient and fast and is preferred by users, and it has therefore become the dominant approach in mainstream commercial image search engines. However, this approach requires a text index for the images, that is, semantic annotation of the images, which is a highly challenging task in text-based image retrieval and has become its top priority. One traditional approach is manual annotation, but it is time-consuming and labor-intensive and clearly cannot cope with large-scale web images. How to dispense with manual intervention and annotate images automatically, quickly, and efficiently has therefore become very important.
To automate image annotation, one existing approach in the prior art is to classify the image and use the classification result as its annotation. Specifically, each semantic keyword is treated as a class label, a number of classifiers are trained on this basis, and the classifiers are then used to classify unannotated images; the classes assigned to an image are its annotations. Many mature classification algorithms exist, such as support vector machines and hidden Markov models.
However, annotation by classification depends on the accuracy of the classification algorithm, and although current classification algorithms are fairly accurate, they still have a certain error rate. In addition, most existing classifiers, such as support vector machines, are binary classifiers, so an image with multiple annotations requires multiple classifiers and multiple classification passes, which is inefficient.
Therefore, there is a need for an LDA-based annotation system and method for biomedical images that meets these needs.
Summary of the Invention
The object of the present invention is to provide an LDA-based annotation system and method for biomedical images that is accurate and convenient.
To this end, the invention provides an LDA-based annotation system for biomedical images, comprising an LDA training module, a topic word extraction module, a topic word refinement module, a context sentence indexing module, a context generation module, and an annotation generation module. The LDA training module trains the LDA model. The topic word extraction module performs LDA modeling on the caption of each biomedical image and then extracts all topic words from the resulting model. The topic word refinement module optimizes the topic word set produced by the topic word extraction module. The context sentence indexing module indexes, from the text file of the biomedical image, the sentence sets associated with the topic words. The context generation module selects the closest sentence from the sentence set of each topic word and then gathers all the closest sentences to form the context of the biomedical image. The annotation generation module models the context of the biomedical image with the LDA model obtained by the LDA training module to obtain the topic distribution and word distribution of the image, multiplies the probability of each word in the topic-word distribution by the probability of the corresponding topic, takes the result as the weight of that word, sorts all words by weight in descending order, and selects the first several words as the annotation words of the biomedical image.
Preferably, the data set of the LDA model consists of the captions of all the biomedical images: the caption in the caption node is extracted from the text file corresponding to each biomedical image, and the captions of all images together constitute the training data set of the LDA model.
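As an illustration of how such a training set might be assembled, the following minimal sketch reads caption text out of XML documents with Python's standard `xml.etree.ElementTree`. The embodiment describes the text files as XML with a caption node; the exact tag name `caption`, the helper names `extract_captions` and `build_training_set`, and the sample document are assumptions made only for this example.

```python
import xml.etree.ElementTree as ET


def extract_captions(xml_text, tag="caption"):
    """Return the text of every <caption> element in one XML document.

    The tag name is an assumption; real corpora may use a different
    (case-sensitive) element name.
    """
    root = ET.fromstring(xml_text)
    captions = []
    for node in root.iter(tag):
        # itertext() also collects text nested inside child elements;
        # split()/join() normalizes runs of whitespace.
        text = " ".join("".join(node.itertext()).split())
        if text:
            captions.append(text)
    return captions


def build_training_set(xml_documents, tag="caption"):
    """Gather the captions of all documents into one training corpus."""
    corpus = []
    for xml_text in xml_documents:
        corpus.extend(extract_captions(xml_text, tag))
    return corpus


# Hypothetical sample document, not taken from any real data set.
sample = (
    "<article><fig><caption>MRI scan of the <b>left</b> knee.</caption></fig>"
    "<body>Text.</body></article>"
)
print(build_training_set([sample]))  # ['MRI scan of the left knee.']
```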
Preferably, the training module trains the LDA model by Gibbs sampling: the topic assignment of each word is sampled first, and the document-topic distribution and topic-word distribution are then derived from these assignments.
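The specification does not give the sampling formulas. For reference, the standard collapsed Gibbs sampling formulation of LDA with symmetric Dirichlet priors, which is assumed here and may differ in detail from what the inventors used, can be written as follows. Here K is the number of topics, V the vocabulary size, α and β the prior parameters, n_{d,k} the number of words in document d assigned to topic k, n_{k,v} the number of times word v is assigned to topic k, n_k the total number of words assigned to topic k, n_d the length of document d, and the superscript -i means the count excludes the token currently being sampled.

```latex
% Sampling the topic of token i (word w_i in document d):
p(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
  \left(n_{d,k}^{-i} + \alpha\right)
  \frac{n_{k,w_i}^{-i} + \beta}{n_{k}^{-i} + V\beta}

% Estimates derived from the final assignments:
\theta_{d,k} = \frac{n_{d,k} + \alpha}{n_{d} + K\alpha},
\qquad
\phi_{k,v} = \frac{n_{k,v} + \beta}{n_{k} + V\beta}
```

The first expression corresponds to sampling the topic of each word; the second line corresponds to deriving the document-topic distribution θ and the topic-word distribution φ from those samples.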
Preferably, the topic word refinement module optimizes the topic word set as follows. In the result of the LDA model's modeling of the caption of the biomedical image, if the probability of a topic word in the topic-word distribution is zero, that word is removed from the topic word set; if a topic word does not appear in the caption of the biomedical image, that word is removed from the topic word set; and if the topic word set contains repeated words, the repeats are removed and only one copy is kept.
Preferably, the context sentence indexing module uses the Lucene retrieval tool, taking each word in the topic word set as a query and retrieving all sentences that contain that topic word.
Preferably, the closest sentence is selected as follows: each sentence containing a given topic word is traversed; if the sentence also contains other topic words, its vote count increases accordingly, one vote per topic word; the sentence with the most votes is then chosen as the closest sentence of that topic word. The closest sentences of all topic words are gathered to form the context.
The invention also provides an LDA-based annotation method for biomedical images, comprising the following steps:
Step 1: Select a portion of the biomedical images to form a training set, and extract the caption in the caption node from the text file of each biomedical image to form the training data set of the LDA model.
Step 2: Train the LDA model by first sampling the topic assignment of each word and then computing the document-topic distribution and the topic-word distribution.
Step 3: For an unannotated image, model it with the trained LDA model and select all topic words to form a topic word set.
Step 4: Optimize the topic word set by removing repeated words, words whose probability is zero, and words that do not appear in the caption, thereby obtaining a refined topic word set.
Step 5: For each topic word, retrieve all sentences containing that word from the text file of the image to form a sentence set, recorded as the sentence set of that topic word.
Step 6: Select the closest sentence from the sentence set of each topic word to form the context of the image.
Step 7: Model the context with the trained LDA model, multiply the probability of each word in the topic-word distribution by the probability of the corresponding topic, and take the result as the weight of the word; sort all words in descending order of weight and select the first several as the final annotations of the image.
Compared with the prior art, the invention makes full use of the captions and text files associated with the images in the data set to mine annotation words, achieves high accuracy, and can generate multiple annotation words at once. Once biomedical images are accurately annotated, related images can be found by keyword search, which is convenient and fast and better matches users' text retrieval habits.
Detailed Description of the Embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
As described above, the invention annotates biomedical images. In a biomedical image corpus, every image has a corresponding text file. Exploiting this property, the invention proposes an annotation method for biomedical images based on LDA (Latent Dirichlet Allocation): LDA is used to extract topic words from the caption of an image, a context is then extracted from the image's text file according to these topic words, and finally LDA is applied again to model the context. The resulting topic words are the final annotations of the biomedical image.
Specifically, referring to Fig. 1, the invention provides an LDA-based annotation system for biomedical images, comprising an LDA training module, a topic word extraction module, a topic word refinement module, a context sentence indexing module, a context generation module, and an annotation generation module.
The LDA training module trains the LDA model on a training data set to produce the document-topic distribution and the topic-word distribution. In the invention, the data set of the LDA model consists of the captions of all the biomedical images. The content of the caption node, that is, the caption of the image, is extracted from the text file (in XML format) corresponding to each biomedical image, and the captions of all images together constitute the training data set of the LDA model. The number of topics and the Dirichlet prior parameters of the document-topic distribution and the topic-word distribution are set to empirical values. The LDA training module trains the LDA model by Gibbs sampling: the topic assignment of each word is sampled first, and the document-topic distribution and topic-word distribution are then derived from these assignments.
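The training procedure described above can be illustrated with a minimal collapsed Gibbs sampler in plain Python. This is a sketch under stated assumptions rather than the implementation used in the embodiment: it assumes symmetric Dirichlet priors `alpha` and `beta`, documents already converted to lists of integer word ids, and a fixed number of sweeps; the function name `train_lda` and all default values are hypothetical.

```python
import random


def train_lda(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=200, seed=0):
    """Collapsed Gibbs sampling for LDA (illustrative sketch).

    docs: list of documents, each a list of word ids in [0, vocab_size).
    Returns (theta, phi): document-topic and topic-word distributions.
    """
    rng = random.Random(seed)
    K, V = num_topics, vocab_size
    n_dk = [[0] * K for _ in docs]      # topic counts per document
    n_kv = [[0] * V for _ in range(K)]  # word counts per topic
    n_k = [0] * K                       # total tokens per topic
    z = []                              # topic assignment of every token

    # Random initial assignment of a topic to every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(K)
            z_d.append(k)
            n_dk[d][k] += 1
            n_kv[k][w] += 1
            n_k[k] += 1
        z.append(z_d)

    # Resample the topic of each token given all other assignments.
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                n_dk[d][k] -= 1
                n_kv[k][w] -= 1
                n_k[k] -= 1
                weights = [
                    (n_dk[d][t] + alpha) * (n_kv[t][w] + beta) / (n_k[t] + V * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                n_dk[d][k] += 1
                n_kv[k][w] += 1
                n_k[k] += 1

    # Derive the two distributions from the final counts.
    theta = [
        [(n_dk[d][k] + alpha) / (len(doc) + K * alpha) for k in range(K)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(n_kv[k][v] + beta) / (n_k[k] + V * beta) for v in range(V)]
        for k in range(K)
    ]
    return theta, phi
```

Each row of `theta` and each row of `phi` sums to one by construction. A production system would more likely rely on an existing topic-modeling library than on a hand-written sampler.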
The topic word extraction module performs LDA modeling on the caption of each biomedical image and then extracts all topic words from the resulting model (topic distribution and word distribution). For an unannotated image, the LDA model produced by the training module is used to model the image's caption, and all words in the modeling result (topic distribution and word distribution) are extracted as the topic words of the image and added to the topic word set.
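The specification does not say how a trained model is applied to an unseen caption. One common technique, assumed here purely for illustration, is "fold-in" inference: the topic-word distribution `phi` from training is held fixed and only the topic assignments of the new caption's tokens are resampled. The function name `infer_topic_distribution`, the prior `alpha`, and the iteration count are hypothetical, and `phi` is assumed to contain strictly positive probabilities.

```python
import random


def infer_topic_distribution(doc, phi, alpha=0.1, iters=100, seed=0):
    """Estimate the topic distribution of one new document.

    doc: list of word ids of the caption (or context) to be modeled.
    phi: trained topic-word distribution, a K x V list of positive floats.
    Returns theta, a list of K topic probabilities for this document.
    """
    rng = random.Random(seed)
    K = len(phi)
    z = [rng.randrange(K) for _ in doc]  # initial topic of each token
    n_k = [0] * K                        # topic counts within this document
    for k in z:
        n_k[k] += 1

    for _ in range(iters):
        for i, w in enumerate(doc):
            n_k[z[i]] -= 1
            # phi is fixed, so only the document-level counts change.
            weights = [(n_k[k] + alpha) * phi[k][w] for k in range(K)]
            z[i] = rng.choices(range(K), weights=weights)[0]
            n_k[z[i]] += 1

    total = len(doc) + K * alpha
    return [(n_k[k] + alpha) / total for k in range(K)]
```

The returned topic distribution, together with `phi`, is the "modeling result" from which candidate topic words can then be read off.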
The topic word refinement module optimizes the topic word set produced by the topic word extraction module to obtain the most concise and most effective topic word set. In the result of the LDA model's modeling of the image caption, if the probability of a topic word in the topic-word distribution is zero, that word is removed from the topic word set; if a topic word does not appear in the caption of the image, that word is removed from the topic word set; and if the topic word set contains repeated words, the repeats are removed and only one copy is kept. In short, refinement removes repeated topic words, topic words whose probability is zero in the LDA modeling result, and topic words not contained in the image caption, yielding a more refined topic word set.
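The three refinement rules can be sketched as a single filtering pass. The function name `refine_topic_words`, the dictionary `word_prob` mapping each word to its probability in the topic-word distribution, and the lowercase alphanumeric tokenization of the caption are assumptions made for this example.

```python
import re


def refine_topic_words(topic_words, word_prob, caption):
    """Apply the three refinement rules to a list of candidate topic words.

    topic_words: candidate words, possibly with repeats.
    word_prob:   dict mapping a lowercase word to its probability in the
                 topic-word distribution (missing words count as zero).
    caption:     caption text of the image.
    """
    caption_tokens = set(re.findall(r"[a-z0-9]+", caption.lower()))
    refined = []
    seen = set()
    for word in topic_words:
        key = word.lower()
        if key in seen:                     # rule 3: keep only one copy
            continue
        if word_prob.get(key, 0.0) == 0.0:  # rule 1: zero probability
            continue
        if key not in caption_tokens:       # rule 2: absent from caption
            continue
        seen.add(key)
        refined.append(key)
    return refined


words = ["tumor", "MRI", "tumor", "liver", "scan"]
probs = {"tumor": 0.3, "mri": 0.2, "liver": 0.1, "scan": 0.0}
print(refine_topic_words(words, probs, "MRI of a brain tumor."))  # ['tumor', 'mri']
```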
The context sentence indexing module indexes, from the text file of the biomedical image, the sentence sets associated with the topic words. The indexing module uses Lucene as its retrieval tool: each word in the refined topic word set is used as a query to retrieve all sentences containing that topic word. When indexing is complete, every topic word has an associated sentence set. Note that although this embodiment uses Lucene for text retrieval, other text retrieval tools exist that can perform the same function in its place.
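Since the embodiment notes that other retrieval tools can stand in for Lucene, the idea can be illustrated with a small in-memory inverted index in Python. This is a simplified substitute, not Lucene and not its API: the punctuation-based sentence splitter, the lowercase alphanumeric tokenizer, and the names `build_sentence_index` and `sentences_for` are all assumptions of this sketch.

```python
import re
from collections import defaultdict


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' plus whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase alphanumeric tokens of a sentence."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def build_sentence_index(text):
    """Map every token to the ids of the sentences that contain it."""
    sentences = split_sentences(text)
    index = defaultdict(set)
    for sid, sentence in enumerate(sentences):
        for token in tokenize(sentence):
            index[token].add(sid)
    return sentences, index


def sentences_for(topic_words, sentences, index):
    """Return, for each topic word, the list of sentences containing it."""
    return {
        word: [sentences[i] for i in sorted(index.get(word.lower(), ()))]
        for word in topic_words
    }


text = "The MRI shows a tumor. The liver is normal. A second MRI was taken."
sentences, index = build_sentence_index(text)
print(sentences_for(["mri", "tumor"], sentences, index))
```

After this step every topic word has an associated sentence set, which may be empty if the word never occurs in the text file.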
The context generation module selects the closest sentence from the sentence set of each topic word and then gathers all the closest sentences to form the context of the biomedical image; that is, the set of all closest sentences is the context. Preferably, the closest sentence is selected as follows: each sentence containing a given topic word is traversed; if the sentence also contains other topic words, its vote count increases accordingly, one vote per topic word; the sentence with the most votes is then chosen as the closest sentence of that topic word. The closest sentences of all topic words are gathered to form the context.
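The voting scheme can be sketched as follows. The text does not specify how ties are broken or whether a sentence chosen for several topic words appears more than once in the context; this sketch assumes that the first sentence with the highest vote count wins and that duplicates are dropped. The names `closest_sentence` and `build_context` are hypothetical.

```python
import re


def tokenize(sentence):
    """Lowercase alphanumeric tokens of a sentence."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def closest_sentence(word, candidates, topic_words):
    """Pick the candidate sentence that contains the most other topic words."""
    others = {w.lower() for w in topic_words} - {word.lower()}
    best, best_votes = None, -1
    for sentence in candidates:
        votes = len(others & set(tokenize(sentence)))  # one vote per topic word
        if votes > best_votes:  # assumption: the first sentence wins a tie
            best, best_votes = sentence, votes
    return best


def build_context(sentence_sets):
    """sentence_sets: dict mapping each topic word to its sentence set."""
    topic_words = list(sentence_sets)
    context = []
    for word in topic_words:
        best = closest_sentence(word, sentence_sets[word], topic_words)
        if best is not None and best not in context:  # assumption: no duplicates
            context.append(best)
    return context


sets = {
    "mri": ["The MRI shows a tumor.", "A second MRI was taken."],
    "tumor": ["The MRI shows a tumor."],
    "liver": ["The liver is normal."],
}
print(build_context(sets))  # ['The MRI shows a tumor.', 'The liver is normal.']
```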
The annotation generation module models the context of the biomedical image with the LDA model obtained by the LDA training module to obtain the topic distribution and word distribution of the image. It then multiplies the probability of each word in the topic-word distribution by the probability of the corresponding topic and takes the result as the weight of that word, sorts all words by weight in descending order, and selects the first several words as the annotation words of the biomedical image.
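The weighting and ranking step can be sketched as below. The description does not say how to treat a word that has non-zero probability under several topics; this sketch assumes the largest per-topic product is kept as the word's weight (summing the products would be an equally plausible reading). The function name `rank_annotation_words` and the parameter `top_n` are hypothetical.

```python
def rank_annotation_words(theta, phi, vocab, top_n=5):
    """Weight each word by p(topic) * p(word | topic) and return the top words.

    theta: list of K topic probabilities for the context.
    phi:   K x V topic-word distribution.
    vocab: list of V words; vocab[v] is the word with id v.
    Returns a list of (word, weight) pairs, highest weight first.
    """
    weights = {}
    for k, topic_prob in enumerate(theta):
        for v, word_prob in enumerate(phi[k]):
            weight = topic_prob * word_prob
            # Assumption: a word keeps its largest weight over all topics;
            # words whose weight is zero everywhere are left out.
            if weight > weights.get(vocab[v], 0.0):
                weights[vocab[v]] = weight
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_n]


theta = [0.75, 0.25]
phi = [[0.6, 0.4, 0.0], [0.0, 0.0, 1.0]]
vocab = ["tumor", "mri", "liver"]
for word, weight in rank_annotation_words(theta, phi, vocab, top_n=2):
    print(word, round(weight, 3))
```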
Referring to Fig. 2, the invention correspondingly also provides an LDA-based annotation method for biomedical images, comprising the following steps:
Step S01: Select a portion of the biomedical images to form a training set, and extract the caption in the caption node from the text file of each biomedical image to form the training data set of the LDA model.
Step S02: Train the LDA model by first sampling the topic assignment of each word and then computing the document-topic distribution and the topic-word distribution.
Step S03: For an unannotated image, model it with the trained LDA model and select all topic words to form a topic word set.
Step S04: Optimize the topic word set by removing repeated words, words whose probability is zero, and words that do not appear in the caption, thereby obtaining a refined topic word set.
Step S05: For each topic word, retrieve all sentences containing that word from the text file of the image to form a sentence set, recorded as the sentence set of that topic word.
Step S06: Select the closest sentence from the sentence set of each topic word to form the context of the image.
Step S07: Model the context with the trained LDA model, multiply the probability of each word in the topic-word distribution by the probability of the corresponding topic, and take the result as the weight of the word; sort all words in descending order of weight and select the first several as the final annotations of the image.
Referring also to Fig. 3, the specific operating steps of the LDA-based biomedical image annotation method according to one embodiment of the invention are as follows:
Step 1: Start.
Step 2: Select a portion of the biomedical images to form a training set, and extract the caption in the CAPTION node from the text file of each image to form the training data set of the LDA model. Also specify the number of topics, the prior parameter of the document-topic distribution, and the prior parameter of the topic-word distribution.
Step 3: Train the LDA model with the Gibbs sampling algorithm: first sample the topic assignment of each word, then compute the document-topic distribution and the topic-word distribution.
Step 4: For an unannotated image, model it with the trained LDA model and select all topic words to form a topic word set.
Step 5: Optimize the topic word set by removing repeated words, words whose probability is zero, and words that do not appear in the caption, thereby obtaining a refined topic word set.
Step 6: For a topic word, use Lucene to retrieve all sentences containing that word from the text file of the image to form a sentence set, recorded as the sentence set of that topic word.
Step 7: If every topic word has a corresponding sentence set, go to Step 8; otherwise return to Step 6.
Step 8: Using the context generation algorithm, select the closest sentence from the sentence set of each topic word to form the context of the image.
Step 9: Model the context with the LDA model trained in Step 3, multiply the probability of each word in the topic-word distribution by the probability of the corresponding topic, and take the result as the weight of the word; sort all words in descending order of weight and select the first several as the final annotations of the image.
Step 10: If all unannotated images have been annotated, go to Step 11; otherwise return to Step 4.
Step 11: End.
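The control flow of the annotation loop in the steps above (one outer loop over unannotated images and one inner loop over topic words) can be sketched as a small driver function. Every stage is passed in as a callable so that the sketch stays independent of any particular LDA or retrieval implementation; the function name `annotate_all` and all parameter names are hypothetical.

```python
def annotate_all(images, model_caption, refine, retrieve, build_context, rank):
    """Run the annotation loop over every unannotated image.

    images:        dict mapping image id -> (caption, full_text).
    model_caption: caption -> candidate topic words (model the caption).
    refine:        (candidates, caption) -> refined topic words.
    retrieve:      (topic word, full_text) -> sentences containing the word.
    build_context: dict of sentence sets -> list of context sentences.
    rank:          context -> final annotation words.
    """
    annotations = {}
    for image_id, (caption, full_text) in images.items():  # loop over images
        topic_words = refine(model_caption(caption), caption)
        sentence_sets = {}
        for word in topic_words:                            # loop over topic words
            sentence_sets[word] = retrieve(word, full_text)
        context = build_context(sentence_sets)
        annotations[image_id] = rank(context)
    return annotations
```

Training (Steps 2 and 3) happens once before this loop and is therefore outside the function.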
Compared with the prior art, the invention makes full use of the captions of biomedical images and the corresponding text files: topic words are mined from the caption and traced back to the text in which the image appears, a context is generated, and the annotation words of the image are then extracted from it. This substantially improves annotation accuracy and generates the multiple annotations associated with an image at once. Once biomedical images are accurately annotated, related images can be found by keyword search, which is convenient and fast and better matches users' text retrieval habits.
The LDA-based annotation system and method for biomedical images provided by the embodiments of the invention have been described in detail above. Specific examples have been used herein to explain the principles and implementation of the invention, and the description of the embodiments is intended only to aid understanding of the method and its core idea. A person of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementation and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.