CN105843868B - Medical case retrieval method based on a language model - Google Patents

Medical case retrieval method based on a language model

Info

Publication number
CN105843868B
CN105843868B (application CN201610154543.8A)
Authority
CN
China
Prior art keywords
case
language model
probability
word
cases
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610154543.8A
Other languages
Chinese (zh)
Other versions
CN105843868A (en)
Inventor
张引
姜利成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201610154543.8A priority Critical patent/CN105843868B/en
Publication of CN105843868A publication Critical patent/CN105843868A/en
Application granted granted Critical
Publication of CN105843868B publication Critical patent/CN105843868B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3334Selection or weighting of terms from queries, including natural language queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities

Abstract

The invention discloses a medical case retrieval method based on a language model. The steps are as follows: 1) extract structured individual cases from case books by OCR and text structuring; 2) preprocess all cases with a Chinese word segmentation tool, including segmentation and stop-word removal; 3) compute the unigram language model of each case by maximum likelihood estimation; 4) over all cases, count the number of words at each word-frequency level and fit a curve to the counted data; 5) smooth each case's unigram language model with Good-Turing estimation; 6) treating the whole case collection as one body, build a collection-wide language model and use it to correct the unigram model of each individual case; 7) perform case search with the corrected language models. The invention realizes information retrieval based on language models: an N-gram language model is built for every case, and the probability that a model generates the query text is used to rank the search results.

Description

Medical case retrieval method based on a language model
Technical field
The present invention relates to the field of information retrieval, and in particular to a medical case retrieval method based on a language model.
Background technique
A language model is a model that generates text according to probabilities. Given a sentence, i.e. a sequence of words, a language model yields the probability of that sequence, p(w1, ..., wn). Language models have many application scenarios, such as speech recognition, machine translation, part-of-speech (POS) tagging, handwriting recognition, and information retrieval.
The n-gram model trains quickly and computes text-generation probabilities efficiently, so it is well suited to information retrieval. The simplest n-gram model is the unigram model. For a sentence, i.e. a word sequence w1, ..., wn, the probability p(w1, ..., wn) equals, by the chain rule, p(w1) × p(w2|w1) × ... × p(wn|w1, ..., wn-1). Under the simplest assumption, namely that w1, ..., wn are mutually independent, this reduces to p(w1) × p(w2) × ... × p(wn). The language model obtained from this independence assumption is exactly the unigram model. In information retrieval applications, the unigram model must be smoothed to prevent the case P(term) = 0.
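As a minimal sketch of the unigram model described above (function and variable names are illustrative, not from the patent), the following shows both the maximum-likelihood estimate and the zero-probability problem that motivates smoothing:

```python
from collections import Counter

def train_unigram(tokens):
    """Maximum-likelihood unigram model: p(w) = count(w) / total tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

def sentence_prob(model, sentence_tokens):
    """p(w1..wn) = p(w1) * ... * p(wn) under the independence assumption."""
    p = 1.0
    for w in sentence_tokens:
        p *= model.get(w, 0.0)  # an unseen word contributes probability 0
    return p

doc = "the cat sat on the mat".split()
m = train_unigram(doc)
print(sentence_prob(m, ["the", "cat"]))  # (2/6) * (1/6)
print(sentence_prob(m, ["the", "dog"]))  # 0.0: unseen "dog" zeroes the whole product
```

The second call illustrates why P(term) = 0 is fatal for retrieval: one unseen query word makes the entire document score zero, regardless of how well the other words match.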
In practice, training or test data will contain words absent from the model dictionary. This is very common, because the dictionary cannot, and need not, include every word. In an n-gram model a word is represented as a vector whose dimension equals the dictionary size; if the dictionary is very large, the word vectors are very high-dimensional, the model needs more computation during training, and the training time grows, without any improvement to the model. Moreover, a rare word may appear only once or twice in the whole training set, and a probability estimate from one or two occurrences is grossly inaccurate and usually overfit. For these reasons, n-gram models generally require smoothing of the conditional word probabilities.
Summary of the invention
Most common information retrieval frameworks today are based on TF-IDF, which is in essence an optimized form of keyword matching and can only retrieve by keyword, whereas traditional Chinese medicine (TCM) cases have distinctive linguistic characteristics. TCM has a long history, so correspondingly the cases span a wide time range and include both classical and modern Chinese; the meaning of the same keyword can differ greatly between classical and modern Chinese, and keyword-based search performs poorly. To address this linguistic characteristic of case data, the present invention realizes information retrieval based on language models: an N-gram language model is built for every case, and the probability that a model generates the query text is used to rank the search results.
In information retrieval, a user usually constructs a query from words likely to appear in the articles of interest. Language-model-based retrieval builds on this idea: if a query is likely to be generated from an article, then that article is probably relevant to the query. The usual procedure is, for every article d, to train one language model Md, and then rank articles by the probability that their models generate the query.
To achieve the above object, the present invention adopts the following technical scheme:
A medical case retrieval method based on a language model comprises the following steps:
1) extract structured individual cases from case books by OCR and text structuring;
2) preprocess all cases with a Chinese word segmentation tool, including segmentation and stop-word removal, and build a dictionary;
3) compute the unigram language model of each case by maximum likelihood estimation;
4) over all cases, count the number N_tf of words at each word-frequency level, where the subscript tf denotes the word-frequency level, and fit a curve to the counted data with the formula

N_tf = (1 - θ)^(log tf) × θ

to obtain the curve parameter θ;
5) compute E(N_tf) from the curve fitted in step 4), then directly estimate the ratio E(N_(tf+1)) / E(N_tf); smooth the unigram language model of each case with Good-Turing estimation:

tf* = (tf + 1) × E(N_(tf+1)) / E(N_tf)

where tf* is the smoothed word-frequency level;
6) treating all cases together as the training text, repeat steps 2) to 5) with the whole case collection as one body to build a collection-wide language model, and use it to correct the unigram language model of each individual case by weighted addition:

P_sum(t|d) = ω × P_document(t|d) + (1 - ω) × P_corpus(t)

where P_sum(t|d) is the corrected unigram language model of an individual case, ω is the weight, P_document(t|d) is the smoothed unigram language model of the case, and P_corpus(t) is the language model of the entire case collection;
7) perform case search with the corrected language models. Specifically, the search of step 7) is realized with the language models as follows:
7.1) preprocess the query text, keeping the preprocessing identical to that of step 2), including whether stop words are removed and whether low-frequency words are filtered;
7.2) successively compute the probability that each case generates the query text, sort the cases by this generation probability, and return the top several cases as the result.
Further, the probability that each case generates the query text can be computed as a log probability:

log p(q|M_d) = Σ_i log p(w_i|M_d)

where M_d is the model of case d, log p(q|M_d) is the log probability that case d generates the query text q, and p(w_i|M_d) is the probability that case d generates the i-th word of the query text; the product over all query tokens in the original formula becomes a sum of logs.
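The ranking of step 7) can be sketched as follows; the per-case models and the query tokens are toy values, and all names are illustrative assumptions:

```python
import math

def log_prob(model, query_tokens):
    """log p(q|M_d) = sum_i log p(w_i|M_d); -inf if any token has zero probability."""
    total = 0.0
    for w in query_tokens:
        p = model.get(w, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

def rank_cases(models, query_tokens, k=3):
    """Sort case ids by the log probability of generating the query; return the top k."""
    ranked = sorted(models, key=lambda d: log_prob(models[d], query_tokens), reverse=True)
    return ranked[:k]

# Toy smoothed unigram models for two cases
models = {
    "case1": {"fever": 0.4, "cough": 0.1, "night": 0.5},
    "case2": {"fever": 0.1, "rash": 0.9},
}
print(rank_cases(models, ["fever"], k=2))  # ['case1', 'case2']
```

In the real method the models would already be smoothed and interpolated, so `log_prob` never actually hits a zero probability for in-vocabulary words.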
The Chinese word segmentation tool includes IKAnalyzer for Java and Jieba for Python.
Compared with the prior art, the present invention has the following beneficial effects:
1) The relevance of case search results is improved.
2) The language model can be used to predict the user's query input, simplifying user operation.
Detailed description of the invention
Fig. 1 is the overall flow of language model construction.
Fig. 2 is the implementation logic of language-model-based search.
Fig. 3 compares the fitted N_tf curve of the case collection's unigram model with the real data.
Fig. 4 shows part of the language model of one case.
Specific embodiment
The present invention is described in further detail below with reference to the drawings and a specific embodiment.
Fig. 1 shows the process of building the language models, corresponding to steps 1) to 6); Fig. 2 shows the implementation logic of language-model-based search, corresponding to step 7).
A medical case retrieval method based on a language model comprises the following steps:
1) extract structured individual cases from case books by OCR and text structuring;
2) preprocess all cases with an open-source Chinese word segmentation tool, such as IKAnalyzer for Java or Jieba for Python, including segmentation and stop-word removal, and build a dictionary;
3) compute the unigram language model of each case by maximum likelihood estimation;
4) over all cases, count the number N_tf of words at each word-frequency level, where the subscript tf denotes the word-frequency level, and fit a curve to the counted data with the formula

N_tf = (1 - θ)^(log tf) × θ

to obtain the curve parameter θ;
5) compute E(N_tf) from the curve fitted in step 4), then directly estimate the ratio E(N_(tf+1)) / E(N_tf); smooth the unigram language model of each case with Good-Turing estimation:

tf* = (tf + 1) × E(N_(tf+1)) / E(N_tf)

where tf* is the smoothed word-frequency level;
6) treating all cases together as the training text, repeat steps 2) to 5) with the whole case collection as one body to build a language model of the entire case collection, and use it to correct the unigram language model of each individual case by weighted addition:

P_sum(t|d) = ω × P_document(t|d) + (1 - ω) × P_corpus(t)

where P_sum(t|d) is the corrected unigram language model of an individual case, ω is the weight, P_document(t|d) is the smoothed unigram language model of the case, and P_corpus(t) is the language model of the entire case collection;
7) perform case search with the corrected language models, specifically:
7.1) preprocess the query text, keeping the preprocessing identical to that of step 2), including whether stop words are removed and whether low-frequency words are filtered;
7.2) successively compute, for each case, the log probability of generating the query text:

log p(q|M_d) = Σ_i log p(w_i|M_d)

where log p(q|M_d) is the log probability that case d generates the query text q, and p(w_i|M_d) is the probability that case d generates the i-th word of the query text.
Sort the cases by their generation probability and return the top several cases as the result.
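The frequency-of-frequencies counting in step 4) above can be sketched as follows (a minimal illustration over toy token lists; names are assumptions):

```python
from collections import Counter

def freq_of_freqs(case_token_lists):
    """N_tf: the number of distinct words that occur exactly tf times
    across the whole case collection."""
    word_counts = Counter()
    for tokens in case_token_lists:
        word_counts.update(tokens)
    # Count how many words share each frequency level tf
    return Counter(word_counts.values())

cases = [["a", "b", "a"], ["b", "c"]]
n_tf = freq_of_freqs(cases)
print(n_tf)  # 'a' and 'b' each occur twice (N_2 = 2), 'c' once (N_1 = 1)
```

These N_tf counts are exactly the data points to which the curve of step 4) is fitted.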
The above method is further described below with an embodiment; steps omitted from the embodiment are implemented according to the method above.
Embodiment
Case title: curing a child's hemiplegia by nourishing yin, clearing heat, dissolving stasis, and inducing resuscitation. | Patient: surname Wang, female, two and a half years old. | First visit: June 16, 1983. Chief complaint and medical history (as told by her parents): afternoon fever, impaired movement of the right limbs, and exotropia of the left eye for 40 days. The illness began in January 1983 with a week of continuous high fever, body temperature 39-40 °C, poor appetite, and emaciation; a chest X-ray at a hospital led to a diagnosis of right bronchial tuberculosis. She was hospitalized and treated with streptomycin, rimifon, and the like. After two months the body temperature gradually returned to normal and a follow-up chest X-ray showed improvement. In May 1983, however, high fever recurred, with body temperature above 40 °C, accompanied by drowsiness, projectile vomiting, clouded consciousness, and bouts of convulsions. Examination found: pupils equal and round, sluggish light reflex, neck stiffness, positive Kernig's sign, and positive right Babinski sign. Laboratory tests: white blood cell count 11200/mm3 (neutrophils 30%, lymphocytes 69%, eosinophils 1%); lumbar puncture showed raised cerebrospinal fluid pressure, cell count 1150/mm3, leukocytes 32%, trace protein, and a weakly positive five-tube sugar test. The diagnosis was tuberculous meningitis. After two days of emergency treatment her consciousness recovered and the temperature fell from its earlier level, but the right limbs were found weak and flaccid: the right lower limb dragged when striding and walking required assistance, the right upper limb could not be raised fully, and the right hand stayed clenched and was difficult to open. The left eye remained exotropic, with dulled spirit, slurred speech, restless sleep at night, and night sweats. Although antituberculosis drugs, "Maitong liquid", and other treatments were continued, the afternoon body temperature still persisted at 37-38 °C for more than 40 days.
All cases are preprocessed with the open-source Chinese word segmentation tool, including segmentation and stop-word removal, and a dictionary is built. The segmentation result is as follows:
Case | title | enriching yin | heat-clearing | dissolving stasis | have one's ideas straightened out | method | cure | child | hemiplegia | patient | Mr. Wang | female | 2 | year | half | first visit | 1983 | year | 6 | the moon | 16 | day |.| main suit | and | medical history | (| his father | in generation, tells |) | it is afternoon | fever |, | right side | limb Body | activity | obstacle | left eye | outer | strabismus | 40 | day | disease | start from | 1983 | year | 1 | the moon |, | continue | high fever | 1 | week | body temperature | 39 | 40 | DEG C | it is indigestion and loss of appetite | thin | through certain | hospital | shine | rabat | examine as | right side | bronchus | scrofula | be hospitalized | receive | strepto- Element | thunder rice | envelope | etc. | treatment | 2 | a | the moon | rear | body temperature | gradually | restore normal | rabat | check | improve | but in | 1983 | year | 5 | the moon | again | and high fever | body temperature | 40 | DEG C | more than | with | it is drowsiness | sprayability | vomiting | clouded in mind | it is a burst of | twitch | through looking into | Pupil | etc. | it is big | etc. 
| circle | it is right | light reflection | it is blunt | Xiang Qiang | gram | Ni Ge | sign | it is positive | right side | Babin this | base sign | it is positive | change Test | leucocyte | number | waist | wear | cerebrospinal fluid | pressure | increase | chemical examination | cerebrospinal fluid | cell | number | leucocyte | 32 | albumen | micro | Sugar five is managed | test | weakly positive | examine as | tubercular meningitis | rescue | treatment | two days | mind | revival | body temperature | earlier above | decline | But | discovery | right side | limbs | weak and limp | it is powerless | right | lower limb | stride | dilatory | walking | need people | assist in | it is right | upper limb | upper to lift | by Though limit | the right hand | be in | clench fist | shape is difficult | stretch | left eye | outward | strabismus | spirit | blunt | speech | unclear | night sleeps | uneasy | night sweat | | Still | continue | with | treating tuberculosis | drug | and | arteries and veins | logical liquid | etc. | treatment | but | it is daily | afternoon | body temperature | still | continue | | 37 |~| 38 | DEG C | | 40 | Yu
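The preprocessing that produced the segmented output above can be sketched as follows. This stand-in tokenizer splits on whitespace for illustration only; the patent's pipeline would call a real Chinese segmenter such as `jieba.cut` or IKAnalyzer instead, and the stop-word set here is an assumption:

```python
def preprocess(text, stopwords):
    """Tokenize and drop stop words.
    Stand-in tokenizer: splits on whitespace; a real Chinese pipeline
    would use e.g. jieba.cut(text) in place of str.split()."""
    return [w for w in text.split() if w not in stopwords]

stop = {"the", "of", "and"}
print(preprocess("fever of the right limb and left eye", stop))
# ['fever', 'right', 'limb', 'left', 'eye']
```

The same `preprocess` function must be applied to both the case texts and, later, the query text, so that the dictionaries match (the consistency requirement of step 7.1).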
(3) The unigram language model of each case is computed by maximum likelihood estimation. Two main problems must be solved. First, zero frequencies: any word that does not occur in a case receives a maximum-likelihood probability of zero, which defeats the search function. Second, the statistics of a single case are too sparse, so the language model of a single case must be corrected with the language model of the whole case collection.
(4) The number N_tf of words at each word-frequency level is counted over the entire case collection, where tf denotes the word-frequency level; for example, N_1 is the number of words whose frequency in the cases is 1. A curve is fitted to the counted data:

N_tf = (1 - θ)^(log tf) × θ

yielding the curve parameter θ. The fitted N_tf curve is shown in Fig. 3.
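A sketch of fitting θ by grid search is given below. The functional form, its normalization by the total number of distinct words, and the synthetic N_tf data are all assumptions made for illustration, since the patent does not spell out the fitting procedure:

```python
import math

def fit_theta(n_tf, thetas=None):
    """Grid-search theta for the assumed fitted form
    N_tf ~ T * (1 - theta)^(log tf) * theta,
    where T = total number of distinct words (sum of the observed N_tf)."""
    if thetas is None:
        thetas = [i / 1000 for i in range(1, 1000)]
    total = sum(n_tf.values())

    def sse(theta):
        # Sum of squared errors between observed and predicted counts
        return sum((n - total * (1 - theta) ** math.log(tf) * theta) ** 2
                   for tf, n in n_tf.items())

    return min(thetas, key=sse)

# Synthetic counts roughly generated from theta = 0.5
n_tf = {1: 500, 2: 309, 3: 234}
print(fit_theta(n_tf))  # a value near 0.5 for this synthetic data
```

In the real method the fitted curve replaces the raw, noisy N_tf counts wherever E(N_tf) is needed in step (5).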
(5) The language model of each case is smoothed with Good-Turing estimation:

tf* = (tf + 1) × E(N_(tf+1)) / E(N_tf)

Because the statistics of a single case are too sparse, the maximum-likelihood estimates E(N_tf) are inaccurate, so the curve fitted in step (4) is used instead. The ratio E(N_(tf+1)) / E(N_tf) can be computed directly from the fitted curve of step (4), without estimating each E(N_tf) separately.
(6) A single case contains a large number of unseen words, and even the words that do occur may be distributed atypically, so its model is not stable enough. Using the same Good-Turing estimation, a language model is built for the entire case collection; since the sample size is much larger, the collection model is comparatively stable, and it also helps to distinguish among words that did not occur in a given case. There are usually two ways to combine different language models: weighted addition (also called interpolation) and weighted multiplication. The present invention uses weighted addition, whose advantage is that the combined probabilities remain normalized, i.e. the probabilities of all words still sum to 1. Part of the language model obtained for the example case through steps (3), (4), and (5) is shown in Fig. 4.
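The weighted addition of step (6) can be sketched as follows; the toy distributions and ω value are illustrative assumptions. The check at the end shows the claimed advantage: the combined probabilities still sum to 1:

```python
def interpolate(p_doc, p_corpus, omega=0.7):
    """Weighted addition: P_sum(t|d) = omega * P_document(t|d) + (1 - omega) * P_corpus(t).
    Every word of the corpus vocabulary gets nonzero probability, and the
    result remains a normalized distribution."""
    vocab = set(p_doc) | set(p_corpus)
    return {t: omega * p_doc.get(t, 0.0) + (1 - omega) * p_corpus.get(t, 0.0)
            for t in vocab}

p_doc = {"fever": 0.6, "cough": 0.4}
p_corpus = {"fever": 0.3, "cough": 0.2, "rash": 0.5}
p = interpolate(p_doc, p_corpus)
print(sum(p.values()))  # 1.0 (up to float rounding): still a distribution
print(p["rash"])        # nonzero even though "rash" never occurs in the case
```

This is why an unseen-in-the-case query word no longer zeroes the whole generation probability: the corpus term (1 - ω) × P_corpus(t) keeps it positive.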
(7) Case search is realized with the language models. The text-generation probability is computed as follows: 1. preprocess the text, identically to the preprocessing used when the N-gram models were built, including whether stop words are removed and whether low-frequency words are filtered; 2. successively compute the probability that each case generates the text. Note that what is actually computed is the log probability, in order to prevent floating-point underflow.
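The underflow concern at the end of step (7) is easy to demonstrate: multiplying many small per-word probabilities collapses to 0.0 in double precision, while the equivalent sum of logs stays finite. The probability values here are illustrative:

```python
import math

# A query of 80 tokens, each with probability 1e-5 under some case model.
probs = [1e-5] * 80

direct = 1.0
for p in probs:
    direct *= p          # true value is 1e-400, far below the float64 minimum

log_sum = sum(math.log(p) for p in probs)

print(direct)   # 0.0: the product underflowed
print(log_sum)  # 80 * log(1e-5), a perfectly ordinary finite number
```

Since log is monotonic, ranking cases by log probability gives exactly the same order as ranking by probability, so nothing is lost by the switch.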

Claims (4)

1. A medical case retrieval method based on a language model, characterized by comprising the following steps:
1) extracting structured individual cases from case books by OCR and text structuring;
2) preprocessing all cases with a Chinese word segmentation tool, including segmentation and stop-word removal, and building a dictionary;
3) computing the unigram language model of each case by maximum likelihood estimation;
4) over all cases, counting the number N_tf of words at each word-frequency level, where the subscript tf denotes the word-frequency level, and fitting a curve to the counted data with the formula

N_tf = (1 - θ)^(log tf) × θ

to obtain the curve parameter θ;
5) computing E(N_tf) from the curve fitted in step 4), then directly estimating the ratio E(N_(tf+1)) / E(N_tf), and smoothing the unigram language model of each case with Good-Turing estimation:

tf* = (tf + 1) × E(N_(tf+1)) / E(N_tf)

where tf* is the smoothed word-frequency level and E(·) denotes the maximum likelihood estimate;
6) treating all cases together as the training text, repeating steps 2) to 5) with the whole case collection as one body to build a collection-wide language model, and correcting the unigram language model of each individual case by weighted addition:

P_sum(t|d) = ω × P_document(t|d) + (1 - ω) × P_corpus(t)

where P_sum(t|d) is the corrected unigram language model of an individual case, ω is the weight, P_document(t|d) is the smoothed unigram language model of the case, and P_corpus(t) is the language model of the entire case collection;
7) realizing case search with the corrected language models.
2. The medical case retrieval method based on a language model of claim 1, characterized in that step 7) specifically comprises:
7.1) preprocessing the query text, the preprocessing being kept identical to that of step 2), including whether stop words are removed and whether low-frequency words are filtered;
7.2) successively computing the probability that each case generates the query text, sorting the cases by their generation probability, and returning the top several cases as the result.
3. The medical case retrieval method based on a language model of claim 2, characterized in that the probability that each case generates the query text is a log probability, computed as

log p(q|M_d) = Σ_i log p(w_i|M_d)

where log p(q|M_d) is the log probability that case d generates the query text q, and p(w_i|M_d) is the probability that case d generates the i-th word of the query text.
4. The medical case retrieval method based on a language model of claim 1, characterized in that the Chinese word segmentation tool includes IKAnalyzer for Java and Jieba for Python.
CN201610154543.8A 2016-03-17 2016-03-17 Medical case retrieval method based on a language model Active CN105843868B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610154543.8A CN105843868B (en) 2016-03-17 2016-03-17 Medical case retrieval method based on a language model


Publications (2)

Publication Number Publication Date
CN105843868A CN105843868A (en) 2016-08-10
CN105843868B true CN105843868B (en) 2019-03-26

Family

ID=56587237

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610154543.8A Active CN105843868B (en) 2016-03-17 2016-03-17 Medical case retrieval method based on a language model

Country Status (1)

Country Link
CN (1) CN105843868B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709520B (en) * 2016-12-23 2019-05-31 浙江大学 A kind of case classification method based on topic model
CN108172304B (en) * 2017-12-18 2021-04-02 广州七乐康药业连锁有限公司 Medical information visualization processing method and system based on user medical feedback
CN109192299A (en) * 2018-08-13 2019-01-11 中国科学院计算技术研究所 A kind of medical analysis auxiliary system based on convolutional neural networks
CN109299357B (en) * 2018-08-31 2022-04-12 昆明理工大学 Laos language text subject classification method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6052682A (en) * 1997-05-02 2000-04-18 Bbn Corporation Method of and apparatus for recognizing and labeling instances of name classes in textual environments
CN101470701A (en) * 2007-12-29 2009-07-01 日电(中国)有限公司 Text analyzer supporting semantic rule based on finite state machine and method thereof
US20130173610A1 (en) * 2011-12-29 2013-07-04 Microsoft Corporation Extracting Search-Focused Key N-Grams and/or Phrases for Relevance Rankings in Searches
CN103187052B (en) * 2011-12-29 2015-09-02 北京百度网讯科技有限公司 A kind of method and device setting up the language model being used for speech recognition
CN104376842A (en) * 2013-08-12 2015-02-25 清华大学 Neural network language model training method and device and voice recognition method

Also Published As

Publication number Publication date
CN105843868A (en) 2016-08-10


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant