CN105843868B - Medical case retrieval method based on a language model - Google Patents

Medical case retrieval method based on a language model

Info

Publication number
CN105843868B
CN105843868B (application CN201610154543.8A)
Authority
CN
China
Prior art keywords
case
language model
probability
word
cases
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610154543.8A
Other languages
Chinese (zh)
Other versions
CN105843868A (en)
Inventor
张引
姜利成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN201610154543.8A priority Critical patent/CN105843868B/en
Publication of CN105843868A publication Critical patent/CN105843868A/en
Application granted granted Critical
Publication of CN105843868B publication Critical patent/CN105843868B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3334Selection or weighting of terms from queries, including natural language queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities

Abstract

The invention discloses a medical case retrieval method based on a language model. The steps are as follows: 1) extract structured individual cases from case books by OCR and text structuring; 2) preprocess all cases with a Chinese word segmentation tool, including segmentation and stop-word removal; 3) compute the unigram language model of each case by maximum likelihood estimation; 4) over all cases, count the number of words at each word-frequency level and fit a curve to the counted data; 5) smooth each case's unigram language model with Good-Turing estimation; 6) treating the whole case collection as one body, build a collection-wide language model and use it to correct the unigram model of each individual case; 7) perform case search with the corrected language models. The invention realizes information retrieval based on language models: an N-gram language model is built for every case, and the probability that a model generates the query text is used to rank the search results.

Description

Medical case retrieval method based on a language model
Technical field
The present invention relates to the field of information retrieval, and in particular to a medical case retrieval method based on a language model.
Background technique
A language model is a model that generates text according to probabilities. Given a sentence, i.e. a sequence of words, a language model yields the probability of that sequence, p(w1, ..., wn). Language models have many application scenarios, such as speech recognition, machine translation, part-of-speech (POS) tagging, handwriting recognition, and information retrieval.
The n-gram model trains quickly and computes text-generation probabilities efficiently, so it is well suited to information retrieval. The simplest n-gram model is the unigram model. For a sentence, i.e. a word sequence w1, ..., wn, the probability p(w1, ..., wn) equals, by the chain rule, p(w1) × p(w2|w1) × ... × p(wn|w1, ..., wn-1). Under the simplest assumption, namely that w1, ..., wn are mutually independent, this reduces to p(w1) × p(w2) × ... × p(wn). The language model obtained from this independence assumption is exactly the unigram model. In information retrieval applications, the unigram model must be smoothed to prevent the case P(term) = 0.
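As a minimal sketch of the unigram model described above (function and variable names are illustrative, not from the patent), the following shows both the maximum-likelihood estimate and the zero-probability problem that motivates smoothing:

```python
from collections import Counter

def train_unigram(tokens):
    """Maximum-likelihood unigram model: p(w) = count(w) / total tokens."""
    counts = Counter(tokens)
    total = len(tokens)
    return {w: c / total for w, c in counts.items()}

def sentence_prob(model, sentence_tokens):
    """p(w1..wn) = p(w1) * ... * p(wn) under the independence assumption."""
    p = 1.0
    for w in sentence_tokens:
        p *= model.get(w, 0.0)  # an unseen word contributes probability 0
    return p

doc = "the cat sat on the mat".split()
m = train_unigram(doc)
print(sentence_prob(m, ["the", "cat"]))  # (2/6) * (1/6)
print(sentence_prob(m, ["the", "dog"]))  # 0.0: unseen "dog" zeroes the whole product
```

The second call illustrates why P(term) = 0 is fatal for retrieval: one unseen query word makes the entire document score zero, regardless of how well the other words match.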
In practice, training or test data will contain words absent from the model dictionary. This is very common, because the dictionary cannot, and need not, include every word. In an n-gram model a word is represented as a vector whose dimension equals the dictionary size; if the dictionary is very large, the word vectors are very high-dimensional, the model needs more computation during training, and the training time grows, without any improvement to the model. Moreover, a rare word may appear only once or twice in the whole training set, and a probability estimate from one or two occurrences is grossly inaccurate and usually overfit. For these reasons, n-gram models generally require smoothing of the conditional word probabilities.
Summary of the invention
Most common information retrieval frameworks today are based on TF-IDF, which is in essence an optimized form of keyword matching and can only retrieve by keyword, whereas traditional Chinese medicine (TCM) cases have distinctive linguistic characteristics. TCM has a long history, so correspondingly the cases span a wide time range and include both classical and modern Chinese; the meaning of the same keyword can differ greatly between classical and modern Chinese, and keyword-based search performs poorly. To address this linguistic characteristic of case data, the present invention realizes information retrieval based on language models: an N-gram language model is built for every case, and the probability that a model generates the query text is used to rank the search results.
In information retrieval, a user usually constructs a query from words likely to appear in the articles of interest. Language-model-based retrieval builds on this idea: if a query is likely to be generated from an article, then that article is probably relevant to the query. The usual procedure is, for every article d, to train one language model Md, and then rank articles by the probability that their models generate the query.
To achieve the above object, the present invention adopts the following technical scheme:
A medical case retrieval method based on a language model comprises the following steps:
1) extract structured individual cases from case books by OCR and text structuring;
2) preprocess all cases with a Chinese word segmentation tool, including segmentation and stop-word removal, and build a dictionary;
3) compute the unigram language model of each case by maximum likelihood estimation;
4) over all cases, count the number N_tf of words at each word-frequency level, where the subscript tf denotes the word-frequency level, and fit a curve to the counted data with the formula

N_tf = (1 - θ)^(log tf) × θ

to obtain the curve parameter θ;
5) compute E(N_tf) from the curve fitted in step 4), then directly estimate the ratio E(N_(tf+1)) / E(N_tf); smooth the unigram language model of each case with Good-Turing estimation:

tf* = (tf + 1) × E(N_(tf+1)) / E(N_tf)

where tf* is the smoothed word-frequency level;
6) treating all cases together as the training text, repeat steps 2) to 5) with the whole case collection as one body to build a collection-wide language model, and use it to correct the unigram language model of each individual case by weighted addition:

P_sum(t|d) = ω × P_document(t|d) + (1 - ω) × P_corpus(t)

where P_sum(t|d) is the corrected unigram language model of an individual case, ω is the weight, P_document(t|d) is the smoothed unigram language model of the case, and P_corpus(t) is the language model of the entire case collection;
7) perform case search with the corrected language models. Specifically, the search of step 7) is realized with the language models as follows:
7.1) preprocess the query text, keeping the preprocessing identical to that of step 2), including whether stop words are removed and whether low-frequency words are filtered;
7.2) successively compute the probability that each case generates the query text, sort the cases by this generation probability, and return the top several cases as the result.
Further, the probability that each case generates the query text can be computed as a log probability:

log p(q|M_d) = Σ_i log p(w_i|M_d)

where M_d is the model of case d, log p(q|M_d) is the log probability that case d generates the query text q, and p(w_i|M_d) is the probability that case d generates the i-th word of the query text; the product over all query tokens in the original formula becomes a sum of logs.
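The ranking of step 7) can be sketched as follows; the per-case models and the query tokens are toy values, and all names are illustrative assumptions:

```python
import math

def log_prob(model, query_tokens):
    """log p(q|M_d) = sum_i log p(w_i|M_d); -inf if any token has zero probability."""
    total = 0.0
    for w in query_tokens:
        p = model.get(w, 0.0)
        if p == 0.0:
            return float("-inf")
        total += math.log(p)
    return total

def rank_cases(models, query_tokens, k=3):
    """Sort case ids by the log probability of generating the query; return the top k."""
    ranked = sorted(models, key=lambda d: log_prob(models[d], query_tokens), reverse=True)
    return ranked[:k]

# Toy smoothed unigram models for two cases
models = {
    "case1": {"fever": 0.4, "cough": 0.1, "night": 0.5},
    "case2": {"fever": 0.1, "rash": 0.9},
}
print(rank_cases(models, ["fever"], k=2))  # ['case1', 'case2']
```

In the real method the models would already be smoothed and interpolated, so `log_prob` never actually hits a zero probability for in-vocabulary words.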
The Chinese word segmentation tool includes IKAnalyzer for Java and Jieba for Python.
Compared with the prior art, the present invention has the following beneficial effects:
1) The relevance of case search results is improved.
2) The language model can be used to predict the user's query input, simplifying user operation.
Detailed description of the invention
Fig. 1 is the overall flow of language model construction.
Fig. 2 is the implementation logic of language-model-based search.
Fig. 3 compares the fitted N_tf curve of the case collection's unigram model with the real data.
Fig. 4 shows part of the language model of one case.
Specific embodiment
The present invention is described in further detail below with reference to the drawings and a specific embodiment.
Fig. 1 shows the process of building the language models, corresponding to steps 1) to 6); Fig. 2 shows the implementation logic of language-model-based search, corresponding to step 7).
A medical case retrieval method based on a language model comprises the following steps:
1) extract structured individual cases from case books by OCR and text structuring;
2) preprocess all cases with an open-source Chinese word segmentation tool, such as IKAnalyzer for Java or Jieba for Python, including segmentation and stop-word removal, and build a dictionary;
3) compute the unigram language model of each case by maximum likelihood estimation;
4) over all cases, count the number N_tf of words at each word-frequency level, where the subscript tf denotes the word-frequency level, and fit a curve to the counted data with the formula

N_tf = (1 - θ)^(log tf) × θ

to obtain the curve parameter θ;
5) compute E(N_tf) from the curve fitted in step 4), then directly estimate the ratio E(N_(tf+1)) / E(N_tf); smooth the unigram language model of each case with Good-Turing estimation:

tf* = (tf + 1) × E(N_(tf+1)) / E(N_tf)

where tf* is the smoothed word-frequency level;
6) treating all cases together as the training text, repeat steps 2) to 5) with the whole case collection as one body to build a language model of the entire case collection, and use it to correct the unigram language model of each individual case by weighted addition:

P_sum(t|d) = ω × P_document(t|d) + (1 - ω) × P_corpus(t)

where P_sum(t|d) is the corrected unigram language model of an individual case, ω is the weight, P_document(t|d) is the smoothed unigram language model of the case, and P_corpus(t) is the language model of the entire case collection;
7) perform case search with the corrected language models, specifically:
7.1) preprocess the query text, keeping the preprocessing identical to that of step 2), including whether stop words are removed and whether low-frequency words are filtered;
7.2) successively compute, for each case, the log probability of generating the query text:

log p(q|M_d) = Σ_i log p(w_i|M_d)

where log p(q|M_d) is the log probability that case d generates the query text q, and p(w_i|M_d) is the probability that case d generates the i-th word of the query text.
Sort the cases by their generation probability and return the top several cases as the result.
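The frequency-of-frequencies counting in step 4) above can be sketched as follows (a minimal illustration over toy token lists; names are assumptions):

```python
from collections import Counter

def freq_of_freqs(case_token_lists):
    """N_tf: the number of distinct words that occur exactly tf times
    across the whole case collection."""
    word_counts = Counter()
    for tokens in case_token_lists:
        word_counts.update(tokens)
    # Count how many words share each frequency level tf
    return Counter(word_counts.values())

cases = [["a", "b", "a"], ["b", "c"]]
n_tf = freq_of_freqs(cases)
print(n_tf)  # 'a' and 'b' each occur twice (N_2 = 2), 'c' once (N_1 = 1)
```

These N_tf counts are exactly the data points to which the curve of step 4) is fitted.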
The above method is further described below with an embodiment; steps omitted from the embodiment are implemented according to the method above.
Embodiment
Case title: curing a child's hemiplegia by nourishing yin, clearing heat, dissolving stasis, and inducing resuscitation. | Patient: surname Wang, female, two and a half years old. | First visit: June 16, 1983. Chief complaint and medical history (as told by her parents): afternoon fever, impaired movement of the right limbs, and exotropia of the left eye for 40 days. The illness began in January 1983 with a week of continuous high fever, body temperature 39-40 °C, poor appetite, and emaciation; a chest X-ray at a hospital led to a diagnosis of right bronchial tuberculosis. She was hospitalized and treated with streptomycin, rimifon, and the like. After two months the body temperature gradually returned to normal and a follow-up chest X-ray showed improvement. In May 1983, however, high fever recurred, with body temperature above 40 °C, accompanied by drowsiness, projectile vomiting, clouded consciousness, and bouts of convulsions. Examination found: pupils equal and round, sluggish light reflex, neck stiffness, positive Kernig's sign, and positive right Babinski sign. Laboratory tests: white blood cell count 11200/mm3 (neutrophils 30%, lymphocytes 69%, eosinophils 1%); lumbar puncture showed raised cerebrospinal fluid pressure, cell count 1150/mm3, leukocytes 32%, trace protein, and a weakly positive five-tube sugar test. The diagnosis was tuberculous meningitis. After two days of emergency treatment her consciousness recovered and the temperature fell from its earlier level, but the right limbs were found weak and flaccid: the right lower limb dragged when striding and walking required assistance, the right upper limb could not be raised fully, and the right hand stayed clenched and was difficult to open. The left eye remained exotropic, with dulled spirit, slurred speech, restless sleep at night, and night sweats. Although antituberculosis drugs, "Maitong liquid", and other treatments were continued, the afternoon body temperature still persisted at 37-38 °C for more than 40 days.
All cases are preprocessed with the open-source Chinese word segmentation tool, including segmentation and stop-word removal, and a dictionary is built. The segmentation result is as follows:
Case | title | enriching yin | heat-clearing | dissolving stasis | have one's ideas straightened out | method | cure | child | hemiplegia | patient | Mr. Wang | female | 2 | year | half | first visit | 1983 | year | 6 | the moon | 16 | day |.| main suit | and | medical history | (| his father | in generation, tells |) | it is afternoon | fever |, | right side | limb Body | activity | obstacle | left eye | outer | strabismus | 40 | day | disease | start from | 1983 | year | 1 | the moon |, | continue | high fever | 1 | week | body temperature | 39 | 40 | DEG C | it is indigestion and loss of appetite | thin | through certain | hospital | shine | rabat | examine as | right side | bronchus | scrofula | be hospitalized | receive | strepto- Element | thunder rice | envelope | etc. | treatment | 2 | a | the moon | rear | body temperature | gradually | restore normal | rabat | check | improve | but in | 1983 | year | 5 | the moon | again | and high fever | body temperature | 40 | DEG C | more than | with | it is drowsiness | sprayability | vomiting | clouded in mind | it is a burst of | twitch | through looking into | Pupil | etc. | it is big | etc. 
| circle | it is right | light reflection | it is blunt | Xiang Qiang | gram | Ni Ge | sign | it is positive | right side | Babin this | base sign | it is positive | change Test | leucocyte | number | waist | wear | cerebrospinal fluid | pressure | increase | chemical examination | cerebrospinal fluid | cell | number | leucocyte | 32 | albumen | micro | Sugar five is managed | test | weakly positive | examine as | tubercular meningitis | rescue | treatment | two days | mind | revival | body temperature | earlier above | decline | But | discovery | right side | limbs | weak and limp | it is powerless | right | lower limb | stride | dilatory | walking | need people | assist in | it is right | upper limb | upper to lift | by Though limit | the right hand | be in | clench fist | shape is difficult | stretch | left eye | outward | strabismus | spirit | blunt | speech | unclear | night sleeps | uneasy | night sweat | | Still | continue | with | treating tuberculosis | drug | and | arteries and veins | logical liquid | etc. | treatment | but | it is daily | afternoon | body temperature | still | continue | | 37 |~| 38 | DEG C | | 40 | Yu
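The preprocessing that produced the segmented output above can be sketched as follows. This stand-in tokenizer splits on whitespace for illustration only; the patent's pipeline would call a real Chinese segmenter such as `jieba.cut` or IKAnalyzer instead, and the stop-word set here is an assumption:

```python
def preprocess(text, stopwords):
    """Tokenize and drop stop words.
    Stand-in tokenizer: splits on whitespace; a real Chinese pipeline
    would use e.g. jieba.cut(text) in place of str.split()."""
    return [w for w in text.split() if w not in stopwords]

stop = {"the", "of", "and"}
print(preprocess("fever of the right limb and left eye", stop))
# ['fever', 'right', 'limb', 'left', 'eye']
```

The same `preprocess` function must be applied to both the case texts and, later, the query text, so that the dictionaries match (the consistency requirement of step 7.1).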
(3) The unigram language model of each case is computed by maximum likelihood estimation. Two main problems must be solved. First, zero frequencies: any word that does not occur in a case receives a maximum-likelihood probability of zero, which defeats the search function. Second, the statistics of a single case are too sparse, so the language model of a single case must be corrected with the language model of the whole case collection.
(4) The number N_tf of words at each word-frequency level is counted over the entire case collection, where tf denotes the word-frequency level; for example, N_1 is the number of words whose frequency in the cases is 1. A curve is fitted to the counted data:

N_tf = (1 - θ)^(log tf) × θ

yielding the curve parameter θ. The fitted N_tf curve is shown in Fig. 3.
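A sketch of fitting θ by grid search is given below. The functional form, its normalization by the total number of distinct words, and the synthetic N_tf data are all assumptions made for illustration, since the patent does not spell out the fitting procedure:

```python
import math

def fit_theta(n_tf, thetas=None):
    """Grid-search theta for the assumed fitted form
    N_tf ~ T * (1 - theta)^(log tf) * theta,
    where T = total number of distinct words (sum of the observed N_tf)."""
    if thetas is None:
        thetas = [i / 1000 for i in range(1, 1000)]
    total = sum(n_tf.values())

    def sse(theta):
        # Sum of squared errors between observed and predicted counts
        return sum((n - total * (1 - theta) ** math.log(tf) * theta) ** 2
                   for tf, n in n_tf.items())

    return min(thetas, key=sse)

# Synthetic counts roughly generated from theta = 0.5
n_tf = {1: 500, 2: 309, 3: 234}
print(fit_theta(n_tf))  # a value near 0.5 for this synthetic data
```

In the real method the fitted curve replaces the raw, noisy N_tf counts wherever E(N_tf) is needed in step (5).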
(5) The language model of each case is smoothed with Good-Turing estimation:

tf* = (tf + 1) × E(N_(tf+1)) / E(N_tf)

Because the statistics of a single case are too sparse, the maximum-likelihood estimates E(N_tf) are inaccurate, so the curve fitted in step (4) is used instead. The ratio E(N_(tf+1)) / E(N_tf) can be computed directly from the fitted curve of step (4), without estimating each E(N_tf) separately.
(6) A single case contains a large number of unseen words, and even the words that do occur may be distributed atypically, so its model is not stable enough. Using the same Good-Turing estimation, a language model is built for the entire case collection; since the sample size is much larger, the collection model is comparatively stable, and it also helps to distinguish among words that did not occur in a given case. There are usually two ways to combine different language models: weighted addition (also called interpolation) and weighted multiplication. The present invention uses weighted addition, whose advantage is that the combined probabilities remain normalized, i.e. the probabilities of all words still sum to 1. Part of the language model obtained for the example case through steps (3), (4), and (5) is shown in Fig. 4.
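The weighted addition of step (6) can be sketched as follows; the toy distributions and ω value are illustrative assumptions. The check at the end shows the claimed advantage: the combined probabilities still sum to 1:

```python
def interpolate(p_doc, p_corpus, omega=0.7):
    """Weighted addition: P_sum(t|d) = omega * P_document(t|d) + (1 - omega) * P_corpus(t).
    Every word of the corpus vocabulary gets nonzero probability, and the
    result remains a normalized distribution."""
    vocab = set(p_doc) | set(p_corpus)
    return {t: omega * p_doc.get(t, 0.0) + (1 - omega) * p_corpus.get(t, 0.0)
            for t in vocab}

p_doc = {"fever": 0.6, "cough": 0.4}
p_corpus = {"fever": 0.3, "cough": 0.2, "rash": 0.5}
p = interpolate(p_doc, p_corpus)
print(sum(p.values()))  # 1.0 (up to float rounding): still a distribution
print(p["rash"])        # nonzero even though "rash" never occurs in the case
```

This is why an unseen-in-the-case query word no longer zeroes the whole generation probability: the corpus term (1 - ω) × P_corpus(t) keeps it positive.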
(7) Case search is realized with the language models. The text-generation probability is computed as follows: 1. preprocess the text, identically to the preprocessing used when the N-gram models were built, including whether stop words are removed and whether low-frequency words are filtered; 2. successively compute the probability that each case generates the text. Note that what is actually computed is the log probability, in order to prevent floating-point underflow.
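The underflow concern at the end of step (7) is easy to demonstrate: multiplying many small per-word probabilities collapses to 0.0 in double precision, while the equivalent sum of logs stays finite. The probability values here are illustrative:

```python
import math

# A query of 80 tokens, each with probability 1e-5 under some case model.
probs = [1e-5] * 80

direct = 1.0
for p in probs:
    direct *= p          # true value is 1e-400, far below the float64 minimum

log_sum = sum(math.log(p) for p in probs)

print(direct)   # 0.0: the product underflowed
print(log_sum)  # 80 * log(1e-5), a perfectly ordinary finite number
```

Since log is monotonic, ranking cases by log probability gives exactly the same order as ranking by probability, so nothing is lost by the switch.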

Claims (4)

1. A medical case retrieval method based on a language model, characterized by comprising the following steps:
1) extracting structured individual cases from case books by OCR and text structuring;
2) preprocessing all cases with a Chinese word segmentation tool, including segmentation and stop-word removal, and building a dictionary;
3) computing the unigram language model of each case by maximum likelihood estimation;
4) over all cases, counting the number N_tf of words at each word-frequency level, where the subscript tf denotes the word-frequency level, and fitting a curve to the counted data with the formula

N_tf = (1 - θ)^(log tf) × θ

to obtain the curve parameter θ;
5) computing E(N_tf) from the curve fitted in step 4), then directly estimating the ratio E(N_(tf+1)) / E(N_tf), and smoothing the unigram language model of each case with Good-Turing estimation:

tf* = (tf + 1) × E(N_(tf+1)) / E(N_tf)

where tf* is the smoothed word-frequency level and E(·) denotes the maximum likelihood estimate;
6) treating all cases together as the training text, repeating steps 2) to 5) with the whole case collection as one body to build a collection-wide language model, and correcting the unigram language model of each individual case by weighted addition:

P_sum(t|d) = ω × P_document(t|d) + (1 - ω) × P_corpus(t)

where P_sum(t|d) is the corrected unigram language model of an individual case, ω is the weight, P_document(t|d) is the smoothed unigram language model of the case, and P_corpus(t) is the language model of the entire case collection;
7) realizing case search with the corrected language models.
2. The medical case retrieval method based on a language model of claim 1, characterized in that step 7) specifically comprises:
7.1) preprocessing the query text, the preprocessing being kept identical to that of step 2), including whether stop words are removed and whether low-frequency words are filtered;
7.2) successively computing the probability that each case generates the query text, sorting the cases by their generation probability, and returning the top several cases as the result.
3. The medical case retrieval method based on a language model of claim 2, characterized in that the probability that each case generates the query text is a log probability, computed as

log p(q|M_d) = Σ_i log p(w_i|M_d)

where log p(q|M_d) is the log probability that case d generates the query text q, and p(w_i|M_d) is the probability that case d generates the i-th word of the query text.
4. The medical case retrieval method based on a language model of claim 1, characterized in that the Chinese word segmentation tool includes IKAnalyzer for Java and Jieba for Python.
CN201610154543.8A 2016-03-17 2016-03-17 Medical case retrieval method based on a language model Active CN105843868B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610154543.8A CN105843868B (en) 2016-03-17 2016-03-17 Medical case retrieval method based on a language model


Publications (2)

Publication Number Publication Date
CN105843868A CN105843868A (en) 2016-08-10
CN105843868B true CN105843868B (en) 2019-03-26

Family

ID=56587237

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610154543.8A Active CN105843868B (en) 2016-03-17 2016-03-17 Medical case retrieval method based on a language model

Country Status (1)

Country Link
CN (1) CN105843868B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106709520B (en) * 2016-12-23 2019-05-31 浙江大学 A kind of case classification method based on topic model
CN108172304B (en) * 2017-12-18 2021-04-02 广州七乐康药业连锁有限公司 Medical information visualization processing method and system based on user medical feedback
CN109192299A (en) * 2018-08-13 2019-01-11 中国科学院计算技术研究所 A kind of medical analysis auxiliary system based on convolutional neural networks
CN109299357B (en) * 2018-08-31 2022-04-12 昆明理工大学 Laos language text subject classification method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6052682A (en) * 1997-05-02 2000-04-18 Bbn Corporation Method of and apparatus for recognizing and labeling instances of name classes in textual environments
CN101470701A (en) * 2007-12-29 2009-07-01 日电(中国)有限公司 Text analyzer supporting semantic rule based on finite state machine and method thereof
US20130173610A1 (en) * 2011-12-29 2013-07-04 Microsoft Corporation Extracting Search-Focused Key N-Grams and/or Phrases for Relevance Rankings in Searches
CN103187052B (en) * 2011-12-29 2015-09-02 北京百度网讯科技有限公司 A kind of method and device setting up the language model being used for speech recognition
CN104376842A (en) * 2013-08-12 2015-02-25 清华大学 Neural network language model training method and device and voice recognition method

Also Published As

Publication number Publication date
CN105843868A (en) 2016-08-10


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant