CN108549639A

CN108549639A - Method and system for named entity recognition in traditional Chinese medicine (TCM) case records based on multi-feature template correction

Info

Publication number: CN108549639A
Application number: CN201810359240.9A
Authority: CN
Inventors: 袁锋; 陈阳; 陈守强; 赵丽丽; 梁科
Original assignee: Shandong Management University
Current assignee: Shandong Management University
Priority date: 2018-04-20
Filing date: 2018-04-20
Publication date: 2018-09-18

Abstract

The invention discloses a method and system for named entity recognition in traditional Chinese medicine (TCM) case records based on multi-feature template correction. The method comprises the steps of: extracting sentences from TCM case record text; classifying the extracted sentences; segmenting each class of sentence into characters; annotating each character obtained by segmentation, in turn, with a character feature, a part-of-speech feature, a left indicator word feature, a right indicator word feature and a term feature; building a training corpus; formulating feature templates; inputting the corpus and the feature templates into a conditional random field model and training it to obtain a trained conditional random field model; building a corpus to be predicted from the TCM case records to be predicted; and inputting that corpus into the trained conditional random field model, which outputs the TCM case record category labels and the character positions, from which the TCM four-diagnosis findings, syndrome types and treatment methods are finally identified.

Description

Method and system for named entity recognition in TCM case records based on multi-feature template correction

Technical field

The present invention relates to a method and system for named entity recognition in TCM case records based on multi-feature template correction.

Background art

A key step in the informatization of TCM case record text is to perform named entity recognition on massive volumes of case record text and to establish the associations among four-diagnosis text, syndrome-type text and treatment-method text, so as to interpret scientifically the principle of "treatment based on syndrome differentiation". Named entity recognition is widely applied in many fields, such as finance and economics, product identification, microblog text and military text.

Named entity recognition methods fall into two main classes:

The first class consists of rule-based and dictionary-based methods. Their shortcoming is a heavy dependence on the dictionary and rule base, which greatly reduces their ability to recognize out-of-vocabulary words;

The second class consists of machine learning methods. These are fast and efficient and port well to new domains. Common methods include the hidden Markov model (HMM), the maximum entropy Markov model (MEMM) and conditional random fields (CRFs). By comparison, CRFs perform best overall in ease of use, stability and accuracy, and they are better than HMM and MEMM with respect to the output independence assumption and the label bias problem, which those models find hard to avoid.

Summary of the invention

To overcome the deficiencies of the prior art, the present invention provides a method and system for named entity recognition in TCM case records based on multi-feature template correction.

As a first aspect of the present invention:

A method for named entity recognition in TCM case records based on multi-feature template correction comprises the following steps:

Step (1): Extract sentences from the TCM case record text;

Step (2): Classify the extracted sentences;

Step (3): Segment each class of sentence into characters;

Step (4): Annotate each character obtained by segmentation, in turn, with a character feature, a part-of-speech feature, a left indicator word feature, a right indicator word feature and a term feature. The character, part-of-speech, left indicator word, right indicator word and term features of all characters form the corpus observation sequence. Each character is also labeled according to the TCM case record category labels and the BIO tagging scheme, which produces the output feature of that character; the output features of all characters form the output feature sequence. The observation sequence and the output feature sequence together constitute the corpus;

Step (5): Formulate feature templates;

Step (6): Input the corpus obtained in step (4) and the feature templates obtained in step (5) into a conditional random field model and train it to obtain a trained conditional random field model;

Step (7): Process the TCM case records to be predicted by the same method as in steps (1) to (4) to build the corpus to be predicted; input that corpus into the trained conditional random field model, which outputs the TCM case record category labels and the character positions; finally, identify the TCM four-diagnosis findings, syndrome types and treatment methods from the category labels and character positions.

In the BIO tagging scheme, B denotes the first character of an entity, I denotes a non-first character of an entity, and O denotes a non-entity character.
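The final part of step (7), turning predicted labels and character positions back into entities, can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `decode_entities` and the tuple layout `(label, start, text)` are assumptions, and the tag strings follow the `ZS-B` / `ZS-I` / `O` form used in this document.

```python
def decode_entities(chars, tags):
    """Group characters into entities from tags such as 'ZS-B', 'ZS-I', 'O'.

    Returns a list of (label, start_index, text) tuples (hypothetical layout).
    """
    entities = []
    current = None  # [label, start_index, text] of the entity being built
    for i, (ch, tag) in enumerate(zip(chars, tags)):
        label, _, position = tag.partition("-")
        if position == "B":
            # A new entity starts; close the previous one if any.
            if current:
                entities.append(tuple(current))
            current = [label, i, ch]
        elif position == "I" and current and current[0] == label:
            # Continue the open entity of the same category.
            current[2] += ch
        else:
            # 'O', or an 'I' tag with no matching open entity.
            if current:
                entities.append(tuple(current))
            current = None
    if current:
        entities.append(tuple(current))
    return entities
```

Each returned tuple carries the category label (ZS, ZX or ZF), the position of the first character and the entity text, which is the information step (7) uses for the final identification.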

As a further improvement of the present invention, the TCM case record text is the record of the diagnosis and treatment activities of TCM experts.

Sentences are extracted using the punctuation marks in the TCM case record text as separators; the punctuation marks are the comma, the semicolon and the full stop.
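The separator-based sentence extraction of step (1) can be sketched as below. The function name and the decision to accept both full-width (Chinese) and ASCII forms of the three punctuation marks are assumptions made for illustration.

```python
import re

# Separators named in the text: comma, semicolon and full stop.
# Accepting both full-width and ASCII forms is an assumption.
SEPARATORS = "，；。,;."

def extract_sentences(case_text):
    """Split one case record into sentences at the separator punctuation."""
    pattern = "[" + re.escape(SEPARATORS) + "]"
    parts = re.split(pattern, case_text)
    # Drop empty fragments left by consecutive or trailing separators.
    return [part.strip() for part in parts if part.strip()]
```

Because the ASCII full stop is treated as a separator, decimal numbers would be split as well; in this pipeline numbers are removed later as meaningless characters, so the sketch does not special-case them.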

The sentences extracted in step (1) include four-diagnosis sentences, syndrome-type sentences and treatment-method sentences;

The TCM four-diagnosis findings are the patient's abnormal symptoms and abnormal signs that the doctor obtains through the four diagnostic methods; the syndrome type is the syndrome confirmed by the doctor; the treatment method is the treatment method determined by the doctor.

As a further improvement of the present invention, the sentences are classified according to their qualifier as current sentences, denied sentences or possible sentences;

Denied sentences and possible sentences are discarded, and current sentences are retained.

The qualifier is one of: current, denied or possible.

A current sentence describes a symptom or disease that is definitely occurring at present;

A possible sentence describes a symptom that may arise, or a diagnosis made before the doctor has confirmed it;

A denied sentence describes a disease or symptom that has definitely not occurred in the patient.
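A minimal sketch of the qualifier-based filtering in step (2) follows. The text names the three classes but does not list the cue words, so both cue lists below are hypothetical placeholders, as are the function names.

```python
# Hypothetical cue lists: the document does not enumerate the qualifiers.
DENIED_CUES = ("denies", "no history of", "without")
POSSIBLE_CUES = ("possible", "suspected", "cannot rule out")

def classify_sentence(sentence):
    """Return 'denied', 'possible' or 'current' for one extracted sentence."""
    lowered = sentence.lower()
    if any(cue in lowered for cue in DENIED_CUES):
        return "denied"
    if any(cue in lowered for cue in POSSIBLE_CUES):
        return "possible"
    # A sentence with no denial or possibility cue is treated as current.
    return "current"

def keep_current(sentences):
    """Discard denied and possible sentences, keeping only current ones."""
    return [s for s in sentences if classify_sentence(s) == "current"]
```

Treating a sentence as current when no cue matches is a design choice of this sketch; a real cue lexicon would need to be built from the case records themselves.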

As a further improvement of the present invention, meaningless characters are first removed and miswritten characters are corrected; each class of sentence is then cut into single characters using a character-based segmentation method. The meaningless characters include numbers, units and punctuation marks.
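The cleaning and character-based segmentation of step (3) can be sketched as follows. The unit list is a hypothetical placeholder (the text does not list the units), and a unit is only removed when it directly follows a number, an assumption made so that the same character used in an ordinary term is kept.

```python
import re
import unicodedata

# Hypothetical unit list; the document does not enumerate the units.
UNITS = ("毫升", "毫克", "克")

# A number, optionally followed by one of the units.
NUMBER_WITH_UNIT = re.compile(r"\d+(?:\.\d+)?\s*(?:" + "|".join(UNITS) + r")?")

def clean_and_segment(sentence):
    """Remove numbers, units and punctuation, then split into single characters."""
    text = NUMBER_WITH_UNIT.sub("", sentence)
    chars = []
    for ch in text:
        if ch.isspace():
            continue
        # Unicode categories starting with 'P' are punctuation marks.
        if unicodedata.category(ch).startswith("P"):
            continue
        chars.append(ch)
    return chars
```

Correction of miswritten characters is not covered here, because it needs a substitution table that the text does not provide.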

As a further improvement of the present invention, the character feature is the character itself;

The part-of-speech feature includes: verb, noun, adjective, adverb and preposition;

The left indicator word feature concerns characters that appear to the left of a named entity: if the current character is one that appears to the left of a named entity, its left indicator word feature is T; otherwise it is F;

The right indicator word feature concerns characters that appear to the right of a named entity: if the current character is one that appears to the right of a named entity, its right indicator word feature is T; otherwise it is F;

The term feature concerns characters that describe human organs: if the current character denotes a human organ, its term feature is T; otherwise it is F;
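The five observation features of step (4) can be assembled into one row per character, as in the sketch below. The three lexicons are illustrative placeholders, not the ones used in the invention, and part-of-speech tags are taken as input because the text does not name a tagger.

```python
# Illustrative placeholder lexicons (assumptions, not taken from the source).
LEFT_INDICATORS = {"伴"}
RIGHT_INDICATORS = {"致"}
ORGAN_TERMS = {"头", "舌", "心"}

def flag(condition):
    """Encode a boolean feature as 'T' or 'F', as in the text."""
    return "T" if condition else "F"

def annotate(chars, pos_tags):
    """Build (W, P, L, R, Y) rows: character, part of speech, left indicator,
    right indicator and term feature for each character."""
    if len(chars) != len(pos_tags):
        raise ValueError("chars and pos_tags must have the same length")
    return [
        (ch, pos,
         flag(ch in LEFT_INDICATORS),
         flag(ch in RIGHT_INDICATORS),
         flag(ch in ORGAN_TERMS))
        for ch, pos in zip(chars, pos_tags)
    ]
```

The list of rows produced for one sentence corresponds to the corpus observation sequence described in step (4).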

As a further improvement of the present invention, each character is labeled according to the TCM case record category labels and the BIO tagging scheme as follows. The category labels are: TCM four-diagnosis findings (ZS), syndrome type (ZX) and treatment method (ZF);

If the current character belongs to a four-diagnosis finding, its category label is ZS; if a character labeled ZS is the first character of the entity, its output feature is ZS-B; if it is a non-first character of the entity, its output feature is ZS-I;

If the current character belongs to a syndrome type, its category label is ZX; if a character labeled ZX is the first character of the entity, its output feature is ZX-B; if it is a non-first character of the entity, its output feature is ZX-I;

If the current character belongs to a treatment method, its category label is ZF; if a character labeled ZF is the first character of the entity, its output feature is ZF-B; if it is a non-first character of the entity, its output feature is ZF-I;

If the current character belongs to none of ZS, ZX and ZF, its label is O.
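The labeling rules above can be sketched as a function that turns annotated entity spans into the output feature sequence. The span representation `(start, end_exclusive, label)` and the function name are assumptions for illustration.

```python
VALID_LABELS = {"ZS", "ZX", "ZF"}

def bio_tags(n_chars, entities):
    """Produce the output feature of every character in a sentence.

    entities: list of (start, end_exclusive, label) spans (hypothetical layout),
    with label in {'ZS', 'ZX', 'ZF'}. Characters outside any span get 'O'.
    """
    tags = ["O"] * n_chars
    for start, end, label in entities:
        if label not in VALID_LABELS:
            raise ValueError("unknown category label: " + label)
        tags[start] = label + "-B"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = label + "-I"          # non-first characters
    return tags
```

For a five-character sentence with a ZS entity on the first two characters and a ZF entity on the last two, the function returns `ZS-B, ZS-I, O, ZF-B, ZF-I`.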

Step (5) is carried out as follows:

All characters obtained by segmentation are arranged into a sequence. For each character, contextual features are extracted within a context window [-2, 2] of size 5, and each feature is written in the "letter + number" format, where the character is denoted by "W", the part of speech by "P", the left indicator word by "L", the right indicator word by "R" and the TCM term feature by "Y". Nineteen common feature marks are defined according to the context:

W_-2 denotes the second preceding character; W_-1 the first preceding character; W_0 the current character; W_1 the first following character; W_2 the second following character;

P_-2 denotes the part of speech of the second preceding character; P_-1 that of the first preceding character; P_0 that of the current character; P_1 that of the first following character; P_2 that of the second following character;

L_-2 denotes the left indicator of the second preceding character; L_-1 the left indicator of the first preceding character;

R_1 denotes the right indicator of the first following character; R_2 the right indicator of the second following character;

Y_-2 denotes the term feature of the second preceding character; Y_-1 that of the first preceding character; Y_0 that of the current character; Y_1 that of the first following character; Y_2 that of the second following character;

Based on the common feature marks, the feature templates are formulated as follows:

W_-2, W_-1, W_0, W_1, W_2, W_-1/W_0, W_0/W_1, W_-2/W_0, W_0/W_2, P_-2, P_-1, P_0, P_1, P_2, P_-1/P_0, P_0/P_1, P_-2/P_0, P_0/P_2, L_-2/W_0, L_-1/W_0, W_0/R_1, W_0/R_2, Y_-2/W_0, Y_-1/W_0, W_0/Y_0, W_0/Y_1, W_0/Y_2; where "/" is the separator between the marks combined in one template.
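The template list above can be rendered into a template file for a CRF toolkit. The sketch below assumes the CRF++ unigram template syntax (`U00:%x[row,col]`) and a column order of W, P, L, R, Y; neither the toolkit nor the column order is stated in the text.

```python
# Assumed column order of the observation features in the corpus file.
COLUMN = {"W": 0, "P": 1, "L": 2, "R": 3, "Y": 4}

TEMPLATES = [
    "W_-2", "W_-1", "W_0", "W_1", "W_2",
    "W_-1/W_0", "W_0/W_1", "W_-2/W_0", "W_0/W_2",
    "P_-2", "P_-1", "P_0", "P_1", "P_2",
    "P_-1/P_0", "P_0/P_1", "P_-2/P_0", "P_0/P_2",
    "L_-2/W_0", "L_-1/W_0", "W_0/R_1", "W_0/R_2",
    "Y_-2/W_0", "Y_-1/W_0", "W_0/Y_0", "W_0/Y_1", "W_0/Y_2",
]

def to_macro(mark):
    """Convert a mark such as 'W_-2' into the macro '%x[-2,0]'."""
    letter, offset = mark.split("_")
    return "%x[{},{}]".format(int(offset), COLUMN[letter])

def render_templates(templates):
    """Return one CRF++-style unigram template line per entry."""
    lines = []
    for index, template in enumerate(templates):
        macros = "/".join(to_macro(mark) for mark in template.split("/"))
        lines.append("U{:02d}:{}".format(index, macros))
    return lines
```

Writing `"\n".join(render_templates(TEMPLATES))` to a file gives 27 template lines, one for each entry of the list in the text.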

As a further improvement of the present invention, step (6) is carried out as follows:

Step (6.1): The corpus obtained in step (4) is used as the training corpus;

Step (6.2): The character features in the training corpus are represented as the observation sequence x and the output feature sequence is represented as the output sequence y; each input-output combination (x, y) is saved in the training sample set;

Step (6.3): The conditional random field model is trained on the training sample set;

Step (6.4): The conditional probability of each combination (x, y) is estimated repeatedly until convergence, which yields the trained conditional random field model.
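Steps (6.1) and (6.2) amount to splitting each annotated sentence into an observation sequence x and an output sequence y. The sketch below also serializes the pairs into a tab-separated column file with a blank line between sentences, the layout read by CRF toolkits such as CRF++; the use of that layout and both function names are assumptions.

```python
def build_training_set(corpus):
    """Split annotated sentences into (x, y) pairs.

    corpus: list of sentences; each sentence is a list of rows
    (W, P, L, R, Y, tag), where tag is the output feature such as 'ZS-B'.
    """
    samples = []
    for sentence in corpus:
        x = [tuple(row[:-1]) for row in sentence]   # observation sequence
        y = [row[-1] for row in sentence]           # output sequence
        samples.append((x, y))
    return samples

def to_column_format(samples):
    """Serialize (x, y) pairs: one character per line, tag in the last column,
    and an empty line between sentences."""
    blocks = []
    for x, y in samples:
        lines = ["\t".join(features + (tag,)) for features, tag in zip(x, y)]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"
```

The same serialization, without the tag column, can be used for the corpus to be predicted in step (7).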

As a second aspect of the present invention:

A system for named entity recognition in TCM case records based on multi-feature template correction comprises a memory, a processor, and computer instructions stored in the memory and run on the processor; when the computer instructions are run by the processor, the steps of any of the above methods are carried out.

As a third aspect of the present invention:

A computer-readable storage medium stores computer instructions; when the computer instructions are run by a processor, the steps of any of the above methods are carried out.

Compared with the prior art, the beneficial effects of the present invention are as follows:

The present invention applies conditional random fields and proposes a named entity recognition method for TCM case records based on multi-feature template correction. In view of the characteristics of TCM case record text, it proposes an annotation method covering character, part-of-speech, left and right indicator word, and term features, trains a CRFs model on the annotated data, and recognizes four-diagnosis, syndrome-type and treatment-method entities. Experiments verify that after the left and right indicator word features and the term feature are added, precision, recall and F-measure all improve considerably. With the continued accumulation of case records, more reasonable parameter settings and better-chosen feature values, the method can provide a more valuable reference and basis for building the "four diagnoses - syndrome type - treatment method" triple correspondence and for the scientific interpretation of diagnosis and treatment.

Brief description of the drawings

The accompanying drawings, which form a part of this application, are provided for further understanding of the application; the illustrative embodiments of the application and their description are used to explain the application and do not unduly limit it.

Fig. 1 is the flow chart of the present invention;

Fig. 2 is the character frequency chart.

Detailed description

It should be noted that the following detailed description is illustrative and is intended to provide further explanation of the application. Unless otherwise indicated, all technical and scientific terms used herein have the same meaning as commonly understood by a person of ordinary skill in the technical field of the application.

It should also be noted that the terminology used herein is only for describing specific embodiments and is not intended to limit the illustrative embodiments of the application. As used herein, unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well. It should further be understood that the terms "comprising" and/or "including", when used in this specification, indicate the presence of features, steps, operations, devices, components and/or combinations thereof.

A TCM case record is the record of the diagnosis and treatment activities of TCM experts, and performing named entity recognition on it is of great significance for research on the standardization and informatization of TCM case records. Because TCM case record text is fuzzy in expression and unclear in reference, the present invention proposes, based on conditional random fields, a named entity recognition method based on multi-feature template correction. Sentences are first extracted from the case record text and segmented automatically; the segmented corpus is then annotated with character, part-of-speech, left and right indicator word, and term features; finally, a CRFs model is trained on the annotated data to recognize four-diagnosis, syndrome-type and treatment-method entities and to build the "four diagnoses - syndrome type - treatment method" triple correspondence, which provides a reference and basis for the scientific interpretation of diagnosis and treatment. The data source is 12,000 TCM case records selected from cardiovascular outpatient experts at the Second Affiliated Hospital of Shandong University of Traditional Chinese Medicine between June 2014 and December 2016. Recognition accuracy is further improved by adjusting the combination of features and the context window size. The average precision, recall and F-measure reach 90.68%, 90.45% and 90.56% respectively.

For the 12,000 cardiovascular TCM case records, the invention analyzes their text features, part-of-speech features and label features, defines templates and trains a CRFs model, establishes the associations between text features and named entity categories and character positions, and, by extraction, mines the associations among four-diagnosis text, syndrome-type text and treatment-method text. This explains the principle of "treatment based on syndrome differentiation" and provides a scientific basis for passing on experience and acquiring knowledge.

Table 1. Example case record text

Sentences are split using the punctuation marks that occur in the case record text as separators. The 12,000 case records contain 180,635 sentences in total; after numbers and meaningless characters are removed, they cover 1,267 distinct characters. The most frequent character is "heart", with 37,236 occurrences and a frequency of 310.30% (occurrences relative to the 12,000 case records), followed by "chest", with 20,701 occurrences and a frequency of 172.51%. There are 727 characters with a frequency below 1%, of which 50 characters, such as "boiling", "lance", "group", "curling up" and "section", are used only once. The character frequency chart is shown in Fig. 2; 28 characters have a frequency above 40%, and the character frequency table is given in Table 2.

Table 2. Character frequency table
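The character statistics can be reproduced with a simple counter. The reported percentages are consistent with dividing the occurrence count by the number of case records (37,236 / 12,000 = 310.30%), so the sketch below uses that definition; the function name and the input layout are assumptions.

```python
from collections import Counter

def char_frequencies(cases):
    """Count characters over all case records.

    cases: list of case records, each already cleaned and segmented into a
    list of single characters. Returns {char: (count, percent_per_case)}.
    """
    counts = Counter(ch for case in cases for ch in case)
    n_cases = len(cases)
    return {ch: (count, 100.0 * count / n_cases) for ch, count in counts.items()}
```

With this definition a frequency above 100% simply means that the character occurs more than once per case record on average.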

TCM case record text is natural language, but TCM is a discipline in which many disciplines interpenetrate. Compared with the language of other natural sciences, TCM case record text is also characterized by fuzziness, metaphor, classical Chinese style and fixed patterns.

(1) Fuzziness. A deep red tongue, for example, is explained as a tongue manifestation that develops from a red tongue and is one degree deeper than red. However, the boundary between red and deep red cannot be quantified, so the finding is usually written as "tongue red" or "tongue deep red", or simply recorded loosely as "deep red tongue".

(2) Metaphor. For example, "exuberant wood restraining earth", also called "liver qi invading the spleen", is a TCM syndrome caused by liver qi invading the spleen. Similarly, "harmonizing the liver and spleen" is the treatment method set up specifically for "exuberant wood restraining earth", that is, treating the "liver qi invading the spleen" syndrome by soothing the liver and strengthening the spleen.

(3) Classical Chinese style. For example, "releasing the exterior to expel pathogens" means using medicinals to drive pathogenic factors out through the body surface, and "reinforcing earth to generate metal" means tonifying spleen qi in order to nourish lung qi according to the five-phase theory of mutual generation.

(4) Fixed patterns. Descriptions of the tongue body generally cover tongue color and texture and take the form "tongue ××" or "tongue body ××", such as "tongue red", "pale red tongue", "tongue dark", "tongue body pale" and "tongue body pale and enlarged". Descriptions of the tongue coating cover its color, moistness and thickness.

As one embodiment of the present invention:

As shown in Fig. 1, the method for named entity recognition in TCM case records based on multi-feature template correction comprises the following steps:

Step (1): Extract sentences from the TCM case record text;

Further, the TCM case record text is the record of the diagnosis and treatment activities of TCM experts.

Step (2): Classify the extracted sentences;

Further, the sentences are classified according to their qualifier as current sentences, denied sentences or possible sentences; denied sentences and possible sentences are discarded, and current sentences are retained.

The qualifier is one of: current, denied or possible.

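Step (3): Segment each class of sentence into characters;

Further, meaningless characters are first removed and miswritten characters are corrected, for example a miswritten form of "blood stasis" is corrected to the standard form; each class of sentence is then cut into single characters using a character-based segmentation method. The meaningless characters include numbers, units and punctuation marks.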

Step (4): Annotate each character obtained by segmentation, in turn, with a character feature, a part-of-speech feature, a left indicator word feature, a right indicator word feature and a term feature. These features are salient and lend themselves to automatic annotation. The character, part-of-speech, left indicator word, right indicator word and term features of all characters form the corpus observation sequence. Each character is also labeled according to the TCM case record category labels and the BIO tagging scheme, which produces the output feature of that character; the output features of all characters form the output feature sequence. The observation sequence and the output feature sequence together constitute the corpus. The entities are TCM four-diagnosis findings (ZS), syndrome types (ZX) and treatment methods (ZF);

Further, each character is labeled according to the TCM case record category labels and the BIO tagging scheme; the category labels are TCM four-diagnosis findings (ZS), syndrome type (ZX) and treatment method (ZF);

(1) Character feature (W)

Character-based segmentation is applied to the case record text. For example, the four-character phrase "cough, runny nose" is split into four single characters, glossed here as "cough / cough / flow / nasal-discharge".

(2) Part-of-speech feature (P)

Parts of speech are divided into verb, noun, adjective, adverb, preposition and so on. An example of the original case record corpus is as follows: Half a month ago the patient caught wind-cold, with cough, runny nose, and white, sticky phlegm, followed by edema that began at the dorsum of the foot and spread to the waist and abdomen. Chest tightness, abdominal distension, poor appetite. Scanty urine, undigested food in the stool. Recently the condition has worsened, with sweating, aversion to cold, cold limbs, palpitations, shortness of breath and inability to lie flat. The result after part-of-speech tagging is as follows, where each token is a single character of the original text shown by an English gloss: half/n month/n before/adv feel/v receive/v wind/n cold/n cough/v cough/v flow/v nasal-discharge/v phlegm/n white/adj and/con sticky/adj after/con and/con water/n swell/v from/con foot/n back/n start/con begin/con spread/v extend/v to/v waist/n abdomen/n chest/n stuffy/v abdomen/n distend/v intake/v poor/adj urine/n scanty/adj whole/n grain/n not/adv transform/v near/n day/n illness/n condition/n add/v heavy/adv out/v sweat/ fear/ cold/n limb/n cold/ heart/n palpitate/v qi/n urgent/adv not/adv able/adv flat/n lie/v.

(3) Left and right indicator word features (L) and (R)

TCM named entities often occur together with particular characters. Characters that frequently appear to the left of a named entity are designated left indicator words, and characters that frequently appear to the right are designated right indicator words. For four-diagnosis findings, indicator words that often appear near the entity include "with", "further" and "still". For syndrome types, indicator words such as "leading to" and "tendency" often appear near the entity. For treatment methods, indicator words such as "give", "should", "still give", "treat with" and "treatment should" often appear near the entity.

(4) Term feature (Y)

TCM case record entities contain characters that describe human organs, such as "head", "eye", "tongue" and "fire". For four-diagnosis findings, the characters describe human organs and pathological manifestations, such as "head", "pain", "eye", "dry", "sweat", "out", "urine" and "yellow". For syndrome types, the pathogenesis is described in terms of the yin-yang and five-phase theory represented by "metal, wood, water, fire, earth" and of "wind, cold, dampness, dryness, fire", with characters such as "spleen", "deficiency", "damp", "exuberant", "heart", "yin", "not" and "sufficient". Treatment methods usually follow four-character or eight-character patterns, such as "soothing liver-qi stagnation" and "supplementing qi and nourishing yin, promoting blood circulation and removing blood stasis".

Segmentation and automatic feature annotation generate the corpus observation sequence and the output feature sequence. "T" indicates that a character has the annotated feature and "F" that it does not; ZS, ZX and ZF denote four-diagnosis, syndrome-type and treatment-method entities respectively, as shown in the table below.

Table 3. TCM case record category labels

Labeling uses the "BIO" scheme, in which "B", "I" and "O" denote the first character of an entity, a non-first character of an entity and a non-entity character respectively. The category labels for the example above are shown in Table 4.

Table 4. "BIO" category label annotation

Step (5): Formulate feature templates;

Further, step (5) is carried out as follows:

W_2 denotes the second following character;

Y_2 denotes the term feature of the second following character;

Further, step (6) is carried out as follows:

Step (6.3): The conditional random field model is trained on the training sample set;

The conditional random field is defined as follows. Segmentation, data cleaning and feature annotation yield the text input sequence x = (x₁, x₂, ..., x_n); the model parameters are obtained by training and are then used to predict the conditional probability of the required label sequence y.

Assuming that the input variable is x and the output variable is y, the conditional probability P(y | x) is defined in the following form:

where λ_k is a weight, t_k and s_l are feature functions, and Z(x) is the normalization factor.
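For reference, the standard linear-chain conditional random field form that matches the symbols named above is given below. It is a textbook formulation offered as an assumption about the intended definition: the weight symbol μ_l for the state feature functions s_l, the position index i and the sum over candidate sequences y' do not appear in the text.

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{i} \sum_{k} \lambda_k \, t_k(y_{i-1}, y_i, x, i)
            + \sum_{i} \sum_{l} \mu_l \, s_l(y_i, x, i) \right)

Z(x) = \sum_{y'} \exp\left( \sum_{i} \sum_{k} \lambda_k \, t_k(y'_{i-1}, y'_i, x, i)
            + \sum_{i} \sum_{l} \mu_l \, s_l(y'_i, x, i) \right)
```

In this form t_k is a transition feature function over adjacent labels, s_l is a state feature function over the label at position i, and Z(x) sums over all candidate label sequences so that the probabilities of all sequences for a given x add up to 1.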

As a second embodiment of the present invention:

As a third embodiment of the present invention:

1 Experiments and analysis

1.1 Evaluation criteria

The indicators used to evaluate information extraction are precision (P), recall (R) and F-measure (F), defined as follows:
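The definitions are not reproduced in this text. The sketch below uses the standard entity-level definitions as an assumption: P is the share of predicted entities that are correct, R is the share of gold entities that are found, and F is their harmonic mean. The reported averages are consistent with the harmonic mean, since P = 90.68 and R = 90.45 give F = 90.56.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def precision_recall_f(gold, predicted):
    """Entity-level scores.

    gold, predicted: sets of entities, e.g. (label, start, end) tuples
    (hypothetical layout). An entity counts as correct only on exact match.
    """
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_measure(precision, recall)
```

Exact matching of label and boundaries is the stricter of the common conventions; the text does not say which convention was used in the experiments.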

1.2 Experimental design and verification

(1) Feature marks

The present invention extracts contextual features within a context window [-2, 2] of size 5; this feature space is called the "5-character window". Each group of feature templates is written in the "letter + number" format, where the character is denoted by "W", the part of speech by "P", the left and right indicator words by "L" and "R", and the TCM term feature by "Y". Nineteen common feature marks are defined according to the context; Table 5 lists the marks and their meanings.

Table 5. Feature marks and their meanings

Serial number	Mark	Meaning	Serial number	Mark	Meaning
1	W_-2	Second preceding character	2	W_-1	First preceding character
3	W_0	Current character	4	W_1	First following character
5	W_2	Second following character	6	P_-2	Part of speech of the second preceding character
7	P_-1	Part of speech of the first preceding character	8	P_0	Part of speech of the current character
9	P_1	Part of speech of the first following character	10	P_2	Part of speech of the second following character
11	L_-2	Left indicator of the second preceding character	12	L_-1	Left indicator of the first preceding character
13	R_1	Right indicator of the first following character	14	R_2	Right indicator of the second following character
15	Y_-2	Term feature of the second preceding character	16	Y_-1	Term feature of the first preceding character
17	Y_0	Term feature of the current character	18	Y_1	Term feature of the first following character
19	Y_2	Term feature of the second following character
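The 19 marks in Table 5 can be read off a sentence with a small window function. This sketch assumes each character is stored as a (W, P, L, R, Y) row and uses a padding value for positions outside the sentence; the padding token and the function name are assumptions.

```python
# Offsets covered by each mark letter, following Table 5 (19 marks in total).
OFFSETS = {
    "W": (-2, -1, 0, 1, 2),
    "P": (-2, -1, 0, 1, 2),
    "L": (-2, -1),
    "R": (1, 2),
    "Y": (-2, -1, 0, 1, 2),
}
COLUMN = {"W": 0, "P": 1, "L": 2, "R": 3, "Y": 4}  # assumed row layout
PAD = "_PAD_"  # hypothetical value for positions outside the sentence

def window_features(rows, i):
    """Return {mark: value} for the character at index i within window [-2, 2]."""
    features = {}
    for letter, offsets in OFFSETS.items():
        for offset in offsets:
            j = i + offset
            inside = j in range(len(rows))
            value = rows[j][COLUMN[letter]] if inside else PAD
            features["{}_{}".format(letter, offset)] = value
    return features
```

Restricting the offsets to (-1, 0, 1) gives the size-3 window that the experiments compare against the size-5 window.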

(2) Experimental design

Tmpt_1, Tmpt_2, Tmpt_3 and Tmpt_4 are used to carry out three groups of comparative experiments, which test how feature selection and window size affect the experimental results. The templates are defined in Table 6 and the experimental design is given in Table 7.

Table 6. Templates

Table 7. Experimental design

1.3 Analysis of results

(1) Analysis of experiment 1

The experiment is run with Tmpt_1 and Tmpt_2. Table 8 gives the effect on the experimental results of setting the context window to 3 and to 5 respectively.

Table 8. Effect of the window change on the results

Category	P (%)	R (%)	F (%)
TCM four-diagnosis findings	+0.45	+0.47	+0.46
Syndrome type	+0.04	+0.38	+0.14
Treatment method	+1.83	+1.12	+1.28

The recognition effect differs among four-diagnosis findings, syndrome types and treatment methods because the average character lengths of the three entity classes differ: 3.17 characters for four-diagnosis findings, 2.21 characters for syndrome types and 4.78 characters for treatment methods. In terms of improvement, the F value increases by 0.46% for four-diagnosis findings, by 0.14% for syndrome types and by 1.28% for treatment methods. The experiment shows that recognition works better when the entity length is close to the length of the chosen context window.

(2) Analysis of experiment 2

A comparative experiment is run with Tmpt_2 and Tmpt_3 as the experimental templates. After the left and right indicator word marks are added, the recognition effect changes markedly, and the improvement is most pronounced for treatment methods. The effect on the results is shown in the table below.

Table 9. Effect of the feature selection marks on the results

Category	P (%)	R (%)	F (%)
TCM four-diagnosis findings	+7.17	+6.23	+0.19
Syndrome type	+5.37	+5.48	+0.42
Treatment method	+5.86	+4.76	+0.84

(3) Analysis of experiment 3

Experiment 3 adds a new experimental group whose template is Tmpt_4. Compared with Tmpt_3, it measures the effect of adding the term feature marks on named entity recognition. The experimental results are shown in Table 10.

Table 10. Best recognition results

Comparing the F values of each class of named entity in Table 10 shows that the best template is Tmpt_4, with average precision, recall and F-measure of 90.68%, 90.45% and 90.56% respectively; recognition performance is improved. A rich feature set plays an important role in raising the precision of named entity recognition. Some special cases still need to be corrected using dictionaries and rules.

(4) Comparison with existing methods

According to the literature, Feng Lizhi proposed a Bootstrapping-based hybrid recognition method for TCM clinical medical-record corpora, with an F value of 87%; Yuan Yuhu used CRF models for named entity extraction of symptom terms, and the best F value in the open-test evaluation reached 87%. The present invention reaches an average F value of 90.51% for named entity recognition on medical cases, higher than both.

The foregoing describes only preferred embodiments of the present application and is not intended to limit the application. For those skilled in the art, the application may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of this application shall fall within its scope of protection.

Claims

1. A TCM medical case named entity recognition method based on multi-feature template modification, characterized in that it comprises the following steps:

Step (1): extract sentences from the TCM medical case text;

Step (2): classify the extracted sentences;

Step (3): perform word segmentation on each class of sentence;

Step (4): mark each word obtained by segmentation, in turn, with a character feature, a part-of-speech feature, a left indicator-word feature, a right indicator-word feature, and a term feature; the character, part-of-speech, left indicator-word, right indicator-word, and term features of all words form the corpus observation sequence; label each word according to the TCM medical case category labels and the BIO labeling scheme to generate the output feature of each word; the output features of all words form the output feature sequence; the corpus observation sequence and the output feature sequence together constitute the corpus;

Step (5): define the feature templates;

Step (6): input the corpus obtained in step (4) and the feature templates obtained in step (5) into a conditional random field model, and train the model to obtain a trained conditional random field model;

Step (7): process the TCM medical case to be predicted with the same method as steps (1)-(4) to build the corpus to be predicted; input the constructed recognition corpus into the trained conditional random field model, output the TCM medical case category labels and the positions of the characters, and finally identify the TCM four diagnostic methods, syndrome types, and therapies according to the category labels and the character positions.
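
The last part of step (7), turning per-word labels and positions into entities, can be sketched as follows. The sketch assumes labels in the `ZS-B` / `ZS-I` / `O` form described in claim 6; the function name `decode_entities` and the sample characters are illustrative and not taken from the patent.

```python
from typing import List, Tuple

Entity = Tuple[str, int, int, str]  # (category, start, end_exclusive, text)


def decode_entities(chars: List[str], labels: List[str]) -> List[Entity]:
    """Group per-character labels such as ZS-B / ZS-I / O into entities."""
    entities: List[Entity] = []
    start, category = None, None
    # A trailing "O" sentinel flushes an entity that ends at the last character.
    for i, label in enumerate(labels + ["O"]):
        cat, _, pos = label.partition("-")
        continues = pos == "I" and cat == category
        if start is not None and not continues:
            entities.append((category, start, i, "".join(chars[start:i])))
            start, category = None, None
        if pos == "B":
            start, category = i, cat
    return entities


# Hypothetical model output for an eight-character sentence.
chars = ["舌", "红", "治", "以", "清", "热", "化", "痰"]
labels = ["ZS-B", "ZS-I", "O", "O", "ZF-B", "ZF-I", "ZF-I", "ZF-I"]
print(decode_entities(chars, labels))
```

A stray `-I` label with no preceding `-B` of the same category is ignored here; a stricter decoder could instead treat it as the start of a new entity.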

2. The TCM medical case named entity recognition method based on multi-feature template modification according to claim 1, characterized in that sentences are extracted using the punctuation marks in the TCM medical case text as separators, the punctuation marks being commas, semicolons, or full stops; the sentences extracted in step (1) include TCM four-diagnosis sentences, syndrome-type sentences, and therapy sentences; the TCM four diagnostic methods refer to the patient's abnormal symptoms and abnormal signs that the doctor obtains through the four diagnostic methods; the syndrome type refers to the syndrome confirmed by the doctor; the therapy refers to the treatment method confirmed by the doctor.
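
A minimal sketch of punctuation-based sentence extraction, assuming full-width and ASCII commas and semicolons plus the full-width full stop as separators. The ASCII period is deliberately left out so that decimal numbers are not split; that choice, the function name `extract_sentences`, and the sample text are assumptions.

```python
import re
from typing import List

# Comma, semicolon, and full stop, as named in the claim.
SEPARATORS = re.compile(r"[，,；;。]")


def extract_sentences(case_text: str) -> List[str]:
    """Split a medical case text into sentences at the separator marks."""
    return [part.strip() for part in SEPARATORS.split(case_text) if part.strip()]


print(extract_sentences("患者胸闷气短，舌红苔黄；治以清热化痰。"))
# ['患者胸闷气短', '舌红苔黄', '治以清热化痰']
```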

3. The TCM medical case named entity recognition method based on multi-feature template modification according to claim 1, characterized in that the sentences are classified according to qualifiers into current sentences, negated sentences, and possible sentences; negated sentences and possible sentences are discarded, and current sentences are retained; the qualifiers include current, negated, or possible; a current sentence indicates a discomfort symptom or disease that is definitely present; a possible sentence indicates a symptom that may arise or a diagnosis made by the doctor before confirmation; a negated sentence indicates a disease or symptom that definitely has not occurred in the patient.
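
The qualifier-based filtering can be sketched as a simple cue-word lookup. The cue words below are hypothetical placeholders, since the actual qualifier lexicon is not listed in this claim, and substring matching is only one possible way to apply it.

```python
from typing import List

# Hypothetical cue words; the real qualifier lexicon is not given here.
NEGATION_CUES = ("无", "未见", "否认")
POSSIBLE_CUES = ("可能", "疑似", "待查")


def classify_sentence(sentence: str) -> str:
    """Return 'negated', 'possible', or 'current' for one sentence."""
    if any(cue in sentence for cue in NEGATION_CUES):
        return "negated"
    if any(cue in sentence for cue in POSSIBLE_CUES):
        return "possible"
    return "current"


def keep_current(sentences: List[str]) -> List[str]:
    """Discard negated and possible sentences; keep current ones."""
    return [s for s in sentences if classify_sentence(s) == "current"]


print(keep_current(["胸闷气短", "无发热", "可能为胸痹"]))  # ['胸闷气短']
```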

4. The TCM medical case named entity recognition method based on multi-feature template modification according to claim 1, characterized in that meaningless words are removed first and wrongly written words are corrected; each class of sentence is then cut into single words (characters) using a character-based segmentation method; the meaningless words include numbers, units, and punctuation marks.
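
A sketch of this cleanup followed by character-level segmentation. The unit list, the punctuation set, and the correction table are hypothetical placeholders, and the sketch only removes units that directly follow a number.

```python
import re
from typing import Dict, List

# Hypothetical resources: units, punctuation set, and typo table are placeholders.
NUMBER_WITH_UNIT = re.compile(r"\d+(?:\.\d+)?\s*(?:mmHg|次/分|℃|岁|天|年|ml|g)?")
PUNCTUATION = re.compile(r"[，。；：、！？,.;:!?（）()\s]")
CORRECTIONS: Dict[str, str] = {"苔簿": "苔薄"}


def clean_and_segment(sentence: str) -> List[str]:
    """Fix known typos, drop numbers/units/punctuation, split into characters."""
    for wrong, right in CORRECTIONS.items():
        sentence = sentence.replace(wrong, right)
    sentence = NUMBER_WITH_UNIT.sub("", sentence)
    sentence = PUNCTUATION.sub("", sentence)
    return list(sentence)


print(clean_and_segment("体温38.5℃，舌红，苔簿黄。"))
# ['体', '温', '舌', '红', '苔', '薄', '黄']
```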

5. The TCM medical case named entity recognition method based on multi-feature template modification according to claim 1, characterized in that the character feature refers to each word itself;

the left indicator-word feature refers to words that appear to the left of a named entity; if the current word is a word that appears to the left of a named entity, the left indicator-word feature of the current word is T, otherwise it is F;

the right indicator-word feature refers to words that appear to the right of a named entity; if the current word is a word that appears to the right of a named entity, the right indicator-word feature of the current word is T, otherwise it is F;

the term feature refers to words that describe human organs; if the current word is a human organ, the term feature of the current word is T, otherwise it is F.
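
The T/F marking can be sketched as set lookups that build one feature row per word. The indicator-word sets, the organ set, and the part-of-speech tags in the example are hypothetical; in practice they would come from lexicons and a tagger that this claim does not specify.

```python
from typing import List, Set


def mark_features(
    chars: List[str],
    pos_tags: List[str],
    left_words: Set[str],
    right_words: Set[str],
    organ_words: Set[str],
) -> List[List[str]]:
    """Build one row per word: [character, POS, left T/F, right T/F, term T/F]."""
    rows = []
    for ch, pos in zip(chars, pos_tags):
        rows.append([
            ch,
            pos,
            "T" if ch in left_words else "F",
            "T" if ch in right_words else "F",
            "T" if ch in organ_words else "F",
        ])
    return rows


# Hypothetical lexicons and tags, for illustration only.
rows = mark_features(
    ["见", "舌", "红", "者"],
    ["v", "n", "a", "r"],
    left_words={"见"},
    right_words={"者"},
    organ_words={"舌", "心", "肝"},
)
for row in rows:
    print(row)
```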

6. The TCM medical case named entity recognition method based on multi-feature template modification according to claim 1, characterized in that each word is labeled according to the TCM medical case category labels and the BIO labeling scheme as follows: the TCM medical case category labels include TCM four diagnostic methods ZS, syndrome type ZX, and therapy ZF;

if the current word belongs to the TCM four diagnostic methods, its TCM medical case category label is ZS; if the current word labeled ZS is the first character, its output feature is ZS-B; if the current word labeled ZS is not the first character, its output feature is ZS-I;

if the current word belongs to a syndrome type, its TCM medical case category label is ZX; if the current word labeled ZX is the first character, its output feature is ZX-B; if the current word labeled ZX is not the first character, its output feature is ZX-I;

if the current word belongs to a therapy, its TCM medical case category label is ZF; if the current word labeled ZF is the first character, its output feature is ZF-B; if the current word labeled ZF is not the first character, its output feature is ZF-I;

if the current word belongs to none of TCM four diagnostic methods ZS, syndrome type ZX, and therapy ZF, the label of the current word is O.
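
The labeling rule above can be sketched as a function that turns annotated entity spans into per-word output features. The span representation `(start, end_exclusive, category)` and the function name `bio_labels` are assumptions made for illustration.

```python
from typing import List, Tuple

CATEGORIES = {"ZS", "ZX", "ZF"}  # four diagnostic methods, syndrome type, therapy


def bio_labels(n_chars: int, spans: List[Tuple[int, int, str]]) -> List[str]:
    """Label the first character of each span CAT-B, the rest CAT-I, others O."""
    labels = ["O"] * n_chars
    for start, end, category in spans:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        labels[start] = f"{category}-B"
        for i in range(start + 1, end):
            labels[i] = f"{category}-I"
    return labels


# Hypothetical annotation: characters 0-1 are ZS, characters 4-7 are ZF.
print(bio_labels(8, [(0, 2, "ZS"), (4, 8, "ZF")]))
# ['ZS-B', 'ZS-I', 'O', 'O', 'ZF-B', 'ZF-I', 'ZF-I', 'ZF-I']
```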

7. The TCM medical case named entity recognition method based on multi-feature template modification according to claim 1, characterized in that step (5) comprises:

arranging all words obtained by segmentation into a sequence; for each word obtained by segmentation, extracting contextual features within a context window [-2, 2] of size 5, and representing each feature in the format "letter + number", wherein the word is denoted by "W", the part of speech by "P", the left indicator word by "L", the right indicator word by "R", and the TCM term feature by "Y"; 19 common feature marks are set according to the context;

W_2 denotes the second word after the current word;

Y_2 denotes the term feature of the second word after the current word;

W_-2, W_-1, W_0, W_1, W_2, W_-1/W_0, W_0/W_1, W_-2/W_0, W_0/W_2, P_-2, P_-1, P_0, P_1, P_2, P_-1/P_0, P_0/P_1, P_-2/P_0, P_0/P_2, L_-2/W_0, L_-1/W_0, W_0/R_1, W_0/R_2, Y_-2/W_0, Y_-1/W_0, W_0/Y_0, W_0/Y_1, W_0/Y_2; wherein "/" denotes a separator.
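
One way to make these feature marks concrete is to translate them into CRF++-style unigram template lines of the form `Unn:%x[row,col]`. This is only a sketch: this claim does not name the CRF toolkit, and the column order (W=0, P=1, L=2, R=3, Y=4) is an assumption that follows the feature order of step (4).

```python
from typing import List

# Assumed column of each feature in the corpus file.
COLUMN = {"W": 0, "P": 1, "L": 2, "R": 3, "Y": 4}


def to_crfpp(feature_id: str, index: int) -> str:
    """Convert a mark such as 'W_-1/W_0' into a CRF++-style template line."""
    macros = []
    for item in feature_id.split("/"):
        name, offset = item.split("_")
        macros.append(f"%x[{int(offset)},{COLUMN[name]}]")
    return f"U{index:02d}:" + "/".join(macros)


def build_template(feature_ids: List[str]) -> str:
    """Join the template lines for a list of feature marks, one per line."""
    return "\n".join(to_crfpp(fid, i) for i, fid in enumerate(feature_ids))


print(build_template(["W_-2", "W_-1/W_0", "L_-1/W_0", "W_0/Y_2"]))
# U00:%x[-2,0]
# U01:%x[-1,0]/%x[0,0]
# U02:%x[-1,2]/%x[0,0]
# U03:%x[0,0]/%x[2,4]
```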

8. The TCM medical case named entity recognition method based on multi-feature template modification according to claim 1, characterized in that step (6) comprises:

Step (6.2): representing the word features in the training corpus as the observation sequence x and the output feature sequence as the output sequence y, and saving each input-output pair (x, y) in the training sample set;

Step (6.3): training the conditional random field model with the training sample set.
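
Step (6.2) can be sketched as pairing each sentence's feature rows (x) with its output labels (y) and serializing the pairs in the column format that CRF toolkits such as CRF++ read: one word per line, tab-separated columns, label last, and a blank line between sentences. The toolkit and file layout are assumptions; the names `make_sample` and `to_column_text` are illustrative.

```python
from typing import List, Tuple

Sample = Tuple[List[List[str]], List[str]]  # (observation sequence x, output sequence y)


def make_sample(feature_rows: List[List[str]], labels: List[str]) -> Sample:
    """Pair one sentence's feature rows with its labels, checking the lengths."""
    if len(feature_rows) != len(labels):
        raise ValueError("x and y must have the same length")
    return (feature_rows, labels)


def to_column_text(samples: List[Sample]) -> str:
    """Serialize samples: one word per line, blank line between sentences."""
    blocks = []
    for x, y in samples:
        lines = ["\t".join(row + [label]) for row, label in zip(x, y)]
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks) + "\n"


# Hypothetical two-word sentence with its features and labels.
sample = make_sample(
    [["舌", "n", "F", "F", "T"], ["红", "a", "F", "F", "F"]],
    ["ZS-B", "ZS-I"],
)
print(to_column_text([sample]), end="")
```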

9. A TCM medical case named entity recognition system based on multi-feature template modification, characterized in that it comprises: a memory, a processor, and computer instructions stored in the memory and executable on the processor, wherein the computer instructions, when executed by the processor, carry out the steps of the method of any one of claims 1-8.

10. A computer-readable storage medium, characterized in that computer instructions are stored thereon, and the computer instructions, when executed by a processor, carry out the steps of the method of any one of claims 1-8.