CN109635123A - A kind of Chinese medicine text concept recognition methods of increment type - Google Patents
A kind of Chinese medicine text concept recognition methods of increment type
- Publication number
- CN109635123A CN109635123A CN201811436594.5A CN201811436594A CN109635123A CN 109635123 A CN109635123 A CN 109635123A CN 201811436594 A CN201811436594 A CN 201811436594A CN 109635123 A CN109635123 A CN 109635123A
- Authority
- CN
- China
- Prior art keywords
- sample
- feature
- chinese medicine
- confidence level
- mark
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 239000003814 drug Substances 0.000 title claims abstract description 43
- 238000000034 method Methods 0.000 title claims abstract description 25
- 238000012549 training Methods 0.000 claims abstract description 17
- 238000007670 refining Methods 0.000 claims abstract description 3
- 208000024891 symptom Diseases 0.000 claims description 11
- 238000013461 design Methods 0.000 claims description 5
- 230000001174 ascending effect Effects 0.000 claims description 2
- 238000000926 separation method Methods 0.000 claims description 2
- 238000012216 screening Methods 0.000 abstract description 4
- 238000007792 addition Methods 0.000 description 5
- 206010019233 Headaches Diseases 0.000 description 4
- 231100000869 headache Toxicity 0.000 description 4
- 210000001015 abdomen Anatomy 0.000 description 2
- 208000019790 abdominal distention Diseases 0.000 description 2
- 238000013480 data collection Methods 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000003058 natural language processing Methods 0.000 description 2
- 238000011160 research Methods 0.000 description 2
- 206010013954 Dysphoria Diseases 0.000 description 1
- 206010028372 Muscular weakness Diseases 0.000 description 1
- 206010037660 Pyrexia Diseases 0.000 description 1
- 206010041349 Somnolence Diseases 0.000 description 1
- 238000009825 accumulation Methods 0.000 description 1
- 230000004596 appetite loss Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000010276 construction Methods 0.000 description 1
- 230000007423 decrease Effects 0.000 description 1
- 230000007812 deficiency Effects 0.000 description 1
- 238000011161 development Methods 0.000 description 1
- 230000018109 developmental process Effects 0.000 description 1
- 208000002173 dizziness Diseases 0.000 description 1
- 229940079593 drug Drugs 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 230000010354 integration Effects 0.000 description 1
- 235000021266 loss of appetite Nutrition 0.000 description 1
- 208000019017 loss of appetite Diseases 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 238000011430 maximum method Methods 0.000 description 1
- 230000036651 mood Effects 0.000 description 1
- 206010029410 night sweats Diseases 0.000 description 1
- 230000036565 night sweats Effects 0.000 description 1
- 238000005192 partition Methods 0.000 description 1
- 230000000644 propagated effect Effects 0.000 description 1
- 230000003252 repetitive effect Effects 0.000 description 1
- 230000011218 segmentation Effects 0.000 description 1
- 239000000126 substance Substances 0.000 description 1
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses an incremental concept recognition method for traditional Chinese medicine (TCM) texts. Starting from a small manually annotated corpus, predicted samples whose own confidence and whose neighboring samples' confidence are both high are selected and added to the annotated corpus; predicted samples whose confidence is low but whose contribution is high are returned for expert annotation and then added, thereby increasing the number of annotated samples. The seed set obtained in each iteration is used to keep optimizing the model M, finally yielding a refined model M. The invention analyzes the characteristics of TCM text and the structure of its entities, and extracts multiple features to improve the accuracy of TCM text concept recognition. Starting from a small seed set and screening predicted samples whose own and neighboring confidence are both high, annotated samples are grown incrementally, reducing manual effort. By screening out predicted samples whose confidence is low but whose contribution is high and returning them for expert annotation, the scale of the classification model and the training time are reduced.
Description
Technical field
The invention belongs to the field of natural language processing, and in particular relates to an incremental method for recognizing concepts in traditional Chinese medicine (TCM) texts.
Background technique
Traditional Chinese medicine (TCM) has been practiced for thousands of years, has formed a fairly complete knowledge system, and is an important part of traditional Chinese culture. In modern times, its prescriptions and drug data are widely used in clinical medicine. However, millennia of TCM clinical practice have accumulated a massive body of classical prescriptions and keep producing newly derived ones in clinical application. Relying purely on manual consultation, organization, and analysis inevitably introduces errors. If computer technology and state-of-the-art machine learning methods could instead be used to analyze and mine ancient TCM prescription datasets and extract the effective information they contain (for example, in TCM gynecology prescriptions), this would play an important role in inheriting and applying the TCM classics. With the continuous development of natural language processing technology and the growing interest in research on Chinese text, studies on recognizing Chinese medical terminology have also increased in recent years; yet research on term recognition specifically in the TCM domain remains scarce. Traditional medical texts are the main knowledge resource of TCM and are rich in clinical experience, mostly recorded and transmitted in document form. Entity recognition research on TCM medical data can further mine the textual knowledge of TCM and make a great contribution to the integration and innovation of TCM knowledge.
In TCM text concept recognition, training on a manually annotated corpus gives the best named entity recognition performance, but supervised methods depend heavily on the annotated corpus: large-scale manual annotation is labor-intensive and, because experts' subjective judgments differ, introduces considerable noise. Fully unsupervised classification, on the other hand, tends to accumulate errors and degrades performance. The proposed incremental TCM text concept recognition method therefore invests only a small seed set and trains iteratively to obtain more seeds, gradually growing the seed set to a certain scale; the model M trained in this way is progressively refined.
Summary of the invention
Aiming at the prior art's need for a large manually annotated corpus in TCM text concept recognition, the present invention provides an incremental TCM text recognition method. Specifically, starting from a small manually annotated corpus, predicted samples whose own and neighboring-sample confidence are high are selected and added to the annotated corpus; predicted samples whose confidence is low but whose contribution is high are added to the annotated corpus after being returned for expert annotation, thereby increasing the number of annotated samples.
To achieve the above object, the present invention adopts the following technical solution:
An incremental TCM text concept recognition method, comprising the following steps:
Step 1: preprocess the initial TCM text dataset;
Step 2: select multiple features and redefine the CRF template;
Step 3: prepare the annotated set and train a CRF on it with the custom feature templates to obtain an initial model M;
Step 4: select the predicted samples whose own and neighboring-sample confidence are high, and add the top k of them to the seed set;
Step 5: select the K predicted samples whose own and neighboring-sample confidence are lowest but whose contribution is high, and add them to the seed set after they are returned for expert annotation;
Step 6: use the seed set obtained in each iteration to keep optimizing model M, finally obtaining the refined model M.
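Steps 3 through 6 can be sketched as a self-training loop. The following is a minimal illustration only: `fit`, `confidence`, and `expert_label` are hypothetical stand-ins for CRF training, per-sample prediction confidence, and manual expert annotation, not the patent's actual implementation.

```python
def incremental_train(fit, confidence, expert_label, labeled, unlabeled, rounds, k, big_k):
    """Grow the seed set over several rounds (steps 3-6 of the method).

    fit, confidence, and expert_label are hypothetical callables standing in
    for CRF training, per-sample confidence scoring, and expert annotation.
    """
    model = fit(labeled)                                   # step 3: initial model M
    for _ in range(rounds):
        # rank unlabeled samples by the current model's confidence
        scored = sorted(unlabeled, key=lambda s: confidence(model, s), reverse=True)
        auto = scored[:k]                                  # step 4: most confident predictions
        hard = scored[-big_k:] if big_k else []            # step 5: least confident, sent to experts
        labeled = labeled + auto + [expert_label(s) for s in hard]
        unlabeled = scored[k:len(scored) - big_k]
        model = fit(labeled)                               # step 6: re-optimize M on the grown seed set
    return model, labeled
```

With toy stand-ins (a "model" that is just the set of labeled items, and numeric samples whose value doubles as their confidence), one round with k = 2 and K = 1 moves the two most confident and the one least confident sample into the labeled set.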
Features of the incremental TCM text concept recognition method of the present invention further include:
In step 1, TCM texts are in classical Chinese, mostly composed of single-character words, and contain many phonetic loan characters, so necessary data cleansing must be performed first: stop words (classical function words and modal particles that do not affect the medical meaning of the context, such as the particle rendered "person") are removed, and phonetic loan characters and misrecognized characters (e.g., a traditional-form character recognized as "?") are corrected manually.
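This cleansing step can be sketched as follows. The stop-character set and the correction map here are illustrative placeholders only; the patent does not give its full lists, and the particular characters chosen are assumptions.

```python
# Placeholder lists for illustration; the patent's actual stop-word list and
# correction table are not given in full, so these entries are assumed.
STOP_CHARS = {"者", "之"}   # classical particles carrying no medical meaning
FIX_MAP = {"?": "体"}       # restore a character the OCR misread as '?'

def clean(text):
    """Drop stop characters and repair misrecognized ones."""
    return "".join(FIX_MAP.get(c, c) for c in text if c not in STOP_CHARS)
```

For example, a sentence-final particle is stripped while the surrounding symptom text is preserved, and a misread character is restored from the correction map.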
In step 2, considering that TCM texts are written tersely with many omissions and a divergence between the spoken and the written language, that a single character often corresponds to a word, and considering the structural features of each part of an entity, three features are chosen. 1) Part-of-speech features: in the "indications" field the "n"+"a" and "n"+"v" structures occur most often, and adding POS features provides clues for identifying entity boundaries. 2) A body-part indicator feature, which marks whether the current character is a body-part-related word; such words frequently appear in symptom descriptions. 3) Context features: adjacent characters in a sequence are correlated, which corresponds to the edge features of the CRF model; different window lengths are selected to combine the various features into new ones.
In step 2, feature templates are designed according to the selected features, using templates of the fixed format %x[row,col], where row determines the position of the described character relative to the current character, and col determines the column occupied in the preprocessed corpus by the attribute this template refers to. In the preprocessed corpus file, column 0 is the TCM text itself, column 1 the POS feature, column 2 the body-part feature, and column 3 the context feature, i.e., the state label of the TCM text. For each input feature column t (0~3), templates of two forms are defined:
T1 = num:%x[index, t], (1)
T2 = num:%x[index, t]/%x[index+1, t], (2)
where num is the number of the template, index is an index within the window range (0~2), and T2 combines feature t at adjacent positions.
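Under these conventions, a feature-template file of forms (1) and (2) might look like the following. This is an illustrative fragment in CRF++ template syntax; the template numbering and window offsets are examples, not the patent's exact template.

```
# Unigram templates, form (1): feature column t at offsets within the window
U00:%x[-2,0]
U01:%x[-1,0]
U02:%x[0,0]
U03:%x[1,0]
U04:%x[2,0]
U05:%x[0,1]          # POS feature of the current character
U06:%x[0,2]          # body-part feature of the current character
# Combined templates, form (2): feature t at adjacent positions
U07:%x[0,0]/%x[1,0]
U08:%x[-1,1]/%x[0,1]
# Bigram template over output labels (the edge features of the CRF)
B
```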
In step 3, the annotated set is prepared and a CRF is trained on it with the custom feature templates to obtain the initial model M. According to the selected features, a representation label is designed for each feature; the annotated set is formatted according to the input format required by the CRF; and the feature-template window size is chosen according to the designed annotated set.
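The preprocessed corpus fed to the CRF in step 3 would then have four columns per character, matching the column assignment of step 2. The fragment below is an invented example for illustration; the characters, POS tags, and labels are assumptions, not taken from the patent's corpus.

```
头	n	Y	B-symptom
痛	n	N	E-symptom
不	d	N	O
```

Each line gives the character (column 0), its POS feature (column 1), the body-part indicator Y/N (column 2), and the state label (column 3).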
In step 4, the unannotated data are predicted with the initial model M to obtain initial predicted samples, and the predicted samples whose own and neighboring-sample confidence are high are added to the seed set.
The selection method is as follows:
All characters of an entity E are weighted into that entity's confidence, and the confidence of the entity's neighboring samples is also taken into account when computing it; the entity predictions with the highest confidence are added to the annotated sample set for retraining. Let Ht be the classifier obtained in the t-th self-training iteration; then in round t, for an entity Ei in the unlabeled sample pool:
confidence(Ht, Ei) = Σn Ht(xn, tag) + θi Σxj∈Ei Ht(xj, tag),
where Ht(xj, tag) is the confidence the classifier assigns in iteration t to each character contained in the entity, Ht(xn, tag) is the confidence assigned in iteration t to the samples neighboring Ei, and θi is the weight of each label in the entity concept, determined by the number of characters Ei contains. The confidence(Ht, Ei) values are ranked and the K most confident predictions are added to the annotated set.
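The entity-confidence computation and top-K selection can be sketched as follows. This is a minimal reading of the selection rule, under the assumption (consistent with the worked example later in the description) that θ equals one over the number of characters in the entity and that neighbor confidences are summed unweighted.

```python
def entity_confidence(char_confs, neighbor_confs):
    """Confidence of one predicted entity: the sum of its neighboring
    samples' confidences plus a theta-weighted sum of its own characters'
    confidences, with theta = 1 / len(char_confs) (an assumption consistent
    with the worked example in the embodiment)."""
    theta = 1.0 / len(char_confs)
    return sum(neighbor_confs) + theta * sum(char_confs)

def top_k_confident(entities, k):
    """Step 4: keep the k most confident candidate entities.
    Each entity is a tuple (name, char_confs, neighbor_confs)."""
    ranked = sorted(entities, key=lambda e: entity_confidence(e[1], e[2]), reverse=True)
    return [e[0] for e in ranked[:k]]
```

Applied to the per-character confidences of the embodiment's example, this reproduces the stated confidence of 2.236368 for the single-character entity "headache".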
In step 5, with each iteration of the algorithm the unlabeled sample pool shrinks, and the remaining samples are those with low classification confidence, which would normally be discarded. Consider the samples whose classification confidence is low and differs greatly from that of their neighboring samples: if such a sample and its neighbor are assigned the same label, the probability of a mislabel is high; if they are assigned different labels, the sample very likely lies on the decision boundary. In both cases these samples contribute the most to the classifier, so the labels with large contribution are obtained and returned for expert annotation. The selection method is as follows:
in the unlabeled pool, choose the predicted samples whose classification confidence is low and differs greatly from that of their neighboring samples, and return them for expert annotation.
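One way to sketch this selection is the following. The contribution score used here (the absolute gap between a sample's confidence and its neighbor's) is a hypothetical concretization: the patent states the criterion in words but does not give an explicit contribution formula.

```python
def select_for_expert(samples, n_lowest, n_expert):
    """Step 5 sketch: from the n_lowest least-confident samples, pick the
    n_expert whose confidence differs most from their neighbors' (a
    hypothetical contribution score). Each sample is a tuple
    (sample_id, own_confidence, neighbor_confidence)."""
    least = sorted(samples, key=lambda s: s[1])[:n_lowest]          # lowest confidence first
    by_gap = sorted(least, key=lambda s: abs(s[1] - s[2]), reverse=True)
    return [s[0] for s in by_gap[:n_expert]]
```

For example, among three low-confidence candidates, the two whose confidence diverges most from their neighbors' are routed to the expert.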
Compared with the prior art, the beneficial effects of the invention are:
1. The invention analyzes the characteristics of TCM text and the structure of its entities, and extracts multiple features to improve the accuracy of TCM text concept recognition.
2. The invention starts from a small seed set and, by screening predicted samples whose own confidence and whose neighboring samples' confidence are both high and adding them to the seed set, achieves incremental growth of the annotated samples and reduces manual effort.
3. The invention screens out predicted samples whose confidence is low but whose contribution is high and returns them for expert annotation, thereby reducing the scale of the classification model and the training time.
Brief description of the drawings
The present invention will be better understood from the following detailed description of embodiments of the invention taken in conjunction with the accompanying drawing, in which like reference numbers denote like parts:
Fig. 1 is a flowchart of the incremental concept recognition method of the present invention.
Specific embodiment
The specific implementation of the incremental TCM text recognition method of the present invention uses a conditional random field (CRF) model as the base classifier, with 11,000 TCM texts as model training data, of which 4,000 form the labeled dataset and 7,000 the unlabeled dataset. A CRF is trained on the training data with the selected features and custom templates and used to predict the unlabeled data, yielding an initial set of predicted samples; confidence then determines whether a predicted sample is added to the annotated set directly or added after being returned for manual annotation.
In the present embodiment:
Step 1: data cleansing is performed on the TCM texts. For example, in a passage such as "A man, or a married woman with hectic fever from yin-deficiency taxation, emaciated flesh, weak limbs, a flushed yellow complexion, vexing heat in the palms and soles, profuse night sweats, drowsiness with palpitations, or abdominal masses and loss of appetite...", the variant character for "body" that was misrecognized as "?" is corrected to its standard form, and the sentence-final modal particle rendered "person" is removed.
Step 2: in the entity recognition stage a character-based recognition strategy is adopted. The jieba segmentation tool, extended with a user-defined dictionary, is used for word segmentation, and jieba's built-in lexicon performs part-of-speech tagging for the selected POS feature; for example, the system tags "weakness of the limbs" as "limbs/n weakness/n", where n denotes a noun. The body-part feature indicates whether the current character is a body part (Y/N), since the appearance of a body part regularly accompanies a symptom entity. On top of the recognized classes, the "BIESO" scheme is used for labeling, where B (beginning) marks the first character of a term, I (intermediate) a middle character, E (end) the final character, S a single-character term, and O (other) a non-term character. The descriptive labels are chosen as {B-symptom, I-symptom, E-symptom, S-symptom, B-pattern, I-pattern, E-pattern, S-pattern, O}, i.e., the first, middle, final, and single characters of symptom and pattern (syndrome-type) terms, plus non-term characters. The segmentation results are then labeled by dictionary-based matching with the forward maximum matching algorithm.
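The "BIESO" labeling described above can be sketched as a small per-term encoder. This is a minimal illustration: the term/type pairs stand in for the output of the dictionary-matching step, and are assumptions rather than the patent's code.

```python
def bieso_tags(term, etype=None):
    """Per-character BIESO labels for one dictionary-matched term.

    etype is the entity type ('symptom' or 'pattern' in the embodiment);
    etype=None marks a non-term, whose characters are all labeled O.
    """
    if etype is None:
        return ["O"] * len(term)
    if len(term) == 1:
        return ["S-" + etype]                       # single-character term
    return (["B-" + etype]                          # first character
            + ["I-" + etype] * (len(term) - 2)      # middle characters
            + ["E-" + etype])                       # final character
```

A two-character symptom term thus receives B-symptom/E-symptom, while a single character receives S-symptom.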
Step 4: TCM concept prediction is performed on the unlabeled set with the initial model M; that is, for each observation sequence Xi, i = 1, 2, ..., n, an optimal label sequence Yi is found over the state tags such that the conditional probability P(Y|X) is maximal, i.e., Max(P(Y|X)) = P(Yi|Xi, M), and P(Yi|Xi, M) is then model M's confidence in its prediction for sequence Xi. The larger P(Yi|Xi, M), the more certain model M is about the prediction Yi for the observation sequence Xi; the smaller it is, the less certain. CRF training is implemented with the CRF++ toolkit, which can output the confidence of each prediction alongside each predicted sample. The confidence of each predicted entity is computed according to the formula of step 4. For example, part of the prediction output of the first initial training of M is as follows:
Word | State tag | Confidence
---|---|---
, | O | 0.999867
abdomen | B-symptom | 0.959049
distension | E-symptom | 0.958962
, | O | 0.999978
headache | S-symptom | 0.804793
dizzy | S-symptom | 0.431597
Predict the confidence level of entity tripe abdominal distention are as follows: tripe abdominal distention entity contains 2, so
Predict the confidence level of entity headache are as follows: headache entity contains 1 word, so θ=1,
confidence(H1, headache) and=0.999978+0.431597+1 × 0,804793=2.236368
The confidence level of prediction entity is ranked up, preceding 50 additions mark collection is chosen.
Step 5: all the entities of step 4 are ranked and the 100 samples with the lowest confidence are chosen; from these, according to the contribution criterion of step 5, the 50 samples with the highest contribution are selected and returned for expert annotation.
It should be understood that the above is an example intended to illustrate the principle and practicability of the present invention, and the present invention is not limited thereto. Those skilled in the art can make all kinds of variations and modifications without departing from the spirit and substance of the present invention, and such variations and modifications are also considered to fall within the protection scope of the present invention.
Claims (7)
1. An incremental TCM text concept recognition method, characterized by comprising the following steps:
Step 1: preprocess the initial TCM text dataset;
Step 2: select multiple features and redefine the CRF template;
Step 3: prepare the annotated set and train a CRF on it with the custom feature templates to obtain an initial model M;
Step 4: select the predicted samples whose own and neighboring-sample confidence are high, and add the top k of them to the seed set;
Step 5: select the K predicted samples whose own and neighboring-sample confidence are lowest but whose contribution is high, and add them to the seed set after they are returned for expert annotation;
Step 6: use the seed set obtained in each iteration to keep optimizing model M, finally obtaining the refined model M.
2. The incremental TCM text concept recognition method according to claim 1, characterized in that: in step 1, TCM texts are in classical Chinese, mostly composed of single-character words, and contain many phonetic loan characters, so necessary data cleansing must be performed first: stop words (classical function words and modal particles that do not affect the medical meaning of the context) are removed, and phonetic loan characters and misrecognized characters are corrected manually.
3. The incremental TCM text concept recognition method according to claim 1, characterized in that: in step 2, considering that TCM texts are written tersely with many omissions and a divergence between the spoken and the written language, that a single character often corresponds to a word, and considering the structural features of each part of an entity, three features are chosen: 1) part-of-speech features, where in the "indications" field the "n"+"a" and "n"+"v" structures occur most often, and adding POS features provides clues for identifying entity boundaries; 2) a body-part indicator feature, which marks whether the current character is a body-part-related word, such words frequently appearing in symptom descriptions; 3) context features, where adjacent characters in a sequence are correlated, corresponding to the edge features of the CRF model, and different window lengths are selected to combine the various features into new ones.
4. The incremental TCM text concept recognition method according to claim 1, characterized in that: in step 2, feature templates are designed according to the selected features, using templates of the fixed format %x[row,col], where row determines the position of the described character relative to the current character, and col determines the column occupied in the preprocessed corpus by the attribute this template refers to; in the preprocessed corpus file, column 0 is the TCM text itself, column 1 the POS feature, column 2 the body-part feature, and column 3 the context feature, i.e., the state label of the TCM text; for each input feature column t (0~3), templates of two forms are defined:
T1 = num:%x[index, t], (1)
T2 = num:%x[index, t]/%x[index+1, t], (2)
where num is the number of the template, index is an index within the window range (0~2), and T2 combines feature t at adjacent positions.
5. The incremental TCM text concept recognition method according to claim 1, characterized in that: in step 3, the annotated set is prepared and a CRF is trained on it with the custom feature templates to obtain the initial model M; according to the selected features, a representation label is designed for each feature; the annotated set is designed according to the input format required by the CRF; and the feature-template window size is chosen according to the designed annotated set.
6. The incremental TCM text concept recognition method according to claim 1, characterized in that: in step 4, the unannotated data are predicted with the initial model M to obtain initial predicted samples, and the predicted samples whose own and neighboring-sample confidence are high are added to the seed set;
the selection method is as follows:
all characters of an entity E are weighted into that entity's confidence, the confidence of the entity's neighboring samples also being considered when computing it, and the entity predictions with the highest confidence are added to the annotated sample set for retraining; let Ht be the classifier obtained in the t-th self-training iteration; then in round t, for an entity Ei in the unlabeled sample pool:
confidence(Ht, Ei) = Σn Ht(xn, tag) + θi Σxj∈Ei Ht(xj, tag),
where Ht(xj, tag) is the confidence the classifier assigns in iteration t to each character contained in the entity, Ht(xn, tag) is the confidence assigned in iteration t to the samples neighboring Ei, and θi is the weight of each label in the entity concept, determined by the number of characters Ei contains; the confidence(Ht, Ei) values are ranked and the K most confident predictions are added to the annotated set.
7. The incremental TCM text concept recognition method according to claim 1, characterized in that: in step 5, the labels with large contribution are obtained and returned for expert annotation; the selection method is as follows: in the unlabeled pool, choose the predicted samples whose classification confidence is low and differs greatly from that of their neighboring samples, and return them for expert annotation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811436594.5A CN109635123A (en) | 2018-11-28 | 2018-11-28 | A kind of Chinese medicine text concept recognition methods of increment type |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811436594.5A CN109635123A (en) | 2018-11-28 | 2018-11-28 | A kind of Chinese medicine text concept recognition methods of increment type |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109635123A true CN109635123A (en) | 2019-04-16 |
Family
ID=66070065
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811436594.5A Pending CN109635123A (en) | 2018-11-28 | 2018-11-28 | A kind of Chinese medicine text concept recognition methods of increment type |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109635123A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110516241A (en) * | 2019-08-26 | 2019-11-29 | 北京三快在线科技有限公司 | Geographical address analytic method, device, readable storage medium storing program for executing and electronic equipment |
CN111259626A (en) * | 2020-01-16 | 2020-06-09 | 上海国民集团健康科技有限公司 | Traditional Chinese medicine entity recognition algorithm |
CN112733869A (en) * | 2019-10-28 | 2021-04-30 | 中移信息技术有限公司 | Method, device and equipment for training text recognition model and storage medium |
CN112733869B (en) * | 2019-10-28 | 2024-05-28 | 中移信息技术有限公司 | Method, device, equipment and storage medium for training text recognition model |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104318242A (en) * | 2014-10-08 | 2015-01-28 | 中国人民解放军空军工程大学 | High-efficiency SVM active half-supervision learning algorithm |
CN108549639A (en) * | 2018-04-20 | 2018-09-18 | 山东管理学院 | Based on the modified Chinese medicine case name recognition methods of multiple features template and system |
-
2018
- 2018-11-28 CN CN201811436594.5A patent/CN109635123A/en active Pending
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104318242A (en) * | 2014-10-08 | 2015-01-28 | 中国人民解放军空军工程大学 | High-efficiency SVM active half-supervision learning algorithm |
CN108549639A (en) * | 2018-04-20 | 2018-09-18 | 山东管理学院 | Based on the modified Chinese medicine case name recognition methods of multiple features template and system |
Non-Patent Citations (4)
Title |
---|
JIANQIANG LI et al.: "Large Scale Sequential Learning from Partially Labeled Data", 2013 IEEE Seventh International Conference on Semantic Computing *
LIU Fangchi: "Research on text-based entity-relation extraction techniques", China Masters' Theses Full-text Database, Information Science and Technology *
ZHANG Lei: "Research on general methods for domain-specific named entity recognition", China Masters' Theses Full-text Database, Information Science and Technology *
CHEN Haihong et al., University of Electronic Science and Technology of China Press *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109960800A (en) | Weakly supervised file classification method and device based on Active Learning | |
CN105808524A (en) | Patent document abstract-based automatic patent classification method | |
CN106980609A (en) | A kind of name entity recognition method of the condition random field of word-based vector representation | |
CN108829801A (en) | A kind of event trigger word abstracting method based on documentation level attention mechanism | |
CN109697285A (en) | Enhance the hierarchical B iLSTM Chinese electronic health record disease code mask method of semantic expressiveness | |
CN109710925A (en) | Name entity recognition method and device | |
CN108897989A (en) | A kind of biological event abstracting method based on candidate events element attention mechanism | |
CN110059185A (en) | A kind of medical files specialized vocabulary automation mask method | |
CN110188197B (en) | Active learning method and device for labeling platform | |
CN107463607A (en) | The domain entities hyponymy of bluebeard compound vector sum bootstrapping study obtains and method for organizing | |
CN110879831A (en) | Chinese medicine sentence word segmentation method based on entity recognition technology | |
CN110134946A (en) | A kind of machine reading understanding method for complex data | |
CN108664474A (en) | A kind of resume analytic method based on deep learning | |
Wang et al. | A framework and its empirical study of automatic diagnosis of traditional Chinese medicine utilizing raw free-text clinical records | |
CN108717413A (en) | It is a kind of based on the assumption that property semi-supervised learning Opening field answering method | |
CN110348017B (en) | Text entity detection method, system and related components | |
CN108446334A (en) | A kind of content-based image retrieval method of unsupervised dual training | |
CN108829823A (en) | A kind of file classification method | |
CN108038099A (en) | Low frequency keyword recognition method based on term clustering | |
CN111858896A (en) | Knowledge base question-answering method based on deep learning | |
CN103608805B (en) | Dictionary generation and method | |
CN109635123A (en) | A kind of Chinese medicine text concept recognition methods of increment type | |
Alqahtani et al. | A multitask learning approach for diacritic restoration | |
CN113160917B (en) | Electronic medical record entity relation extraction method | |
CN114579695A (en) | Event extraction method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20190416 |