CN110298040A

CN110298040A - Control method and control device for annotating and recognizing a Chinese corpus

Info

Publication number: CN110298040A
Application number: CN201910539314.1A
Authority: CN
Inventors: 嵇艺; 徐盛; 罗震
Original assignee: Xiamen Yifangjian Information Technology Co Ltd; Yijian (shanghai) Information Technology Co Ltd
Current assignee: Xiamen Yifangjian Information Technology Co Ltd; Yijian (shanghai) Information Technology Co Ltd
Priority date: 2019-06-20
Filing date: 2019-06-20
Publication date: 2019-10-01

Abstract

The invention discloses a control method for annotating and recognizing a Chinese corpus, in which the Chinese corpus is recognized by a corpus model. The method includes the following steps: a. determining a vectorized Chinese corpus based on a word vector dictionary and the Chinese corpus, the word vector dictionary being determined from a general Chinese corpus and a medical corpus; b. performing sequential correlation analysis on the vectorized Chinese corpus to determine a time-series Chinese corpus; c. performing maximum-likelihood inference on the time-series Chinese corpus to determine an annotated Chinese corpus, the annotation information of which includes at least simplified annotations and compound annotations. The invention is simple to use, convenient to operate, powerful, practical, has strong learning ability, recognizes efficiently and accurately, and has high commercial value.

Description

Control method and control device for annotating and recognizing a Chinese corpus

Technical field

The invention belongs to the field of computer applications in medicine and, in particular, relates to a control method and control device for annotating and recognizing a Chinese corpus.

Background technique

Medical entity annotation that incorporates medical knowledge is a necessary step in the computer analysis of medical information. It can be used in scenarios such as clinical auxiliary diagnosis, hospital guidance, medical record quality control, medical report analysis, and medical data analysis, so its field of application is very broad. For example, it can be applied to an HIS system to recognize natural-language medical records and assist diagnosis; to online registration and hospital guidance to recognize the information in a patient's self-description; and to a medical record quality control system to recognize the natural-language parts of a record and analyze the information they contain.

In the prior art, labels such as body part, orientation, symptom, and drug are defined according to general medical knowledge, and annotation rules are formulated. Every sentence in the collected corpus is annotated according to these rules, and the annotated data is then used for machine learning training, finally yielding a computer software model that can recognize medical entities.

However, such a technical solution has many defects.

1. The label design is too general. For example, only labels with overly broad coverage, such as disease, symptom, and treatment, are designed. Such a design is too idealized: once negation words, modifiers, and the like appear in the Chinese text, the content of a given label is "split" apart. For example, in "abdomen slightly distended; under the xiphoid process and abdomen, no tenderness", the negation label <Neg> assigned to the negation word "no" separates "under the xiphoid process and abdomen" from "tenderness".

2. The label design lacks support for the scope of action of a label. Label designs all carry a strong assumption, namely that medical text is concise and standardized. Actual medical records, however, are often limited by the author's command of language: wrong punctuation, arbitrarily omitted subjects, several subjects mixed together, frequent English abbreviations, and so on. As a result, the coverage of a body part does not match or extends beyond the normal sentence boundary, and multiple subjects and modifiers are mixed together, which makes understanding after entity extraction difficult. For example: "no obvious cause, poor appetite, intermittent vomiting of gastric contents several times, foamy stool, 4 times/day, slightly tired". The symptom modifiers "intermittent", "gastric contents several times", "foamy", and "4 times/day" carry no annotation of the scope they modify, so sentence fragments have to be analyzed during subsequent entity normalization; if punctuation is also misused, entity normalization errors are likely.

3. Existing medical knowledge bases are underused. Medicine has a great many specialized terms, and a considerable share of them are longer than 5 characters. The prior art does not consider using such knowledge bases to correct entity recognition. For example, for the ICD-10 term "mumps with other complications", general annotation recognition would only recognize "mumps".

At present, the prior art offers no technical solution that solves the above technical problems; specifically, a control method and control device for annotating and recognizing a Chinese corpus is lacking.

Summary of the invention

In view of the technical defects of the prior art, the object of the present invention is to provide a control method for annotating and recognizing a Chinese corpus, in which the Chinese corpus is recognized by a corpus model, including the following steps:

a. determining a vectorized Chinese corpus based on a word vector dictionary and the Chinese corpus, the word vector dictionary being determined from a general Chinese corpus and a medical corpus;

b. performing sequential correlation analysis on the vectorized Chinese corpus to determine a time-series Chinese corpus;

c. performing maximum-likelihood inference on the time-series Chinese corpus to determine an annotated Chinese corpus, the annotation information of the annotated Chinese corpus including at least simplified annotations and compound annotations.

Preferably, step i precedes step a: performing knowledge base pre-analysis on the Chinese corpus to determine a first Chinese corpus that meets the annotation condition;

Correspondingly, in step a, the vectorized Chinese corpus is determined based on the word vector dictionary and a second Chinese corpus, i.e. the Chinese corpus other than the first Chinese corpus;

Correspondingly, in step c, the annotated Chinese corpus determined by maximum-likelihood inference and the first Chinese corpus from step i are together output as the Chinese corpus output.

Preferably, in step c, the maximum-likelihood inference includes:

c1: calculating a first value for one or more annotations of the time-series Chinese corpus, taking the sentence as the unit;

c2: calculating a second value for all annotations of the time-series Chinese corpus, taking the sentence as the unit.

Preferably, step c1 is determined based on the following formula:

Wherein X is a Chinese corpus of length N, Y is its annotation, Transition is a T*T matrix, and Emission is an N*T matrix.

Preferably, step c2 includes at least the following formula:

Wherein loss is the loss function, X is a Chinese corpus of length N, and Y is its annotation.

Preferably, the simplified annotations include at least: symptom, examination, treatment, drug, disease, negation, cause, medical event, and time.

Preferably, the compound annotations include at least: orientation, anatomy/physiology, clinical manifestation, symptom modifier, examination item, examination result, drug modifier, treatment modifier, disease modifier, and medical history type description.

Preferably, step d follows step c: outputting one or more annotated Chinese corpora in the form of the annotated Chinese corpus together with its cut points.

According to another aspect of the present invention, a control device for annotating and recognizing a Chinese corpus is provided, in which the Chinese corpus is recognized by a corpus model, comprising:

First determining device: determines a vectorized Chinese corpus based on a word vector dictionary and the Chinese corpus;

Second determining device: performs sequential correlation analysis on the vectorized Chinese corpus to determine a time-series Chinese corpus;

Third determining device: performs maximum-likelihood inference on the time-series Chinese corpus to determine an annotated Chinese corpus.

Preferably, the device further includes a fourth determining device: performs knowledge base pre-analysis on the Chinese corpus to determine a first Chinese corpus that meets the annotation condition.

Preferably, the third determining device includes:

First processing unit: calculates a first value for one or more annotations of the time-series Chinese corpus, taking the sentence as the unit;

Second processing unit: calculates a second value for all annotations of the time-series Chinese corpus, taking the sentence as the unit.

Preferably, the device further includes:

Third processing unit: outputs one or more annotated Chinese corpora in the form of the annotated Chinese corpus together with its cut points.

The invention discloses a control method and control device for annotating and recognizing a Chinese corpus. The Chinese corpus is recognized by a corpus model: a vectorized Chinese corpus is determined based on a word vector dictionary and the Chinese corpus; sequential correlation analysis is performed on the vectorized Chinese corpus to determine a time-series Chinese corpus; and maximum-likelihood inference is performed on the time-series Chinese corpus to determine the annotated Chinese corpus. In the label design, fine-grained and coarse-grained labels coexist. A "vector paragraph" label is added to the label design to assist subsequent work such as entity understanding and sentence-segment delimitation analysis. An online correction module based on a medical knowledge base is added to the maximum-likelihood inference of label recognition. Multi-layer stacked LSTMs are used in the temporal analysis of sentences to analyze the forward and backward correlation of characters and words and to reflect it in the output. The invention is simple to use, convenient to operate, powerful, practical, has strong learning ability, is efficient and accurate, and has high commercial value.

Brief description of the drawings

Other features, objects, and advantages of the invention will become more apparent upon reading the detailed description of non-limiting embodiments with reference to the following drawings:

Fig. 1 shows a specific embodiment of the invention: a schematic flowchart of a control method for annotating and recognizing a Chinese corpus;

Fig. 2 shows the first embodiment of the invention: a schematic flowchart of a control method for annotating and recognizing a Chinese corpus;

Fig. 3 shows the second embodiment of the invention: a schematic flowchart of performing maximum-likelihood inference on the time-series Chinese corpus to determine the annotated Chinese corpus; and

Fig. 4 shows another embodiment of the invention: a module connection diagram of a control device for annotating and recognizing a Chinese corpus.

Specific embodiment

In order to present the technical solution of the present invention more clearly, the invention is further described below with reference to the accompanying drawings.

Fig. 1 shows a specific embodiment of the invention: a schematic flowchart of a control method for annotating and recognizing a Chinese corpus. Those skilled in the art will understand that the invention mainly recognizes the Chinese corpus by means of a corpus model, so as to turn the Chinese corpus into a segmented form that can support auxiliary diagnosis, hospital guidance, medical record quality control, medical report analysis, and medical data analysis. The invention itself does not involve the diagnosis or treatment of disease and is therefore patentable subject matter, which is not elaborated here.

Specifically, through the embodiment shown in Fig. 1, the invention describes the flow of a control method for annotating and recognizing a Chinese corpus, which includes the following steps:

First, in step S101, a vectorized Chinese corpus is determined based on a word vector dictionary and the Chinese corpus, the word vector dictionary being determined from a general Chinese corpus and a medical corpus. Those skilled in the art will understand that the word vector dictionary is obtained by joint training on the general Chinese corpus and the medical corpus. Each character or word is looked up in the dictionary, and the corresponding sequence of word vectors is fed into the sequential correlation analysis. At present, each word vector is a 128-dimensional vector. In this embodiment, the Chinese corpus is processed through the word vector dictionary to obtain the vectorized Chinese corpus.
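
The dictionary lookup in step S101 can be sketched as follows. This is a minimal illustration rather than the patented implementation: the per-character granularity, the random stand-in vectors, the `<unk>` fallback key, and the names `build_toy_dictionary` and `vectorize` are all assumptions; only the 128-dimensional vector size comes from the description.

```python
import random

EMBED_DIM = 128  # the description states 128-dimensional vectors
UNK = "<unk>"    # hypothetical fallback key for characters missing from the dictionary


def build_toy_dictionary(chars, dim=EMBED_DIM, seed=0):
    """Stand-in for a trained dictionary: maps each character to a random vector."""
    rng = random.Random(seed)
    table = {ch: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for ch in chars}
    table[UNK] = [0.0] * dim
    return table


def vectorize(text, dictionary):
    """Look up each character and return one vector per character, in order."""
    return [dictionary.get(ch, dictionary[UNK]) for ch in text]


dictionary = build_toy_dictionary("腹稍胀无压痛")
vectors = vectorize("腹稍胀X", dictionary)  # "X" is not in the dictionary
```

In a real system the table would hold vectors trained jointly on the general and medical corpora; the lookup itself stays the same.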

Then, in step S102, sequential correlation analysis is performed on the vectorized Chinese corpus to determine the time-series Chinese corpus. In this embodiment, the invention preferably uses an LSTM model: the vectorized Chinese corpus is taken as input and analyzed in the LSTM model to obtain the time-series Chinese corpus. The purpose of obtaining the time-series Chinese corpus is to better compare each character or word in the Chinese corpus against the annotation information, so that the annotation closest to the Chinese corpus is chosen. The invention stacks multiple LSTM layers to analyze the forward and backward correlation of characters and words and to reflect it in the output. With the current data volume, two layers are stacked: the first layer is a bidirectional LSTM and the second layer is a unidirectional LSTM.

Finally, in step S103, maximum-likelihood inference is performed on the time-series Chinese corpus to determine the annotated Chinese corpus, whose annotation information includes at least simplified annotations and compound annotations. Those skilled in the art will understand that the invention essentially discloses two kinds of annotation information, namely simplified annotations and compound annotations. The simplified annotations include at least: symptom, examination, treatment, drug, disease, negation, cause, medical event, and time. The compound annotations include at least: orientation, anatomy/physiology, clinical manifestation, symptom modifier, examination item, examination result, drug modifier, treatment modifier, disease modifier, and medical history type description. The maximum-likelihood inference is computed from the output of the sequential correlation analysis according to the label transition probabilities.
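
Choosing the most likely label sequence from per-position scores and label transition scores is commonly done with Viterbi decoding. The description does not name the decoding algorithm, so the sketch below is an assumption; the function name `viterbi` and the toy scores are illustrative only.

```python
def viterbi(emission, transition):
    """Return the highest-scoring tag sequence and its score.

    emission:   N x T nested list, emission[i][t] = score of tag t at position i
    transition: T x T nested list, transition[a][b] = score of moving from tag a to tag b
    """
    n, t = len(emission), len(transition)
    best = list(emission[0])  # best score of any path ending in each tag
    back = []                 # back-pointers for positions 1..n-1
    for i in range(1, n):
        step, new_best = [], []
        for cur in range(t):
            prev = max(range(t), key=lambda p: best[p] + transition[p][cur])
            step.append(prev)
            new_best.append(best[prev] + transition[prev][cur] + emission[i][cur])
        back.append(step)
        best = new_best
    last = max(range(t), key=lambda k: best[k])
    path = [last]
    for step in reversed(back):  # walk the back-pointers from the end to the start
        path.append(step[path[-1]])
    path.reverse()
    return path, best[last]


# Toy scores for a 3-character sentence and 2 tags (hypothetical values).
emission = [[2.0, 0.0], [0.0, 1.0], [0.5, 0.4]]
transition = [[0.5, -1.0], [-1.0, 0.5]]
path, score = viterbi(emission, transition)
```

With these toy scores the transition matrix rewards staying in the same tag, so the decoder keeps tag 0 throughout even though position 1 on its own prefers tag 1.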

Further, in Chinese natural-language medical records, medical entities such as symptoms, diseases, and examinations are described in many forms. Some are concise, such as "abdominal pain"; others are detailed and combined with other information, such as "paroxysmal cough, worse in the morning and evening" or "marked tenderness in the left abdomen". The label design therefore has to provide both coarse and fine granularity, i.e. the simplified annotations and compound annotations described above.

Further, the manner of maximum-likelihood inference is described in more detail in the embodiments below and is not elaborated here.

Fig. 2 shows the first embodiment of the invention: a schematic flowchart of a control method for annotating and recognizing a Chinese corpus. As a preferred embodiment of the invention, Fig. 2 shows a more complete technical solution for annotating and recognizing a Chinese corpus, which specifically includes the following steps:

First, in step S201, knowledge base pre-analysis is performed on the Chinese corpus to determine the first Chinese corpus that meets the annotation condition. Those skilled in the art will understand that medical terms of "high credibility" found in the input text are annotated with the corresponding label. "High credibility" means that the term reaches a certain length, at least 5 characters in the present invention, and appears in the knowledge base dictionary, for example: mannitol hexanicotinate, hereditary factor VIII deficiency. Those skilled in the art will understand that, as a necessary step in both model training and prediction, steps S101 to S103 are still performed on the Chinese corpus, i.e. the pre-annotated text also undergoes embedding, LSTM temporal analysis, maximum-likelihood inference, and so on. In this embodiment, knowledge base pre-analysis is performed on the Chinese corpus to determine the first Chinese corpus that meets the annotation condition, and the first Chinese corpus also undergoes the above operations.
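
The pre-analysis of step S201 can be pictured as a dictionary scan that accepts only terms of at least 5 characters. The sketch below is a hypothetical illustration: the greedy longest-match strategy, the function name `pre_annotate`, the label names, and the Chinese sample strings are assumptions, since the description only states the length threshold and the knowledge base membership test.

```python
MIN_TERM_LEN = 5  # the description requires terms of at least 5 characters


def pre_annotate(text, knowledge_base, min_len=MIN_TERM_LEN):
    """Greedy longest-match scan; returns (start, end, term, label) spans, end exclusive."""
    terms = {t: lab for t, lab in knowledge_base.items() if len(t) >= min_len}
    if not terms:
        return []
    max_len = max(len(t) for t in terms)
    spans, i = [], 0
    while i < len(text):
        match = None
        # Try the longest candidate first so a longer term wins over its own prefix.
        for length in range(min(max_len, len(text) - i), min_len - 1, -1):
            candidate = text[i:i + length]
            if candidate in terms:
                match = (i, i + length, candidate, terms[candidate])
                break
        if match:
            spans.append(match)
            i = match[1]
        else:
            i += 1
    return spans


# Hypothetical knowledge base entries; terms and labels are illustrative.
kb = {
    "流行性腮腺炎": "disease",
    "流行性腮腺炎伴有其他并发症": "disease",
    "腹痛": "symptom",  # too short to count as "high credibility"
}
spans = pre_annotate("诊断为流行性腮腺炎伴有其他并发症。", kb)
```

Trying the longest candidate first is what lets the full term win over the shorter "流行性腮腺炎" that it contains.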

Those skilled in the art will understand that, in the present invention, embedding, LSTM temporal analysis, maximum-likelihood inference, and the other operations are preferably performed on the full content of the Chinese corpus; if they were performed on only part of the text, the semantic analysis of the text before and after the annotated part would be affected. In one particular embodiment, however, during knowledge base correction, if the knowledge base pre-analysis has already annotated a span, the label given by the knowledge base pre-analysis prevails and no maximum-likelihood inference is performed for it. This embodiment is described in detail in steps S202 to S205 and is not elaborated here.

Then, in step S202, the vectorized Chinese corpus is determined based on the word vector dictionary and the second Chinese corpus, i.e. the Chinese corpus other than the first Chinese corpus. Those skilled in the art will understand that step S202 is similar to step S101. The main difference is that in step S202 the Chinese corpus is preferably divided into the first Chinese corpus and the second Chinese corpus, i.e. the remainder. The first Chinese corpus does not need to undergo step S202, that is, it does not need maximum-likelihood inference, while the second Chinese corpus undergoes the subsequent steps.

Next, in step S203, sequential correlation analysis is performed on the vectorized Chinese corpus to determine the time-series Chinese corpus. For step S203, refer to step S102 above; it is not elaborated here.

Subsequently, in step S204, the annotated Chinese corpus determined by maximum-likelihood inference and the first Chinese corpus from step i are together output as the Chinese corpus output. For step S204, refer to step S103 above, with the Chinese corpus replaced by the first Chinese corpus for maximum-likelihood inference. Finally, the annotated Chinese corpus determined by maximum-likelihood inference and the first Chinese corpus from step S201 are together output as the Chinese corpus output.

Finally, in step S205, one or more annotated Chinese corpora are output in the form of the annotated Chinese corpus together with its cut points. The inferred labels are used to clearly define the "cut point" between consecutive identical labels, so in the model each label can be regarded as two subclasses, A and AE, where AE denotes A-End. Using the punctuation of the natural language as reference positions, the scope of action of each entity is marked, which removes the ambiguity in understanding that the freedom of Chinese brings.
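
How an A-End subclass separates consecutive entities that share a label can be sketched as below. The tag encoding is an assumption: the description only says that each label has the two subclasses A and AE, so the `-E` suffix, the function name `split_entities`, and the sample text are hypothetical.

```python
END = "-E"  # hypothetical suffix marking the A-End subclass of a label


def split_entities(chars, tags):
    """Group characters into (text, label) entities.

    An entity closes at an A-End tag or whenever the base label changes.
    """
    entities, buf, current = [], [], None
    for ch, tag in zip(chars, tags):
        is_end = tag.endswith(END)
        base = tag[:-len(END)] if is_end else tag
        if buf and base != current:
            entities.append(("".join(buf), current))
            buf = []
        buf.append(ch)
        current = base
        if is_end:
            entities.append(("".join(buf), current))
            buf, current = [], None
    if buf:
        entities.append(("".join(buf), current))
    return entities


chars = list("叶酸钙片")  # two adjacent drug names with no separator
with_cut_points = split_entities(chars, ["Drug", "Drug-E", "Drug", "Drug-E"])
without_cut_points = split_entities(chars, ["Drug", "Drug", "Drug", "Drug"])
```

Without the end tags the four characters merge into a single "Drug" entity; with them, the cut point after the second character yields two entities.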

Those skilled in the art will understand that the technical solution of the invention is next described further through examples. In a preferred embodiment, the Chinese corpus is "Abdomen slightly distended, marked tenderness under the xiphoid process and in the left abdomen. Muscle tone normal, knee reflex neither diminished nor hyperactive, pathological signs negative." With the annotation recognition of the invention, "abdomen" is anatomy/physiology, "slightly" is symptom modifier, "distended" is clinical manifestation, "under the xiphoid process" is anatomy/physiology, "left" is orientation, "abdomen" is anatomy/physiology, "tenderness" is clinical manifestation, "marked" is symptom modifier, "muscle tone" is examination item, "normal" is examination result, "knee reflex" is examination item, "neither diminished nor hyperactive" is examination result, "pathological signs" is examination item, and "negative" is examination result.

In another embodiment of the invention, the Chinese corpus is "No obvious cause, poor appetite, vomiting of gastric contents several times, foamy stool, 4 times/day, slightly tired in spirit". Here "no obvious cause" is cause, "appetite" is anatomy/physiology, "poor" is clinical manifestation, "vomiting" is symptom, "gastric contents" is symptom modifier, "several times" is symptom modifier, "stool" is anatomy/physiology, "foamy" is clinical manifestation, "4 times/day" is symptom modifier, "spirit" is anatomy/physiology, "slightly" is symptom modifier, and "tired" is symptom modifier.

In another embodiment of the invention, the Chinese corpus is "Oral Imuran, hydrocortisone, folic acid, Kaisili D, etc. were given." Here "oral" is treatment; "Imuran", "hydrocortisone", "folic acid", and "Kaisili D" are drugs; and "." is vector paragraph.

Those skilled in the art will understand that outputting one or more annotated Chinese corpora in the form of the annotated Chinese corpus together with its cut points is widely useful in practice. In a preferred embodiment, the Chinese corpus is "The patient has gonococcal infection of the lower genitourinary tract without periurethral or accessory gland abscess." Under the annotation rules of the preceding embodiments, it would be annotated as "gonococcal infection of the lower genitourinary tract", "abscess", ".". Clearly, this does not fully express what the Chinese corpus means. With the improvement provided by the technical solution of the invention, it can be annotated as "gonococcal infection of the lower genitourinary tract without periurethral or accessory gland abscess", ".". As another example, take "Then went to the children's hospital and was diagnosed with mumps with other complications." If annotation relies on model recognition alone, only "mumps" is recognized; with corpus pre-analysis, however, "mumps with other complications" can be recognized directly as a whole. Using the corpus knowledge base to correct the recognition of the Chinese corpus thus improves annotation recognition.

Fig. 3 shows the second embodiment of the invention: a schematic flowchart of performing maximum-likelihood inference on the time-series Chinese corpus to determine the annotated Chinese corpus, which specifically includes the following steps:

First, in step S1031, a first value is calculated for one or more annotations of the time-series Chinese corpus, taking the sentence as the unit. Preferably, step S1031 is determined based on the following formula:

Wherein the emission score is calculated by a Linear layer, i.e. a mapping Xi -> Yi: a dense layer whose input length is 2*LSTM width and whose output length is T, giving the score of each possible tag. For a sentence X of length N and its corresponding annotation Y, the Transition score is a T*T matrix (T^2 parameters) and the Emission score is an N*T matrix (N*T parameters); together they give the score of one particular annotation of the whole sentence. The input of C1 has length 2*LSTM width.
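
The formula itself does not survive in this text. Given the stated definitions (a T*T Transition matrix, an N*T Emission matrix, and a whole-sentence score for one annotation Y), one form consistent with them is the standard linear-chain CRF sentence score. This is offered as an assumption, not as the patent's own formula; the symbol y_i for the tag at position i is introduced here for illustration.

```latex
\mathrm{score}(X, Y) \;=\; \sum_{i=1}^{N} \mathrm{Emission}_{i,\,y_i}
\;+\; \sum_{i=2}^{N} \mathrm{Transition}_{y_{i-1},\,y_i}
```

The first sum rewards tags that fit each position on its own; the second rewards plausible tag-to-tag moves.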

Then, in step S1032, a second value is calculated for all annotations of the time-series Chinese corpus, taking the sentence as the unit.

Further, step S1032 includes at least the following formula:

The second value is sum_exp(score). In the maximum-likelihood inference, part C1 calculates the score of one particular annotation of the whole sentence, and sum_exp(score) is calculated over all possible annotations; Eki is defined as sum_exp(score(Y)) over all of the above Yki.
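
The sum over all possible annotations has T^N terms, but it can be computed in O(N*T^2) time with the forward recursion. The sketch below assumes the usual CRF negative log-likelihood, loss = log(sum_exp(score)) - score(X, Y), which is consistent with the first and second values described above but is not confirmed by the source; all function names and toy scores are hypothetical.

```python
import math
from itertools import product


def sentence_score(emission, transition, tags):
    """First value: the score of one particular tagging of the whole sentence."""
    s = emission[0][tags[0]]
    for i in range(1, len(tags)):
        s += transition[tags[i - 1]][tags[i]] + emission[i][tags[i]]
    return s


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) for a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_partition(emission, transition):
    """Log of the second value: sum_exp(score) over all T**N taggings."""
    t = len(transition)
    alpha = list(emission[0])  # log-sum of path scores ending in each tag
    for i in range(1, len(emission)):
        alpha = [
            emission[i][cur]
            + log_sum_exp([alpha[p] + transition[p][cur] for p in range(t)])
            for cur in range(t)
        ]
    return log_sum_exp(alpha)


def nll_loss(emission, transition, tags):
    """Assumed loss: negative log-likelihood of the reference tagging."""
    return log_partition(emission, transition) - sentence_score(emission, transition, tags)


# Toy scores for a 3-character sentence and 2 tags (hypothetical values).
emission = [[2.0, 0.0], [0.0, 1.0], [0.5, 0.4]]
transition = [[0.5, -1.0], [-1.0, 0.5]]
# Brute-force enumeration of all 2**3 taggings, used only to check the recursion.
brute = log_sum_exp(
    [sentence_score(emission, transition, y) for y in product(range(2), repeat=3)]
)
```

The brute-force enumeration is feasible only for toy inputs; it is included to show that the recursion computes the same quantity.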

Fig. 4 shows another embodiment of the invention: a module connection diagram of a control device for annotating and recognizing a Chinese corpus, in which the Chinese corpus is recognized by a corpus model. The device includes a first determining device, which determines the vectorized Chinese corpus based on the word vector dictionary and the Chinese corpus. For the working principle of the first determining device, refer to step S101 above; it is not elaborated here.

Further, the device includes a second determining device, which performs sequential correlation analysis on the vectorized Chinese corpus to determine the time-series Chinese corpus. For the working principle of the second determining device, refer to step S102 above; it is not elaborated here.

Further, the device includes a third determining device, which performs maximum-likelihood inference on the time-series Chinese corpus to determine the annotated Chinese corpus. For the working principle of the third determining device, refer to step S103 above; it is not elaborated here.

Further, the device includes a fourth determining device, which performs knowledge base pre-analysis on the Chinese corpus to determine the first Chinese corpus that meets the annotation condition. For the working principle of the fourth determining device, refer to step S201 above; it is not elaborated here.

Further, the third determining device includes a first processing unit, which calculates a first value for one or more annotations of the time-series Chinese corpus, taking the sentence as the unit. For the working principle of the first processing unit, refer to step S1031 above; it is not elaborated here.

Further, the third determining device includes a second processing unit, which calculates a second value for all annotations of the time-series Chinese corpus, taking the sentence as the unit. For the working principle of the second processing unit, refer to step S1032 above; it is not elaborated here.

Further, the device includes a third processing unit, which outputs one or more annotated Chinese corpora in the form of the annotated Chinese corpus together with its cut points. For the working principle of the third processing unit, refer to step S205 above; it is not elaborated here.

Specific embodiments of the present invention are described above. It should be understood that the invention is not limited to the particular embodiments above; those skilled in the art can make various variations or modifications within the scope of the claims without affecting the substance of the invention.

Claims

1. A control method for annotating and recognizing a Chinese corpus, in which the Chinese corpus is recognized by a corpus model, characterized by comprising the following steps:

a. determining a vectorized Chinese corpus based on a word vector dictionary and the Chinese corpus, the word vector dictionary being determined from a general Chinese corpus and a medical corpus;

b. performing sequential correlation analysis on the vectorized Chinese corpus to determine a time-series Chinese corpus;

c. performing maximum-likelihood inference on the time-series Chinese corpus to determine an annotated Chinese corpus, the annotation information of the annotated Chinese corpus including at least simplified annotations and compound annotations.

2. The control method according to claim 1, characterized in that step i precedes step a: performing knowledge base pre-analysis on the Chinese corpus to determine a first Chinese corpus that meets the annotation condition.

3. The control method according to claim 2, characterized in that, in step a, the vectorized Chinese corpus is determined based on the word vector dictionary and a second Chinese corpus, i.e. the Chinese corpus other than the first Chinese corpus;

Correspondingly, in step c, the annotated Chinese corpus determined by maximum-likelihood inference and the first Chinese corpus from step i are together output as the Chinese corpus output.

4. The control method according to claim 1, characterized in that, in step c, the maximum-likelihood inference includes:

5. The control method according to claim 4, characterized in that step c1 is determined based on the following formula:

Wherein X is a Chinese corpus of length N, Y is its annotation, Transition is a T*T matrix, and Emission is an N*T matrix.

6. The control method according to claim 4, characterized in that step c2 includes at least the following formula:

Wherein loss is the loss function, X is a Chinese corpus of length N, and Y is its annotation.

7. The control method according to claim 1, characterized in that the simplified annotations include at least: symptom, examination, treatment, drug, disease, negation, cause, medical event, and time.

8. The control method according to claim 1, characterized in that the compound annotations include at least: orientation, anatomy/physiology, clinical manifestation, symptom modifier, examination item, examination result, drug modifier, treatment modifier, disease modifier, and medical history type description.

9. The control method according to claim 1, characterized in that step d follows step c: outputting one or more annotated Chinese corpora in the form of the annotated Chinese corpus together with its cut points.

10. A control device for annotating and recognizing a Chinese corpus, in which the Chinese corpus is recognized by a corpus model, comprising:

First determining device (1): determines a vectorized Chinese corpus based on a word vector dictionary and the Chinese corpus;

Second determining device (2): performs sequential correlation analysis on the vectorized Chinese corpus to determine a time-series Chinese corpus;

Third determining device (3): performs maximum-likelihood inference on the time-series Chinese corpus to determine an annotated Chinese corpus.

11. The control device according to claim 10, characterized by further comprising a fourth determining device (4): performs knowledge base pre-analysis on the Chinese corpus to determine a first Chinese corpus that meets the annotation condition.

12. The control device according to claim 10, characterized in that the third determining device (3) includes:

First processing unit (31): calculates a first value for one or more annotations of the time-series Chinese corpus, taking the sentence as the unit;

Second processing unit (32): calculates a second value for all annotations of the time-series Chinese corpus, taking the sentence as the unit.

13. The control device according to claim 10, characterized by further comprising:

Third processing unit (5): outputs one or more annotated Chinese corpora in the form of the annotated Chinese corpus together with its cut points.