CN110298040A - Control method and control device for annotating and recognizing a Chinese corpus
- Publication number
- CN110298040A (application number CN201910539314.1A)
- Authority
- CN
- China
- Prior art keywords
- chinese corpus
- corpus
- chinese
- mark
- timing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H70/00—ICT specially adapted for the handling or processing of medical references
- G16H70/20—ICT specially adapted for the handling or processing of medical references relating to practices or guidelines
Abstract
The invention discloses a control method for annotating and recognizing a Chinese corpus, in which the Chinese corpus is recognized by a corpus model. The method comprises the following steps: a. determining a vectorized Chinese corpus based on a word vector dictionary and the Chinese corpus, the word vector dictionary being determined from a general Chinese corpus and a medical corpus; b. performing sequential correlation analysis on the vectorized Chinese corpus to determine a time-series Chinese corpus; c. performing maximum likelihood inference on the time-series Chinese corpus to determine an annotated Chinese corpus, the annotation information of the annotated Chinese corpus including at least concise annotations and compound annotations. The invention is simple to use, convenient to operate, powerful, and practical; it learns effectively, recognizes efficiently and accurately, and has high commercial value.
Description
Technical field
The invention belongs to the field of computer applications in medicine and relates, in particular, to a control method and a control device for annotating and recognizing a Chinese corpus.
Background
Medical entity annotation that incorporates medical knowledge is a necessary step in the computer analysis of medical information. It can be used in scenarios such as clinical decision support, triage guidance, medical record quality control, medical report analysis, and medical data analysis, so its field of application is very broad. For example, it can be applied in a HIS (hospital information system) to recognize medical records written in natural language and assist diagnosis; in online registration and triage guidance to recognize the information in a patient's self-description; and in a medical record quality control system to recognize the natural-language parts of a record and analyze the information they contain.
In the prior art, labels such as body part, orientation, symptom, and drug are defined according to general medical knowledge, and annotation rules are formulated. Every sentence in the collected corpus is annotated according to these rules, and the annotated data are then used for training with machine learning methods, finally yielding a computer software model that can recognize medical entities.
However, this technical solution has many defects:
1. The label design is too coarse. For example, only broad labels such as disease, symptom, and treatment are designed. Such a design is too idealized: once negation words, modifiers, and the like appear in the Chinese text, they split the content of a label apart. For example, in "abdomen slightly distended, no tenderness below the xiphoid process and in the abdomen", the negation label <Neg> for the negation word "no" separates "below the xiphoid process and in the abdomen" from "tenderness".
2. The label design lacks support for the scope of a label. Label design carries a strong assumption that medical text is concise and standardized, but real medical records are limited by the author's command of language: punctuation is misused, subjects are arbitrarily omitted, multiple subjects are mixed together, English abbreviations are used frequently, and so on. As a result, the scope of a body part may fail to match or may extend beyond normal sentence boundaries, and multiple subjects and modifiers become entangled, which makes the extracted entities hard to interpret. For example: "no obvious cause, poor appetite, intermittent vomiting, stomach contents several times, foamy stool, 4 times/day, slightly low in spirits". The symptom modifiers "intermittent", "stomach contents several times", "foamy", and "4 times/day" carry no annotation of the scope they modify, so the subsequent entity standardization step has to analyze sentence fragments; if punctuation is also misused, entity standardization errors become likely.
3. Existing medical knowledge bases are underused. Medicine has a great many specialized terms, and a considerable share of them are longer than five characters. The prior art does not consider using such knowledge bases to correct entity recognition. For example, the ICD-10 term "mumps with other complications" would be recognized merely as "mumps" by generic annotation.
At present, no technical solution in the prior art solves the above technical problems; specifically, a control method and control device for annotating and recognizing a Chinese corpus is lacking.
Summary of the invention
In view of the technical deficiencies of the prior art, the object of the present invention is to provide a control method for annotating and recognizing a Chinese corpus, in which the Chinese corpus is recognized by a corpus model, comprising the following steps:
a. determining a vectorized Chinese corpus based on a word vector dictionary and the Chinese corpus, the word vector dictionary being determined from a general Chinese corpus and a medical corpus;
b. performing sequential correlation analysis on the vectorized Chinese corpus to determine a time-series Chinese corpus;
c. performing maximum likelihood inference on the time-series Chinese corpus to determine an annotated Chinese corpus, the annotation information of the annotated Chinese corpus including at least concise annotations and compound annotations.
Preferably, the method further comprises, before step a, a step i: performing knowledge base pre-analysis on the Chinese corpus to determine a first Chinese corpus that meets the annotation condition;
correspondingly, in step a, the vectorized Chinese corpus is determined based on the word vector dictionary and a second Chinese corpus, that is, the Chinese corpus other than the first Chinese corpus;
correspondingly, in step c, the annotated Chinese corpus determined by maximum likelihood inference and the first Chinese corpus from step i are output together as the Chinese corpus output.
Preferably, in step c, the maximum likelihood inference comprises:
c1: calculating a first value for one or more annotations of the time-series Chinese corpus, taking the sentence as the unit;
c2: calculating a second value over all annotations of the time-series Chinese corpus, taking the sentence as the unit.
Preferably, step c1 is determined based on the following formula:
[formula image not reproduced]
where X is a Chinese corpus of length N, Y is its annotation, Transition is a T*T matrix, and Emission is an N*T matrix.
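The formula referred to above is not reproduced in this text. As a hedged reconstruction only, assuming the standard linear-chain scoring that the Transition and Emission definitions suggest (the tag symbols y_i and the index conventions are assumptions, not taken from the source), the score of one annotation Y = (y_1, ..., y_N) of a sentence X could be written as:

```latex
\mathrm{score}(X, Y) \;=\; \sum_{i=1}^{N} \mathrm{Emission}_{i,\,y_i} \;+\; \sum_{i=2}^{N} \mathrm{Transition}_{y_{i-1},\,y_i}
```

Under this reading, Emission contributes one term per position and Transition contributes one term per pair of adjacent tags, which matches the stated shapes N*T and T*T.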
Preferably, step c2 includes at least the following formula:
[formula image not reproduced]
where loss is the loss function, X is a Chinese corpus of length N, and Y is its annotation.
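The loss formula is likewise not reproduced in this text. A plausible form, assuming the usual negative log-likelihood of a conditional random field (an assumption; the source only names a loss function and, later, a sum_exp(score) term), with score(X, Y) denoting the first value calculated in step c1 and Y' ranging over all possible annotations of X, would be:

```latex
\mathrm{loss}(X, Y) \;=\; \log \sum_{Y'} \exp\bigl(\mathrm{score}(X, Y')\bigr) \;-\; \mathrm{score}(X, Y)
```

In this form the sum inside the logarithm plays the role of the second value of step c2, and minimizing the loss maximizes the likelihood of the reference annotation Y.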
Preferably, the concise annotations include at least: symptom, examination, treatment, drug, disease, negation, cause, medical event, and time.
Preferably, the compound annotations include at least: orientation, anatomy/physiology, clinical manifestation, symptom modifier, examination item, examination result, drug modifier, treatment modifier, disease modifier, and medical history type description.
Preferably, the method further comprises, after step c, a step d: outputting one or more annotated Chinese corpora in the form of the annotated Chinese corpus together with the segmentation points of the annotated Chinese corpus.
According to another aspect of the invention, a control device for annotating and recognizing a Chinese corpus is provided, which recognizes the Chinese corpus by a corpus model and comprises:
a first determining device, which determines a vectorized Chinese corpus based on a word vector dictionary and the Chinese corpus;
a second determining device, which performs sequential correlation analysis on the vectorized Chinese corpus to determine a time-series Chinese corpus;
a third determining device, which performs maximum likelihood inference on the time-series Chinese corpus to determine an annotated Chinese corpus.
Preferably, the device further comprises a fourth determining device, which performs knowledge base pre-analysis on the Chinese corpus to determine a first Chinese corpus that meets the annotation condition.
Preferably, the third determining device comprises:
a first processing device, which calculates a first value for one or more annotations of the time-series Chinese corpus, taking the sentence as the unit;
a second processing device, which calculates a second value over all annotations of the time-series Chinese corpus, taking the sentence as the unit.
Preferably, the device further comprises:
a third processing device, which outputs one or more annotated Chinese corpora in the form of the annotated Chinese corpus together with the segmentation points of the annotated Chinese corpus.
The invention discloses a control method and a control device for annotating and recognizing a Chinese corpus. The Chinese corpus is recognized by a corpus model: a vectorized Chinese corpus is determined based on a word vector dictionary and the Chinese corpus; sequential correlation analysis is performed on the vectorized Chinese corpus to determine a time-series Chinese corpus; and maximum likelihood inference is performed on the time-series Chinese corpus to determine an annotated Chinese corpus. In the label design, fine-grained and coarse-grained labels coexist. A "vector paragraph" label is added to the label design to assist subsequent work such as entity understanding and the delimitation of analysis segments. In the maximum likelihood inference of label recognition, an online correction module based on a medical knowledge base is added. In the time-series analysis of sentences, stacked multi-layer LSTMs analyze the forward and backward correlations between characters and words and reflect them in the output. The invention is simple to use, convenient to operate, powerful, and practical; it learns effectively, is efficient and accurate, and has high commercial value.
Brief description of the drawings
Other features, objects, and advantages of the invention will become more apparent upon reading the detailed description of non-limiting embodiments with reference to the following drawings:
Fig. 1 shows, for one specific embodiment of the invention, a flowchart of a control method for annotating and recognizing a Chinese corpus;
Fig. 2 shows, for a first embodiment of the invention, a flowchart of a control method for annotating and recognizing a Chinese corpus;
Fig. 3 shows, for a second embodiment of the invention, a flowchart of performing maximum likelihood inference on the time-series Chinese corpus to determine the annotated Chinese corpus; and
Fig. 4 shows, for another specific embodiment of the invention, a module connection diagram of a control device for annotating and recognizing a Chinese corpus.
Detailed description of the embodiments
To present the technical solution of the invention more clearly, the invention is further described below with reference to the drawings.
Fig. 1 shows, for one specific embodiment of the invention, a flowchart of a control method for annotating and recognizing a Chinese corpus. Those skilled in the art will understand that the invention mainly recognizes a Chinese corpus by means of a corpus model, so that the Chinese corpus is segmented in a way that can support clinical decision support, triage guidance, medical record quality control, medical report analysis, and medical data analysis. The invention itself does not involve the diagnosis or treatment of disease and is therefore patentable subject matter, which is not elaborated here.
Specifically, through the embodiment shown in Fig. 1, the invention describes a control method for annotating and recognizing a Chinese corpus, comprising the following steps:
First, in step S101, a vectorized Chinese corpus is determined based on a word vector dictionary and the Chinese corpus, the word vector dictionary being determined from a general Chinese corpus and a medical corpus. Those skilled in the art will understand that the word vector dictionary is obtained by joint training on the general Chinese corpus and the medical corpus; each character or word is looked up in the dictionary, and the corresponding sequence of word vectors is fed into the sequential correlation analysis. At present, each word vector is a 128-dimensional vector. In this embodiment, the Chinese corpus is processed with the word vector dictionary to obtain the vectorized Chinese corpus.
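The dictionary lookup of step S101 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the `<UNK>` fallback entry, the per-character granularity, and the sample sentences are assumptions, and random numbers stand in for vectors that the description says are obtained by joint training.

```python
import random

EMBEDDING_DIM = 128  # the description states that 128-dimensional vectors are used
UNKNOWN = "<UNK>"    # hypothetical fallback entry for characters not in the dictionary


def build_vector_dictionary(corpus_texts, dim=EMBEDDING_DIM, seed=0):
    """Give every distinct character a vector (random stand-ins for trained vectors)."""
    rng = random.Random(seed)
    vocabulary = {UNKNOWN}
    for text in corpus_texts:
        vocabulary.update(text)  # add each character of the text
    return {ch: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for ch in sorted(vocabulary)}


def vectorize(text, dictionary):
    """Look up each character and return the sequence of vectors."""
    return [dictionary.get(ch, dictionary[UNKNOWN]) for ch in text]


# Illustrative sample sentences (invented for this sketch).
general_corpus = ["今天天气很好"]
medical_corpus = ["腹部稍胀，剑突下及腹部无压痛"]
dictionary = build_vector_dictionary(general_corpus + medical_corpus)
vectors = vectorize("腹部压痛", dictionary)
```

Here `vectors` holds one 128-dimensional vector per input character, which is the form of input that the sequential correlation analysis of step S102 consumes.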
Then, in step S102, sequential correlation analysis is performed on the vectorized Chinese corpus to determine the time-series Chinese corpus. In this embodiment, the invention preferably uses an LSTM model: the vectorized Chinese corpus is taken as input and analyzed in the LSTM model to obtain the time-series Chinese corpus. The purpose of obtaining the time-series Chinese corpus is to better compare each character or word in the Chinese corpus with the annotation information, so as to determine the annotation that most closely matches the Chinese corpus. The invention stacks multiple LSTM layers to analyze the forward and backward correlations between characters and words and reflect them in the output. With the current volume of data, two layers are stacked: the first layer is a bidirectional LSTM and the second layer is a unidirectional LSTM.
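The two-layer stack described above (a bidirectional first layer feeding a unidirectional second layer) can be sketched in pure Python. This is only an illustration of the data flow: the class and function names, the weight initialization, and the small sizes are assumptions, and the weights are random rather than trained.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class LSTMCell:
    """A single LSTM cell with input, forget, candidate and output gates."""

    def __init__(self, input_size, hidden_size, rng):
        self.hidden_size = hidden_size
        width = input_size + hidden_size
        self.weights = {
            name: [[rng.uniform(-0.1, 0.1) for _ in range(width)] for _ in range(hidden_size)]
            for name in ("input", "forget", "candidate", "output")
        }
        self.bias = {name: [0.0] * hidden_size for name in self.weights}

    def _gate(self, name, activation, z):
        return [
            activation(sum(w * v for w, v in zip(row, z)) + b)
            for row, b in zip(self.weights[name], self.bias[name])
        ]

    def step(self, x, h, c):
        z = x + h  # concatenate the input vector and the previous hidden state
        i = self._gate("input", sigmoid, z)
        f = self._gate("forget", sigmoid, z)
        g = self._gate("candidate", math.tanh, z)
        o = self._gate("output", sigmoid, z)
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new


def run_layer(cell, sequence):
    """Run one cell over a sequence and collect the hidden state at every position."""
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    outputs = []
    for x in sequence:
        h, c = cell.step(x, h, c)
        outputs.append(h)
    return outputs


def stacked_lstm(sequence, input_size, hidden_size, seed=0):
    rng = random.Random(seed)
    forward = LSTMCell(input_size, hidden_size, rng)
    backward = LSTMCell(input_size, hidden_size, rng)
    top = LSTMCell(2 * hidden_size, hidden_size, rng)
    # Layer 1 (bidirectional): read left-to-right and right-to-left, then concatenate.
    forward_out = run_layer(forward, sequence)
    backward_out = run_layer(backward, sequence[::-1])[::-1]
    layer_one = [f + b for f, b in zip(forward_out, backward_out)]
    # Layer 2 (unidirectional): read the concatenated states left-to-right.
    return run_layer(top, layer_one)


# Toy input: 5 positions with 4-dimensional vectors (the description uses 128).
sequence = [[0.1 * i, 0.2, -0.1, 0.05 * i] for i in range(5)]
outputs = stacked_lstm(sequence, input_size=4, hidden_size=3)
```

Each position of `outputs` depends on both the preceding and the following positions through the first layer, which is the forward and backward correlation the description refers to. A production system would normally use a deep learning framework rather than hand-written cells.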
Finally, in step S103, maximum likelihood inference is performed on the time-series Chinese corpus to determine the annotated Chinese corpus, whose annotation information includes at least concise annotations and compound annotations. Those skilled in the art will understand that the invention essentially discloses two kinds of annotation information, concise annotations and compound annotations. The concise annotations include at least: symptom, examination, treatment, drug, disease, negation, cause, medical event, and time. The compound annotations include at least: orientation, anatomy/physiology, clinical manifestation, symptom modifier, examination item, examination result, drug modifier, treatment modifier, disease modifier, and medical history type description. The maximum likelihood inference is computed from the output of the sequential correlation analysis according to the label transition probabilities.
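One common way to pick the most likely label sequence from per-position scores and label transition scores is Viterbi decoding. The source does not name the decoding algorithm, so the sketch below is an assumption; the function name and the toy scores are illustrative.

```python
def viterbi(emission, transition):
    """Return the highest-scoring tag path.

    emission:   N x T scores, emission[i][j] = score of tag j at position i
    transition: T x T scores, transition[k][j] = score of moving from tag k to tag j
    """
    num_tags = len(transition)
    best = list(emission[0])  # best score of any path ending in each tag
    back_pointers = []
    for i in range(1, len(emission)):
        new_best, pointers = [], []
        for j in range(num_tags):
            prev = max(range(num_tags), key=lambda k: best[k] + transition[k][j])
            pointers.append(prev)
            new_best.append(best[prev] + transition[prev][j] + emission[i][j])
        best = new_best
        back_pointers.append(pointers)
    # Trace the best path backwards from the best final tag.
    path = [max(range(num_tags), key=lambda k: best[k])]
    for pointers in reversed(back_pointers):
        path.append(pointers[path[-1]])
    return path[::-1]


# Toy scores for 3 positions and 2 tags.
emission = [[2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
free_switching = [[0.0, 0.0], [0.0, 0.0]]
costly_switching = [[0.0, -5.0], [-5.0, 0.0]]
path_free = viterbi(emission, free_switching)
path_costly = viterbi(emission, costly_switching)
```

With free switching the decoder follows the best tag at each position; with a penalty on switching it prefers to stay on one tag, which shows how transition scores shape the inferred annotation.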
Further, in Chinese natural-language medical records, medical entities such as symptoms, diseases, and examinations are described in many forms. Some descriptions are concise, such as "abdominal pain"; others are detailed and combined with other information, such as "paroxysmal cough, worse in the morning and evening" or "marked tenderness in the left abdomen". To handle this, the label design must offer both coarse and fine granularity, namely the concise annotations and compound annotations described above.
Further, the manner of maximum likelihood inference is described in more detail in the embodiments below and is not elaborated here.
Fig. 2 shows, for a first embodiment of the invention, a flowchart of a control method for annotating and recognizing a Chinese corpus. As a preferred embodiment of the invention, Fig. 2 shows a more complete technical solution for annotating and recognizing a Chinese corpus, which specifically comprises the following steps:
First, in step S201, knowledge base pre-analysis is performed on the Chinese corpus to determine a first Chinese corpus that meets the annotation condition. Those skilled in the art will understand that this step finds "highly credible" medical terms in the input text and assigns them the corresponding labels. "Highly credible" means that the term reaches a certain length, at least five characters in the present invention, and appears in the knowledge base dictionary, for example mannitol hexanicotinate or hereditary factor VIII deficiency. Those skilled in the art will understand that, as a necessary step in both model training and prediction, the operations of steps S101 to S103 are still performed on the Chinese corpus; that is, the text that has already been labeled also undergoes embedding, LSTM time-series analysis, maximum likelihood inference, and so on. In this embodiment, knowledge base pre-analysis is performed on the Chinese corpus to determine the first Chinese corpus that meets the annotation condition, and the above operations are also performed on the first Chinese corpus.
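The pre-analysis of step S201 can be sketched as a dictionary match restricted to long terms. The matching strategy (longest match first, scanning left to right), the function name, the span format, and the sample dictionary entries are assumptions; the source only states the length threshold and the dictionary membership requirement.

```python
MIN_TERM_LENGTH = 5  # the description requires terms of at least five characters


def knowledge_base_preanalysis(text, knowledge_base, min_length=MIN_TERM_LENGTH):
    """Return (start, end, label) spans for long knowledge-base terms found in text."""
    # Only terms that are long enough count as "highly credible"; try longer terms first.
    terms = sorted(
        (term for term in knowledge_base if len(term) >= min_length),
        key=len,
        reverse=True,
    )
    spans = []
    position = 0
    while position < len(text):
        match = next((term for term in terms if text.startswith(term, position)), None)
        if match is not None:
            spans.append((position, position + len(match), knowledge_base[match]))
            position += len(match)
        else:
            position += 1
    return spans


# Hypothetical knowledge base: term -> label (entries invented for this sketch).
long_term = "流行性腮腺炎伴其他并发症"
knowledge_base = {long_term: "disease", "腮腺炎": "disease"}
text = "诊断为" + long_term + "。"
spans = knowledge_base_preanalysis(text, knowledge_base)
```

The short entry "腮腺炎" is ignored because it has fewer than five characters, so only the long term is pre-labeled; shorter entities are left to the model.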
Those skilled in the art will understand that the invention preferably applies embedding, LSTM time-series analysis, maximum likelihood inference, and the other operations to the entire Chinese corpus; if these operations were applied only to part of the text, the semantic analysis of the text before and after the labeled part would be affected. In one very particular embodiment, however, during knowledge base correction, if the knowledge base pre-analysis has already produced a label, the label given by the knowledge base pre-analysis prevails and no maximum likelihood inference is carried out. This embodiment is described in detail in steps S202 to S205 and is not elaborated here.
Then, in step S202, the vectorized Chinese corpus is determined based on the word vector dictionary and the second Chinese corpus, that is, the Chinese corpus other than the first Chinese corpus. Those skilled in the art will understand that step S202 is similar to step S101. The main difference is that in step S202 the Chinese corpus is preferably divided into the first Chinese corpus and the second Chinese corpus. The first Chinese corpus does not need to undergo step S202, that is, it does not need maximum likelihood inference, whereas the subsequent steps are performed on the second Chinese corpus.
Next, in step S203, sequential correlation analysis is performed on the vectorized Chinese corpus to determine the time-series Chinese corpus. Step S203 can refer to step S102 above and is not elaborated here.
Subsequently, in step S204, the annotated Chinese corpus determined by maximum likelihood inference and the first Chinese corpus from step i are output together as the Chinese corpus output. Step S204 can refer to step S103 above, with the Chinese corpus there replaced by the first Chinese corpus for the maximum likelihood inference. Finally, the annotated Chinese corpus determined by maximum likelihood inference and the first Chinese corpus from step S201 are output together as the Chinese corpus output.
Finally, in step S205, one or more annotated Chinese corpora are output in the form of the annotated Chinese corpus together with its segmentation points. To clearly define the "segmentation points" between consecutive identical inferred labels, every label A is treated in the model as two subclasses, A and AE, where AE denotes A-End. Using the punctuation of the natural language as a reference position, the scope of each entity is marked, which removes the ambiguity in understanding caused by the freedom of Chinese expression.
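The A / A-End scheme can be illustrated by a small routine that turns a per-character tag sequence into entities. The "-End" suffix spelling, the function name, and the toy input are assumptions; the source only states that each label A has an end subclass AE.

```python
END_SUFFIX = "-End"  # hypothetical spelling of the "AE" (A-End) subclass


def split_entities(characters, tags):
    """Group characters into (text, label) entities, closing an entity at every end tag."""
    entities = []
    current, current_label = [], None
    for ch, tag in zip(characters, tags):
        is_end = tag.endswith(END_SUFFIX)
        label = tag[: -len(END_SUFFIX)] if is_end else tag
        if current and label != current_label:
            # The label changed without an explicit end tag: close the open entity.
            entities.append(("".join(current), current_label))
            current = []
        current.append(ch)
        current_label = label
        if is_end:
            # An end tag is a segmentation point, even if the next label is identical.
            entities.append(("".join(current), current_label))
            current, current_label = [], None
    if current:
        entities.append(("".join(current), current_label))
    return entities


# Two adjacent drug names share the label "drug"; the end tag keeps them apart.
characters = list("abcde")
tags = ["drug", "drug-End", "drug", "drug", "drug-End"]
entities = split_entities(characters, tags)
```

Without the end subclass, the five characters would collapse into a single "drug" entity; with it, the sketch yields two separate entities.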
Those skilled in the art will understand that the technical solution of the invention is described further below by way of examples. In a preferred embodiment of the invention, the Chinese corpus is "abdomen slightly distended, marked tenderness below the xiphoid process and in the left abdomen. Muscle tone normal, knee reflex neither diminished nor hyperactive, pathological signs negative." After annotation and recognition by the invention, "abdomen" is anatomy/physiology, "slightly" is a symptom modifier, "distended" is a clinical manifestation, "below the xiphoid process" is anatomy/physiology, "left" is orientation, "abdomen" is anatomy/physiology, "tenderness" is a clinical manifestation, "marked" is a symptom modifier, "muscle tone" is an examination item, "normal" is an examination result, "knee reflex" is an examination item, "neither diminished nor hyperactive" is an examination result, "pathological signs" is an examination item, and "negative" is an examination result.
In another embodiment of the invention, the Chinese corpus is "no obvious cause, poor appetite, vomiting of stomach contents several times, foamy stool, 4 times/day, slightly low in spirits". Here "no obvious cause" is a cause, "appetite" is anatomy/physiology, "poor" is a clinical manifestation, "vomiting" is a symptom, "stomach contents" is a symptom modifier, "several times" is a symptom modifier, "stool" is anatomy/physiology, "foamy" is a clinical manifestation, "4 times/day" is a symptom modifier, "spirits" is anatomy/physiology, "slightly" is a symptom modifier, and "low" is a symptom modifier.
In another embodiment of the invention, the Chinese corpus is "Oral imuran, hydrocortisone, folic acid, Kaisili D, etc. were given.", where "oral" is a treatment; "imuran", "hydrocortisone", "folic acid", and "Kaisili D" are drugs; and "." is a vector paragraph.
Those skilled in the art will understand that outputting one or more annotated Chinese corpora in the form of the annotated Chinese corpus together with its segmentation points is very widely useful in practice. In a preferred embodiment, the Chinese corpus is "The patient has gonococcal infection of the lower genitourinary tract without abscess of the periurethral or accessory glands." Under the annotation rules of the preceding embodiments, it would be labeled as "gonococcal infection of the lower genitourinary tract", "abscess", ".". Clearly, this does not fully express what the Chinese text intends. With the improvement provided by the technical solution of the invention, it can instead be labeled as "gonococcal infection of the lower genitourinary tract without abscess of the periurethral or accessory glands", ".". As another example, take "The patient then visited the children's hospital and was diagnosed with mumps with other complications." With model recognition alone, only "mumps" would be recognized, but with corpus pre-analysis, "mumps with other complications" can be recognized directly as a whole. Using the knowledge base in this way corrects the recognition of the Chinese corpus and improves annotation recognition.
Fig. 3 shows, for a second embodiment of the invention, a flowchart of performing maximum likelihood inference on the time-series Chinese corpus to determine the annotated Chinese corpus, which specifically comprises the following steps:
First, in step S1031, a first value is calculated for one or more annotations of the time-series Chinese corpus, taking the sentence as the unit. Preferably, step S1031 is determined based on the following formula:
[formula image not reproduced]
Here the emission score is computed by a linear layer that takes the LSTM output as input and maps each Xi to Yi: a dense layer with input length 2*LSTM width and output length T, giving the score of every possible tag. For a sentence X of length N with corresponding annotation Y, the Transition score is a T*T matrix (T^2 parameters) and the Emission score is an N*T matrix (N*T parameters). From these, the score of a given annotation of the whole sentence, C1, is calculated; its input has length 2*LSTM width.
Then, in step S1032, a second value is calculated over all annotations of the time-series Chinese corpus, taking the sentence as the unit.
Further, step S1032 includes at least the following formula:
[formula image not reproduced]
The second value is sum_exp(score). In the maximum likelihood inference, the score of one particular whole-sentence annotation is calculated in part C1, and sum_exp(score) is calculated over all possible annotations; Eki is defined as the sum_exp(score(Y)) over all of the above Yki.
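Summing exp(score) over every possible annotation grows exponentially with sentence length if done naively. A common technique, assumed here rather than taken from the source, is the forward recursion of a linear-chain conditional random field, which yields the logarithm of that sum in time linear in the sentence length. All function names below are illustrative, and the brute-force version is included only as a cross-check.

```python
import itertools
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def sentence_score(emission, transition, tags):
    score = emission[0][tags[0]]
    for i in range(1, len(tags)):
        score += transition[tags[i - 1]][tags[i]] + emission[i][tags[i]]
    return score


def log_partition(emission, transition):
    """log of sum_exp(score) over all annotations, via the forward recursion."""
    num_tags = len(transition)
    alpha = list(emission[0])
    for i in range(1, len(emission)):
        alpha = [
            log_sum_exp([alpha[k] + transition[k][j] for k in range(num_tags)]) + emission[i][j]
            for j in range(num_tags)
        ]
    return log_sum_exp(alpha)


def brute_force_log_partition(emission, transition):
    """Same quantity by enumerating every annotation (only feasible for tiny inputs)."""
    num_tags = len(transition)
    scores = [
        sentence_score(emission, transition, tags)
        for tags in itertools.product(range(num_tags), repeat=len(emission))
    ]
    return log_sum_exp(scores)


def negative_log_likelihood(emission, transition, tags):
    """Loss of one reference annotation: log sum_exp(score) minus its own score."""
    return log_partition(emission, transition) - sentence_score(emission, transition, tags)


emission = [[1.0, 0.5], [0.2, 1.5], [0.3, 0.1]]
transition = [[0.1, -0.4], [0.6, 0.2]]
fast = log_partition(emission, transition)
slow = brute_force_log_partition(emission, transition)
loss = negative_log_likelihood(emission, transition, [0, 1, 1])
```

The two ways of computing the sum agree, and the loss is never negative because the score of any single annotation cannot exceed the log of the sum over all annotations.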
Fig. 4 shows, for another specific embodiment of the invention, a module connection diagram of a control device for annotating and recognizing a Chinese corpus. The device recognizes the Chinese corpus by a corpus model and includes a first determining device, which determines a vectorized Chinese corpus based on a word vector dictionary and the Chinese corpus. The working principle of the first determining device can refer to step S101 above and is not elaborated here.
Further, the device includes a second determining device, which performs sequential correlation analysis on the vectorized Chinese corpus to determine a time-series Chinese corpus. The working principle of the second determining device can refer to step S102 above and is not elaborated here.
Further, the device includes a third determining device, which performs maximum likelihood inference on the time-series Chinese corpus to determine an annotated Chinese corpus. The working principle of the third determining device can refer to step S103 above and is not elaborated here.
Further, the device includes a fourth determining device, which performs knowledge base pre-analysis on the Chinese corpus to determine a first Chinese corpus that meets the annotation condition. The working principle of the fourth determining device can refer to step S201 above and is not elaborated here.
Further, the third determining device includes a first processing device, which calculates a first value for one or more annotations of the time-series Chinese corpus, taking the sentence as the unit. The working principle of the first processing device can refer to step S1031 above and is not elaborated here.
Further, the third determining device includes a second processing device, which calculates a second value over all annotations of the time-series Chinese corpus, taking the sentence as the unit. The working principle of the second processing device can refer to step S1032 above and is not elaborated here.
Further, the device includes a third processing device, which outputs one or more annotated Chinese corpora in the form of the annotated Chinese corpus together with the segmentation points of the annotated Chinese corpus. The working principle of the third processing device can refer to step S205 above and is not elaborated here.
Specific embodiments of the invention have been described above. It should be understood that the invention is not limited to the particular embodiments described; those skilled in the art can make various variations or modifications within the scope of the claims without affecting the substance of the invention.
Claims (13)
1. a kind of control method that identification is labeled to Chinese corpus identifies Chinese corpus by corpus model,
It is characterized by comprising the following steps:
A. determine that vector quantization Chinese corpus, the word vector dictionary are based on conventional Chinese based on word vector dictionary and Chinese corpus
Corpus and medicine corpus determine;
B. vector quantization Chinese corpus is subjected to sequential correlation analysis, determines timing Chinese corpus;
C. timing Chinese corpus is subjected to maximum likelihood and infers that the Chinese corpus of determining mark, the mark for marking Chinese corpus are believed
Breath, which includes at least, simplifies mark and compound mark.
2. control method according to claim 1, which is characterized in that further include step i: centering before the step a
Literary corpus carries out knowledge base preanalysis, is determined for compliance with the first Chinese corpus of mark condition.
3. control method according to claim 2, which is characterized in that in the step a, based on word vector dictionary and
The second Chinese corpus in addition to the first Chinese corpus determines vector quantization Chinese corpus;
Correspondingly, in the step c, first in the Chinese corpus of determining mark and step i will be inferred by maximum likelihood
Chinese corpus is as Chinese corpus output.
4. control method according to claim 1, which is characterized in that in the step c, the maximum likelihood infers packet
It includes:
C1: the first value of one or more timing Chinese corpus labelings as unit of sentence is calculated;
C2: the second value of all timing Chinese corpus labelings as unit of sentence is calculated.
5. control method according to claim 4, which is characterized in that the step c1 is determined based on following formula:
Wherein, X is the Chinese corpus that a length is N, and Y is mark, and Transition is one
The matrix of T*T, Emission are the matrix of a N*T.
6. control method according to claim 4, which is characterized in that the step c2 includes at least following formula:
Wherein, the loss is missing function, and the X is that a length is N's
Chinese corpus, the Y are mark.
7. control method according to claim 1, which is characterized in that the mark of simplifying includes at least: symptom, inspection,
Treatment, drug, disease, negative, the cause of disease, medical events and time.
8. The control method according to claim 1, characterized in that the compound annotations include at least: orientation, anatomy/physiology, clinical manifestation, symptom modifier, examination item, examination result, drug modifier, treatment modifier, disease modifier, and medical history type description.
9. The control method according to claim 1, characterized in that it further comprises, after step c, a step d: outputting one or more annotated Chinese corpora in the form of the annotated Chinese corpus together with the split points of the annotated Chinese corpus.
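One possible reading of step d is sketched below, assuming the model yields one label per character and that a split point is the character offset at which the label changes. The function name `to_segments`, the per-character label list, and the example sentence are illustrative assumptions and not taken from the claims.

```python
def to_segments(text, labels):
    """Group consecutive characters that share a label.

    Returns (segments, split_points): segments is a list of (substring, label)
    pairs, split_points lists the offsets where one segment ends and the next begins.
    """
    segments, split_points = [], []
    start = 0
    for i in range(1, len(text) + 1):
        # Close the current segment at the end of the text or when the label changes.
        if i == len(text) or labels[i] != labels[start]:
            segments.append((text[start:i], labels[start]))
            if i < len(text):
                split_points.append(i)
            start = i
    return segments, split_points


segments, split_points = to_segments("发热三天", ["symptom", "symptom", "time", "time"])
```

Here the four characters fall into a symptom segment and a time segment, with a single split point between them.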
10. A control device for annotating and recognizing a Chinese corpus, which recognizes the Chinese corpus by means of a corpus model, comprising:
a first determining device (1), which determines a vectorized Chinese corpus based on a word vector dictionary and the Chinese corpus;
a second determining device (2), which performs sequential correlation analysis on the vectorized Chinese corpus to determine a sequential Chinese corpus;
a third determining device (3), which performs maximum likelihood inference on the sequential Chinese corpus to determine the annotated Chinese corpus.
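The claims do not spell out how the third determining device (3) selects the final annotation. As an assumption, the sketch below uses Viterbi decoding, a standard way to find the highest-scoring annotation sequence when scores are built from an N by T emission matrix and a T by T transition matrix as in claim 5. The function name `viterbi` and the example numbers are hypothetical.

```python
def viterbi(emission, transition):
    """Return (best_tags, best_score) for one sentence.

    emission:   N x T nested list of per-position tag scores
    transition: T x T nested list of tag-to-tag scores
    """
    n, t = len(emission), len(emission[0])
    score = list(emission[0])  # best score of any path ending in each tag
    back = []                  # back[i-1][c]: best previous tag for tag c at position i
    for i in range(1, n):
        prev_best, new_score = [], []
        for c in range(t):
            p = max(range(t), key=lambda q: score[q] + transition[q][c])
            prev_best.append(p)
            new_score.append(score[p] + transition[p][c] + emission[i][c])
        back.append(prev_best)
        score = new_score
    # Follow the back-pointers from the best final tag to recover the full path.
    last = max(range(t), key=lambda c: score[c])
    path = [last]
    for prev_best in reversed(back):
        path.append(prev_best[path[-1]])
    path.reverse()
    return path, score[last]


emission = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
transition = [[0.1, 0.2], [0.3, 0.4]]
best_tags, best_score = viterbi(emission, transition)
```

Dynamic programming keeps the cost at roughly N times T squared operations per sentence instead of scoring every one of the T to the power N candidate annotations.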
11. The control device according to claim 10, characterized in that it further comprises a fourth determining device (4), which performs knowledge base pre-analysis on the Chinese corpus to determine a first Chinese corpus that meets the annotation condition.
12. The control device according to claim 10, characterized in that the third determining device (3) comprises:
a first processing device (31), which calculates a first value for one or more annotations of the sequential Chinese corpus, taking the sentence as the unit;
a second processing device (32), which calculates a second value for all annotations of the sequential Chinese corpus, taking the sentence as the unit.
13. The control device according to claim 10, characterized in that it further comprises:
a third processing device (5), which outputs one or more annotated Chinese corpora in the form of the annotated Chinese corpus together with the split points of the annotated Chinese corpus.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910539314.1A CN110298040A (en) | 2019-06-20 | 2019-06-20 | A kind of pair of Chinese corpus is labeled the control method and control device of identification |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910539314.1A CN110298040A (en) | 2019-06-20 | 2019-06-20 | A kind of pair of Chinese corpus is labeled the control method and control device of identification |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110298040A true CN110298040A (en) | 2019-10-01 |
Family
ID=68028502
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910539314.1A Pending CN110298040A (en) | 2019-06-20 | 2019-06-20 | A kind of pair of Chinese corpus is labeled the control method and control device of identification |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110298040A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105894088A (en) * | 2016-03-25 | 2016-08-24 | 苏州赫博特医疗信息科技有限公司 | Medical information extraction system and method based on deep learning and distributed semantic features |
CN107977361A (en) * | 2017-12-06 | 2018-05-01 | 哈尔滨工业大学深圳研究生院 | Chinese clinical medical entity recognition method based on deep semantic information representation |
CN108897989A (en) * | 2018-06-06 | 2018-11-27 | 大连理工大学 | Biological event extraction method based on a candidate event element attention mechanism |
CN109471895A (en) * | 2018-10-29 | 2019-03-15 | 清华大学 | Method and system for extracting electronic health record phenotypes and standardizing phenotype names |
CN109670177A (en) * | 2018-12-20 | 2019-04-23 | 翼健(上海)信息科技有限公司 | LSTM-based control method and control device for medical semantic normalization |
2019-06-20: CN application CN201910539314.1A (patent/CN110298040A/en), active, Pending
Non-Patent Citations (2)
Title |
---|
ROLAND ROLLER et al.: "A fine-grained corpus annotation schema of German nephrology records", PROCEEDINGS OF THE CLINICAL NATURAL LANGUAGE PROCESSING WORKSHOP, pages 69 * |
YING QIN et al.: "Research of Clinical Named Entity Recognition Based on Bi-LSTM-CRF", JOURNAL OF SHANGHAI JIAOTONG UNIVERSITY (SCIENCE), pages 392 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110032648B (en) | Medical record structured analysis method based on medical field entity | |
CN112214995B (en) | Hierarchical multitasking term embedded learning for synonym prediction | |
CN112597774B (en) | Chinese medical named entity recognition method, system, storage medium and equipment | |
CN108628824A (en) | Entity recognition method based on Chinese electronic health records | |
CN110705293A (en) | Electronic medical record text named entity recognition method based on pre-training language model | |
CN106844351B (en) | Medical institution organization entity identification method and device oriented to multiple data sources | |
Alhumoud et al. | Arabic sentiment analysis using recurrent neural networks: a review | |
CN111783466A (en) | Named entity identification method for Chinese medical records | |
CN110688855A (en) | Chinese medical entity identification method and system based on machine learning | |
Soysal et al. | Design and evaluation of an ontology based information extraction system for radiological reports | |
CN112241457A (en) | Event detection method for event of affair knowledge graph fused with extension features | |
CN105184053A (en) | Automatic coding method and system for Chinese medical service project information | |
Miao et al. | Extraction of BI-RADS findings from breast ultrasound reports in Chinese using deep learning approaches | |
CN106909783A (en) | Timeline-based method for discovering medical knowledge in medical record text | |
CN111651991B (en) | Medical named entity identification method utilizing multi-model fusion strategy | |
CN110069779A (en) | Symptom entity recognition method and related apparatus for medical text | |
WO2021174695A1 (en) | Machine learning-based medicine recognition method and related device | |
Gao et al. | Named entity recognition method of Chinese EMR based on BERT-BiLSTM-CRF | |
CN110909122A (en) | Information processing method and related equipment | |
CN114528411A (en) | Automatic construction method, device and medium for Chinese medicine knowledge graph | |
CN107122582B (en) | diagnosis and treatment entity identification method and device facing multiple data sources | |
Cing et al. | Improving accuracy of part-of-speech (POS) tagging using hidden markov model and morphological analysis for Myanmar Language | |
Shin et al. | Automatic classification of thyroid findings using static and contextualized ensemble natural language processing systems: development study | |
Wong et al. | isentenizer-: Multilingual sentence boundary detection model | |
CN111368532B (en) | Topic word embedding disambiguation method and system based on LDA |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||