CN110298040A - Control method and control device for annotating and recognizing a Chinese corpus - Google Patents

Control method and control device for annotating and recognizing a Chinese corpus

Info

Publication number
CN110298040A
Authority
CN
China
Prior art keywords
chinese corpus
corpus
chinese
mark
timing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910539314.1A
Other languages
Chinese (zh)
Inventor
嵇艺
徐盛
罗震
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiamen Yifangjian Information Technology Co Ltd
Yijian (shanghai) Information Technology Co Ltd
Original Assignee
Xiamen Yifangjian Information Technology Co Ltd
Yijian (shanghai) Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiamen Yifangjian Information Technology Co Ltd, Yijian (shanghai) Information Technology Co Ltd
Priority to CN201910539314.1A
Publication of CN110298040A
Current legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H70/00 ICT specially adapted for the handling or processing of medical references
    • G16H70/20 ICT specially adapted for the handling or processing of medical references relating to practices or guidelines

Abstract

The invention discloses a control method for annotating and recognizing a Chinese corpus, in which the Chinese corpus is recognized by a corpus model. The method comprises the following steps: a. determining a vectorized Chinese corpus based on a word vector dictionary and the Chinese corpus, the word vector dictionary being determined based on a conventional Chinese corpus and a medical corpus; b. performing sequential correlation analysis on the vectorized Chinese corpus to determine a time-series Chinese corpus; c. performing maximum likelihood inference on the time-series Chinese corpus to determine an annotated Chinese corpus, the annotation information of the annotated Chinese corpus including at least concise annotations and compound annotations. The invention is simple to use, convenient to operate, powerful, practical, and has strong learning ability; it recognizes efficiently and accurately and has high commercial value.

Description

Control method and control device for annotating and recognizing a Chinese corpus
Technical field
The invention belongs to the field of computer applications in medicine and, in particular, relates to a control method and control device for annotating and recognizing a Chinese corpus.
Background
Medical entity annotation that incorporates medical knowledge is a necessary step in the computer analysis of medical information. It can be used in scenarios such as clinical auxiliary diagnosis, hospital guidance, medical record quality control, medical report analysis, and medical data analysis, and its fields of application are very broad. For example, it can be applied to an HIS system to recognize natural-language medical records and assist diagnosis; to online registration and hospital guidance to recognize the information in a patient's self-description; and to a medical record quality control system to recognize the natural-language portions and analyze the information they contain.
In the existing technology, labels such as body part, orientation, symptom, and drug are defined according to medical common knowledge, and annotation rules are formulated. Every sentence in the collected corpus is annotated according to these rules, and the annotated data are then used to train, by machine learning methods, a computer software model that can recognize medical entities.
However, such a technical solution has many defects.
1. The label design is too coarse. For example, only labels with excessively wide coverage, such as disease, symptom, and treatment, are designed. Such a design is too idealized: once negation words, modifiers, and the like occur in the Chinese text, they "split" the content of a label. For example, in "abdomen slightly distended, no tenderness below the xiphoid process or in the abdomen", the negation label <Neg> for the negation word "no" separates "below the xiphoid process and abdomen" from "tenderness".
2. The label design lacks support for a label's scope of action. Label designs carry a strong assumption, namely that medical text is concise and standard. Actual medical records, however, are often limited by the author's command of the language: punctuation is wrong, subjects are arbitrarily omitted, multiple subjects are mixed together, English abbreviations are used frequently, and so on. As a result, scopes extend beyond normal punctuation boundaries and multiple subjects and modifiers are mismatched, which makes the extracted entities hard to understand. For example: "no obvious cause, poor appetite, intermittent vomiting of gastric contents several times, foamy stool, 4 times/day, slightly listless". The symptom modifiers "intermittent", "gastric contents several times", "foamy", and "4 times/day" carry no annotation of the scope they modify, so the subsequent entity normalization analysis has to analyze sentence segmentation again; if punctuation has also been used wrongly, entity normalization errors are easily caused.
3. Existing medical knowledge bases are underused. Medicine has a great many specialized terms, a considerable proportion of which are longer than 5 characters. The existing technology does not consider using such knowledge bases to correct entity recognition. For example, for the ICD10 entry "mumps with other complications", general annotation recognition would recognize only "mumps".
At present, the existing technology offers no technical solution capable of solving the above technical problems; specifically, a control method and control device for annotating and recognizing a Chinese corpus is lacking.
Summary of the invention
In view of the technical defects of the prior art, the object of the present invention is to provide a control method for annotating and recognizing a Chinese corpus, in which the Chinese corpus is recognized by a corpus model, comprising the following steps:
a. determining a vectorized Chinese corpus based on a word vector dictionary and the Chinese corpus, the word vector dictionary being determined based on a conventional Chinese corpus and a medical corpus;
b. performing sequential correlation analysis on the vectorized Chinese corpus to determine a time-series Chinese corpus;
c. performing maximum likelihood inference on the time-series Chinese corpus to determine an annotated Chinese corpus, the annotation information of the annotated Chinese corpus including at least concise annotations and compound annotations.
Preferably, the method further comprises, before step a, step i: performing knowledge base pre-analysis on the Chinese corpus to determine a first Chinese corpus that meets the annotation condition;
correspondingly, in step a, the vectorized Chinese corpus is determined based on the word vector dictionary and a second Chinese corpus, i.e. the Chinese corpus other than the first Chinese corpus;
correspondingly, in step c, the annotated Chinese corpus determined by maximum likelihood inference and the first Chinese corpus from step i are output as the Chinese corpus.
Preferably, in step c, the maximum likelihood inference comprises:
c1: calculating a first value for one or more annotations of the time-series Chinese corpus, taking the sentence as the unit;
c2: calculating a second value for all annotations of the time-series Chinese corpus, taking the sentence as the unit.
Preferably, step c1 is determined based on the following formula:
where X is a Chinese corpus of length N, Y is the annotation, Transition is a T*T matrix, and Emission is an N*T matrix.
Preferably, step c2 includes at least the following formula:
where loss is the loss function, X is a Chinese corpus of length N, and Y is the annotation.
Preferably, the concise annotations include at least: symptom, examination, treatment, drug, disease, negation, cause, medical event, and time.
Preferably, the compound annotations include at least: orientation, anatomy/physiology, clinical manifestation, symptom modifier, examination item, examination result, drug modifier, treatment modifier, disease modifier, and medical history type description.
Preferably, the method further comprises, after step c, step d: outputting one or more annotated Chinese corpora in the form of the annotated Chinese corpus together with the cut points of the annotated Chinese corpus.
According to another aspect of the invention, a control device for annotating and recognizing a Chinese corpus is provided, in which the Chinese corpus is recognized by a corpus model, comprising:
a first determining device, which determines a vectorized Chinese corpus based on a word vector dictionary and the Chinese corpus;
a second determining device, which performs sequential correlation analysis on the vectorized Chinese corpus to determine a time-series Chinese corpus;
a third determining device, which performs maximum likelihood inference on the time-series Chinese corpus to determine an annotated Chinese corpus.
Preferably, the device further comprises a fourth determining device, which performs knowledge base pre-analysis on the Chinese corpus to determine a first Chinese corpus that meets the annotation condition.
Preferably, the third determining device comprises:
a first processing device, which calculates a first value for one or more annotations of the time-series Chinese corpus, taking the sentence as the unit;
a second processing device, which calculates a second value for all annotations of the time-series Chinese corpus, taking the sentence as the unit.
Preferably, the device further comprises:
a third processing device, which outputs one or more annotated Chinese corpora in the form of the annotated Chinese corpus together with the cut points of the annotated Chinese corpus.
The invention discloses a control method and control device for annotating and recognizing a Chinese corpus. The Chinese corpus is recognized by a corpus model: a vectorized Chinese corpus is determined based on a word vector dictionary and the Chinese corpus; sequential correlation analysis is performed on the vectorized Chinese corpus to determine a time-series Chinese corpus; and maximum likelihood inference is performed on the time-series Chinese corpus to determine an annotated Chinese corpus. In the label design, fine-grained and coarse-grained labels coexist. A "vector paragraph" label is added to the label design to assist subsequent entity understanding and work such as sentence segmentation analysis. A medical knowledge base online correction module is added to the maximum likelihood inference of label recognition. In the time-series analysis of sentences, stacked multi-layer LSTMs are used to analyze the forward and backward associations of words and to reflect them in the output. The invention is simple to use, convenient to operate, powerful, practical, and has strong learning ability; it is efficient and accurate and has high commercial value.
Brief description of the drawings
Other features, objects, and advantages of the invention will become more apparent upon reading the detailed description of non-limiting embodiments with reference to the following drawings:
Fig. 1 shows a specific embodiment of the invention, a schematic flowchart of a control method for annotating and recognizing a Chinese corpus;
Fig. 2 shows the first embodiment of the invention, a schematic flowchart of a control method for annotating and recognizing a Chinese corpus;
Fig. 3 shows the second embodiment of the invention, a schematic flowchart of performing maximum likelihood inference on the time-series Chinese corpus to determine the annotated Chinese corpus; and
Fig. 4 shows another embodiment of the invention, a schematic diagram of the module connections of a control device for annotating and recognizing a Chinese corpus.
Detailed description of embodiments
To present the technical solution of the invention more clearly, the invention is further described below with reference to the drawings.
Fig. 1 shows a specific embodiment of the invention, a schematic flowchart of a control method for annotating and recognizing a Chinese corpus. Those skilled in the art will understand that the invention mainly recognizes the Chinese corpus by means of a corpus model, so as to segment the Chinese corpus in a manner that can assist diagnosis, hospital guidance, medical record quality control, medical report analysis, and medical data analysis. It does not itself involve the diagnosis or treatment of disease and therefore belongs to patentable subject matter, which is not elaborated here.
Specifically, through the embodiment shown in Fig. 1, the invention describes a control method for annotating and recognizing a Chinese corpus, comprising the following steps:
First, in step S101, a vectorized Chinese corpus is determined based on a word vector dictionary and the Chinese corpus, the word vector dictionary being determined based on a conventional Chinese corpus and a medical corpus. Those skilled in the art will understand that the word vector dictionary is obtained by joint training on the conventional Chinese corpus and the medical corpus; each word is looked up in the dictionary, and the corresponding sequence of word vectors is fed into the sequential correlation analysis. Word vectors are currently represented as 128-dimensional vectors. In such an embodiment, the Chinese corpus is processed with the word vector dictionary to obtain the vectorized Chinese corpus.
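The dictionary lookup of step S101 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes per-character lookup, uses random vectors in place of jointly trained ones, and the names `build_vector_dictionary`, `vectorize`, and the `<UNK>` fallback entry are hypothetical. The sample strings are back-translated examples, not text taken from the patent.

```python
import random

EMBED_DIM = 128  # the description states that 128-dimensional vectors are used


def build_vector_dictionary(corpora, dim=EMBED_DIM, seed=0):
    """Give every character seen in the corpora a vector.

    Assumption: random vectors stand in for vectors that would be learned
    by joint training on the conventional and medical corpora.
    """
    rng = random.Random(seed)
    vector_dict = {"<UNK>": [0.0] * dim}  # hypothetical fallback entry
    for text in corpora:
        for ch in text:
            if ch not in vector_dict:
                vector_dict[ch] = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    return vector_dict


def vectorize(text, vector_dict):
    """Look up each character and return the sequence of vectors."""
    unknown = vector_dict["<UNK>"]
    return [vector_dict.get(ch, unknown) for ch in text]


conventional_corpus = ["今天天气很好"]
medical_corpus = ["腹稍胀，剑突下及腹部无压痛"]
vector_dict = build_vector_dictionary(conventional_corpus + medical_corpus)
vectors = vectorize("腹部压痛", vector_dict)
print(len(vectors), len(vectors[0]))
```

The resulting list has one 128-dimensional vector per input character and is what the next step would consume.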
Then, in step S102, sequential correlation analysis is performed on the vectorized Chinese corpus to determine the time-series Chinese corpus. In such an embodiment, the invention preferably uses an LSTM model: the vectorized Chinese corpus is taken as input and analyzed in the LSTM model to obtain the time-series Chinese corpus. The purpose of obtaining the time-series Chinese corpus is to better compare the similarity between each word in the Chinese corpus and the annotation information, so that the annotation information closest to the Chinese corpus is chosen for annotation. The invention uses stacked multi-layer LSTMs to analyze the forward and backward associations of words and to reflect them in the output. At the current data volume, two layers are stacked: the first layer is a bidirectional LSTM and the second layer is a unidirectional LSTM.
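The two-layer stack described in step S102 can be sketched in plain Python. This is only an illustration under stated assumptions: weights are random rather than trained, the gate layout and the helper names (`make_lstm`, `run_lstm`, `stacked_lstm`) are hypothetical, and a real system would presumably use a deep learning framework.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_lstm(input_dim, hidden_dim, rng):
    """Random weights for the four gates (input, forget, candidate, output)."""
    cols = input_dim + hidden_dim + 1  # +1 column for the bias term
    return [[[rng.uniform(-0.1, 0.1) for _ in range(cols)]
             for _ in range(hidden_dim)] for _ in range(4)]


def run_lstm(weights, inputs, reverse=False):
    """Run one LSTM layer over a sequence; return one hidden vector per position."""
    hidden_dim = len(weights[0])
    h = [0.0] * hidden_dim
    c = [0.0] * hidden_dim
    steps = list(reversed(inputs)) if reverse else list(inputs)
    outputs = []
    for x in steps:
        z = x + h + [1.0]  # current input, previous hidden state, bias input
        pre = [[sum(w * v for w, v in zip(row, z)) for row in gate] for gate in weights]
        i = [sigmoid(a) for a in pre[0]]
        f = [sigmoid(a) for a in pre[1]]
        g = [math.tanh(a) for a in pre[2]]
        o = [sigmoid(a) for a in pre[3]]
        c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        outputs.append(h)
    return list(reversed(outputs)) if reverse else outputs


def stacked_lstm(inputs, input_dim, hidden_dim, seed=0):
    rng = random.Random(seed)
    forward = make_lstm(input_dim, hidden_dim, rng)
    backward = make_lstm(input_dim, hidden_dim, rng)
    top = make_lstm(2 * hidden_dim, hidden_dim, rng)
    # Layer 1: bidirectional; forward and backward states are concatenated per position.
    layer1 = [a + b for a, b in zip(run_lstm(forward, inputs),
                                    run_lstm(backward, inputs, reverse=True))]
    # Layer 2: unidirectional, reading left to right.
    return run_lstm(top, layer1)


sequence = [[0.1 * (t + d) for d in range(4)] for t in range(5)]
features = stacked_lstm(sequence, input_dim=4, hidden_dim=3)
print(len(features), len(features[0]))
```

Because the first layer reads the sequence in both directions, every position's feature vector depends on the words before and after it, which is the forward and backward association the paragraph refers to.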
Finally, in step S103, maximum likelihood inference is performed on the time-series Chinese corpus to determine the annotated Chinese corpus, the annotation information of which includes at least concise annotations and compound annotations. Those skilled in the art will understand that the invention essentially discloses two kinds of annotation information, namely concise annotations and compound annotations. The concise annotations include at least: symptom, examination, treatment, drug, disease, negation, cause, medical event, and time. The compound annotations include at least: orientation, anatomy/physiology, clinical manifestation, symptom modifier, examination item, examination result, drug modifier, treatment modifier, disease modifier, and medical history type description. For the output of the sequential correlation analysis, the maximum likelihood inference is computed according to the label transition probabilities.
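One common way to pick the most likely label sequence from per-position scores and label transition scores is Viterbi decoding. The patent text does not name the algorithm, so the sketch below is an assumption offered for illustration; the function name `viterbi` and the toy scores are hypothetical.

```python
def viterbi(emission, transition):
    """Return the highest-scoring tag path and its score.

    emission[i][t]   : score of tag t at position i (N x T)
    transition[p][t] : score of moving from tag p to tag t (T x T)
    """
    num_tags = len(emission[0])
    best = list(emission[0])  # best score of any path ending in each tag
    back = []                 # back-pointers, one list per position after the first
    for i in range(1, len(emission)):
        previous = best
        best, pointers = [], []
        for cur in range(num_tags):
            candidates = [previous[p] + transition[p][cur] for p in range(num_tags)]
            p_star = max(range(num_tags), key=candidates.__getitem__)
            best.append(candidates[p_star] + emission[i][cur])
            pointers.append(p_star)
        back.append(pointers)
    last = max(range(num_tags), key=best.__getitem__)
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


# Toy scores for a 3-position sentence and 2 tags (made-up numbers).
emission = [[2.0, 0.0], [0.0, 1.0], [1.0, 0.5]]
transition = [[0.5, -1.0], [-1.0, 0.5]]
path, score = viterbi(emission, transition)
print(path, score)
```

In this toy case the transition scores favor staying on the same tag, so the decoder keeps tag 0 at the second position even though the emission score there prefers tag 1.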
Further, in Chinese natural-language medical records, medical entities such as symptoms, diseases, and examinations are described in various forms. Some are concise, such as "abdominal pain"; others are detailed and combined with other information, such as "paroxysmal cough, worse in the morning and evening" and "obvious tenderness in the left abdomen". For these situations the label design needs both coarse and fine granularity, namely the concise annotations and compound annotations described above.
Further, the manner of maximum likelihood inference is described in more detail in the specific embodiments below and is not elaborated here.
Fig. 2 shows the first embodiment of the invention, a schematic flowchart of a control method for annotating and recognizing a Chinese corpus. As a preferred embodiment of the invention, Fig. 2 shows a more complete technical solution for annotating and recognizing a Chinese corpus, which specifically comprises the following steps:
First, in step S201, knowledge base pre-analysis is performed on the Chinese corpus to determine a first Chinese corpus that meets the annotation condition. Those skilled in the art will understand that medical terms of "high credibility" are found in the input text and marked with the corresponding labels. "High credibility" means that the term reaches a certain length, at least 5 characters in the present invention, and appears in the knowledge base dictionary, for example: mannitol hexanicotinate, hereditary factor VIII deficiency, and so on. Those skilled in the art will understand that, as a necessary step of model training and prediction, the operations of steps S101 to S103 are performed on the Chinese corpus, i.e. embedding, LSTM time-series analysis, maximum likelihood inference, and similar operations are carried out on the annotated text. In this embodiment, knowledge base pre-analysis is performed on the Chinese corpus to determine the first Chinese corpus that meets the annotation condition, and the first Chinese corpus also undergoes the above operations.
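The pre-analysis of step S201 can be sketched as a longest-match dictionary scan that only accepts terms of at least 5 characters. The function name `knowledge_base_preanalysis`, the dictionary layout, and the Chinese strings (back-translations of the mumps example) are assumptions made for illustration.

```python
MIN_TERM_LENGTH = 5  # the description requires terms of at least 5 characters


def knowledge_base_preanalysis(text, knowledge_base):
    """Return (start, end, label) spans for high-credibility terms found in text.

    knowledge_base maps a term to its label (hypothetical layout).
    Longer terms are tried first so the longest match wins.
    """
    terms = sorted((t for t in knowledge_base if len(t) >= MIN_TERM_LENGTH),
                   key=len, reverse=True)
    spans = []
    pos = 0
    while pos < len(text):
        for term in terms:
            if text.startswith(term, pos):
                spans.append((pos, pos + len(term), knowledge_base[term]))
                pos += len(term)
                break
        else:
            pos += 1  # no term starts here; move on by one character
    return spans


knowledge_base = {
    "流行性腮腺炎伴有其他并发症": "disease",  # long ICD10-style term
    "腮腺炎": "disease",                      # too short to count as high credibility
}
text = "诊断为流行性腮腺炎伴有其他并发症。"
spans = knowledge_base_preanalysis(text, knowledge_base)
print(spans)
```

Short terms are deliberately ignored here, since short strings are more likely to match by accident; they are left to the model.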
Those skilled in the art will understand that, in the present invention, embedding, LSTM time-series analysis, maximum likelihood inference, and similar operations are preferably performed on the entire content of the Chinese corpus; if only part of the text were subjected to these operations, the semantic analysis of the text before and after the annotated part would be affected. In one very particular embodiment, however, during knowledge base correction, if the knowledge base pre-analysis has already annotated a span, the label given by the knowledge base pre-analysis prevails directly and no maximum likelihood inference is needed. For such an embodiment, the invention is described in detail through steps S202 to S205 and is not elaborated here.
Then, in step S202, the vectorized Chinese corpus is determined based on the word vector dictionary and the second Chinese corpus, i.e. the Chinese corpus other than the first Chinese corpus. Those skilled in the art will understand that step S202 is similar to step S101; the main difference is that in step S202 the Chinese corpus is preferably divided into the first Chinese corpus and the second Chinese corpus outside the first Chinese corpus. The first Chinese corpus does not need to undergo step S202, that is, it does not need maximum likelihood inference, while the second Chinese corpus goes through the subsequent steps.
Next, in step S203, sequential correlation analysis is performed on the vectorized Chinese corpus to determine the time-series Chinese corpus. For step S203, refer to step S102 above; it is not repeated here.
Subsequently, in step S204, the annotated Chinese corpus determined by maximum likelihood inference and the first Chinese corpus from step i are output as the Chinese corpus. For step S204, refer to step S103 above, with the Chinese corpus replaced by the first Chinese corpus for maximum likelihood inference; finally, the annotated Chinese corpus determined by maximum likelihood inference and the first Chinese corpus from step S201 are output as the Chinese corpus.
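Combining the two sources of labels in step S204 can be sketched as an overwrite: wherever the knowledge base pre-analysis produced a span, its label takes precedence over the model's label. The function name `merge_annotations`, the per-character tag list, and the "O" placeholder for unlabeled characters are hypothetical.

```python
def merge_annotations(model_tags, kb_spans):
    """Overlay knowledge-base spans on per-character model tags.

    model_tags : one tag per character, as inferred by the model
    kb_spans   : (start, end, label) spans from the knowledge base pre-analysis
    """
    merged = list(model_tags)
    for start, end, label in kb_spans:
        for i in range(start, end):
            merged[i] = label  # the knowledge-base label prevails
    return merged


# The model only tagged characters 2-3; the knowledge base covers 0-3.
model_tags = ["O", "O", "disease", "disease", "O"]
kb_spans = [(0, 4, "disease")]
merged = merge_annotations(model_tags, kb_spans)
print(merged)
```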
Finally, in step S205, one or more annotated Chinese corpora are output in the form of the annotated Chinese corpus together with the cut points of the annotated Chinese corpus. To clearly define the "cut points" between consecutive identical labels among the inferred labels, each label can be regarded in the model as two subclasses, A and AE, where AE denotes A-End. Using the punctuation in the natural language as reference positions, the scope of action of each entity is marked, eliminating the ambiguity of understanding caused by the freedom of Chinese.
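The A / A-End scheme can be illustrated with a small decoder that turns per-character tags into entity spans. The tag spelling `drug-End` and the function name `decode_spans` are assumptions; the patent only states that each label has the two subclasses A and AE.

```python
END_SUFFIX = "-End"  # hypothetical spelling of the AE subclass


def decode_spans(tokens, tags):
    """Group tokens into (label, text) spans, closing a span at each A-End tag."""
    spans, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        is_end = tag.endswith(END_SUFFIX)
        base = tag[:-len(END_SUFFIX)] if is_end else tag
        if current and base != label:
            # The label changed without an explicit end tag: close the open span.
            spans.append((label, "".join(current)))
            current = []
        label = base
        current.append(token)
        if is_end:
            spans.append((label, "".join(current)))
            current, label = [], None
    if current:
        spans.append((label, "".join(current)))
    return spans


# Two drug names written back to back, with no punctuation between them.
tokens = list("硫唑嘌呤叶酸")
tags = ["drug", "drug", "drug", "drug-End", "drug", "drug-End"]
entities = decode_spans(tokens, tags)
print(entities)
```

Without the end subclass, the six consecutive "drug" tags would merge into a single entity; the end tag is what marks the cut point between the two drugs.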
Those skilled in the art will understand that the technical solution of the invention is described further below by way of examples. As a preferred embodiment of the invention, the Chinese corpus is "abdomen slightly distended; obvious tenderness below the xiphoid process and in the left abdomen. Muscle tone normal, knee reflex neither diminished nor hyperactive, pathological signs negative." With the annotation recognition of the invention, "abdomen" is anatomy/physiology, "slightly" is a symptom modifier, "distended" is a clinical manifestation, "below the xiphoid process" is anatomy/physiology, "left" is orientation, "abdomen" is anatomy/physiology, "tenderness" is a clinical manifestation, "obvious" is a symptom modifier, "muscle tone" is an examination item, "normal" is an examination result, "knee reflex" is an examination item, "neither diminished nor hyperactive" is an examination result, "pathological signs" is an examination item, and "negative" is an examination result.
As another embodiment of the invention, the Chinese corpus is "no obvious cause, poor appetite, vomiting of gastric contents several times, foamy stool, 4 times/day, slightly listless". Here "no obvious cause" is cause, "appetite" is anatomy/physiology, "poor" is a clinical manifestation, "vomiting" is a symptom, "gastric contents" is a symptom modifier, "several times" is a symptom modifier, "stool" is anatomy/physiology, "foamy" is a clinical manifestation, "4 times/day" is a symptom modifier, "spirit" is anatomy/physiology, "slightly" is a symptom modifier, and "listless" is a symptom modifier.
As another embodiment of the invention, the Chinese corpus is "oral azathioprine, hydrocortisone, folic acid, Kai Sili D, etc. were given." Here "oral" is treatment; "azathioprine", "hydrocortisone", "folic acid", and "Kai Sili D" are drugs; and the full stop "。" is a vector paragraph.
Those skilled in the art will understand that outputting one or more annotated Chinese corpora in the form of the annotated Chinese corpus together with its cut points is widely useful in practice. In a preferred embodiment, the Chinese corpus is "the patient has a gonococcal infection of the lower genitourinary tract without abscess of the periurethral or accessory glands." Under the annotation rules of the preceding embodiments, this would be annotated as "gonococcal infection of the lower genitourinary tract", "abscess", "。". Clearly, this does not fully express the meaning the Chinese corpus is intended to convey. With the improvement made by the technical solution of the invention, it can be annotated as "gonococcal infection of the lower genitourinary tract without abscess of the periurethral or accessory glands", "。". As another example, take "the patient then visited the children's hospital and was diagnosed with mumps with other complications." If annotation relies on model recognition alone, only "mumps" is recognized; with corpus pre-analysis, "mumps with other complications" can be recognized directly as a whole. Recognition of the Chinese corpus is thus corrected using the corpus knowledge base described above, which improves annotation recognition.
Fig. 3 shows the second embodiment of the invention, a schematic flowchart of performing maximum likelihood inference on the time-series Chinese corpus to determine the annotated Chinese corpus, which specifically comprises the following steps:
First, in step S1031, a first value is calculated for one or more annotations of the time-series Chinese corpus, taking the sentence as the unit. Preferably, step S1031 is determined based on the following formula:
where the emission score is computed by a linear layer: a mapping from Xi to Yi, implemented as a dense layer whose input has length 2*LSTM width and whose output has length T, representing the score of each possible tag. For a sentence X of length N and a corresponding annotation Y, the transition score is a T*T matrix (T^2 parameters) and the emission score is an N*T matrix (N*T parameters); from these, the score of a particular annotation of the whole sentence is calculated. The input of C1 has length 2*LSTM width.
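The formula itself is not reproduced in this text. A form consistent with the definitions above is the usual linear-chain score, which adds the emission score of each chosen tag to the transition score between neighboring tags. The sketch below assumes that form; the function name `sequence_score` and the toy numbers are hypothetical.

```python
def sequence_score(emission, transition, tags):
    """Score of one whole-sentence annotation.

    emission   : N x T matrix, emission[i][t] is the score of tag t at position i
    transition : T x T matrix, transition[p][t] is the score of tag p followed by tag t
    tags       : the annotation Y, one tag index per position
    """
    total = emission[0][tags[0]]
    for i in range(1, len(tags)):
        total += transition[tags[i - 1]][tags[i]] + emission[i][tags[i]]
    return total


# N = 3 positions, T = 2 tags (made-up numbers).
emission = [[2.0, 0.0], [0.0, 1.0], [1.0, 0.5]]
transition = [[0.5, -1.0], [-1.0, 0.5]]
first_value = sequence_score(emission, transition, [0, 1, 1])
print(first_value)
```

The emission matrix has N*T entries and the transition matrix T*T entries, matching the parameter counts given in the paragraph.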
Then, in step S1032, a second value is calculated for all annotations of the time-series Chinese corpus, taking the sentence as the unit.
Further, step S1032 includes at least the following formula:
The second value is sum_exp(score). For maximum likelihood inference, the score of a particular whole-sentence annotation is calculated as in part C1, and sum_exp(score) is calculated over all possible annotations; Eki is defined as the sum_exp(score(Y)) over all of the above Yki.
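The sum over all possible annotations grows exponentially with sentence length, but it can be accumulated position by position with the forward recursion. The sketch below assumes the linear-chain score form and pairs the two values into a negative log-likelihood, which is a common choice of loss but is not confirmed by this text; all function names are hypothetical.

```python
import itertools
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of scores."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def sequence_score(emission, transition, tags):
    """First value: score of one whole-sentence annotation."""
    total = emission[0][tags[0]]
    for i in range(1, len(tags)):
        total += transition[tags[i - 1]][tags[i]] + emission[i][tags[i]]
    return total


def log_sum_exp_all(emission, transition):
    """Second value (in log space): sum_exp(score) over all annotations."""
    num_tags = len(emission[0])
    alpha = list(emission[0])  # log-sum of all partial paths ending in each tag
    for row in emission[1:]:
        alpha = [
            log_sum_exp([alpha[p] + transition[p][cur] for p in range(num_tags)]) + row[cur]
            for cur in range(num_tags)
        ]
    return log_sum_exp(alpha)


def negative_log_likelihood(emission, transition, tags):
    """Assumed loss: log of the second value minus the first value."""
    return log_sum_exp_all(emission, transition) - sequence_score(emission, transition, tags)


emission = [[2.0, 0.0], [0.0, 1.0], [1.0, 0.5]]
transition = [[0.5, -1.0], [-1.0, 0.5]]
# Brute-force check: enumerate all 2**3 annotations of the toy sentence.
brute = log_sum_exp([sequence_score(emission, transition, y)
                     for y in itertools.product(range(2), repeat=3)])
loss = negative_log_likelihood(emission, transition, [0, 0, 0])
print(abs(brute - log_sum_exp_all(emission, transition)) < 1e-9, loss > 0)
```

Minimizing this loss raises the score of the reference annotation relative to every other annotation of the same sentence.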
Fig. 4 shows another embodiment of the invention, a schematic diagram of the module connections of a control device for annotating and recognizing a Chinese corpus, in which the Chinese corpus is recognized by a corpus model. The device comprises a first determining device, which determines a vectorized Chinese corpus based on a word vector dictionary and the Chinese corpus; for its working principle, refer to step S101 above, which is not repeated here.
Further, the device comprises a second determining device, which performs sequential correlation analysis on the vectorized Chinese corpus to determine a time-series Chinese corpus; for its working principle, refer to step S102 above, which is not repeated here.
Further, the device comprises a third determining device, which performs maximum likelihood inference on the time-series Chinese corpus to determine an annotated Chinese corpus; for its working principle, refer to step S103 above, which is not repeated here.
Further, the device comprises a fourth determining device, which performs knowledge base pre-analysis on the Chinese corpus to determine a first Chinese corpus that meets the annotation condition; for its working principle, refer to step S201 above, which is not repeated here.
Further, the third determining device comprises a first processing device, which calculates a first value for one or more annotations of the time-series Chinese corpus, taking the sentence as the unit; for its working principle, refer to step S1031 above, which is not repeated here.
Further, the third determining device comprises a second processing device, which calculates a second value for all annotations of the time-series Chinese corpus, taking the sentence as the unit; for its working principle, refer to step S1032 above, which is not repeated here.
Further, the device comprises a third processing device, which outputs one or more annotated Chinese corpora in the form of the annotated Chinese corpus together with the cut points of the annotated Chinese corpus; for its working principle, refer to step S205 above, which is not repeated here.
Specific embodiments of the invention have been described above. It should be understood that the invention is not limited to the particular embodiments described; those skilled in the art can make various variations or modifications within the scope of the claims without affecting the substance of the invention.

Claims (13)

1. a kind of control method that identification is labeled to Chinese corpus identifies Chinese corpus by corpus model, It is characterized by comprising the following steps:
A. determine that vector quantization Chinese corpus, the word vector dictionary are based on conventional Chinese based on word vector dictionary and Chinese corpus Corpus and medicine corpus determine;
B. vector quantization Chinese corpus is subjected to sequential correlation analysis, determines timing Chinese corpus;
C. timing Chinese corpus is subjected to maximum likelihood and infers that the Chinese corpus of determining mark, the mark for marking Chinese corpus are believed Breath, which includes at least, simplifies mark and compound mark.
2. control method according to claim 1, which is characterized in that further include step i: centering before the step a Literary corpus carries out knowledge base preanalysis, is determined for compliance with the first Chinese corpus of mark condition.
3. control method according to claim 2, which is characterized in that in the step a, based on word vector dictionary and The second Chinese corpus in addition to the first Chinese corpus determines vector quantization Chinese corpus;
Correspondingly, in the step c, first in the Chinese corpus of determining mark and step i will be inferred by maximum likelihood Chinese corpus is as Chinese corpus output.
4. The control method according to claim 1, characterized in that, in step c, the maximum-likelihood inference comprises:
c1: calculating, sentence by sentence, a first value for one or more annotations of the sequential Chinese corpus;
c2: calculating, sentence by sentence, a second value for all annotations of the sequential Chinese corpus.
5. The control method according to claim 4, characterized in that step c1 is determined based on the following formula:
[formula not reproduced]
wherein X is a Chinese corpus of length N, Y is the annotation, Transition is a T×T matrix, and Emission is an N×T matrix.
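For orientation, a commonly used linear-chain CRF path score that fits the matrix shapes stated in this claim is sketched here. It is an assumption and may differ from the claimed formula; T is taken to be the number of distinct labels, and the index y_i (the label at position i) is introduced only for this sketch.

```latex
\mathrm{score}(X, Y) \;=\; \sum_{i=1}^{N} \mathrm{Emission}_{i,\,y_i} \;+\; \sum_{i=2}^{N} \mathrm{Transition}_{y_{i-1},\,y_i}
```

In this form, the Emission term scores each label against its own position in the sentence, and the Transition term scores each pair of adjacent labels.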
6. The control method according to claim 4, characterized in that step c2 includes at least the following formula:
[formula not reproduced]
wherein loss is the loss function, X is a Chinese corpus of length N, and Y is the annotation.
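One way to read steps c1 and c2 is as the two quantities in a linear-chain CRF likelihood: the score of a single label path, and the log of the summed scores over all label paths, with the loss being their difference. The sketch below follows that reading as an assumption, not as the patent's formulas; the function names and the toy matrices are hypothetical.

```python
import math
from itertools import product
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def _logsumexp(values: Sequence[float]) -> float:
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def path_score(emission: Matrix, transition: Matrix, tags: Sequence[int]) -> float:
    """Score of one label path: per-position emissions plus adjacent-pair transitions."""
    score = sum(emission[i][t] for i, t in enumerate(tags))
    score += sum(transition[a][b] for a, b in zip(tags, tags[1:]))
    return score


def log_partition(emission: Matrix, transition: Matrix) -> float:
    """Log of the summed exp-scores over all T**N paths, via the forward algorithm."""
    num_tags = len(transition)
    alpha = list(emission[0])
    for row in emission[1:]:
        alpha = [
            row[j] + _logsumexp([alpha[i] + transition[i][j] for i in range(num_tags)])
            for j in range(num_tags)
        ]
    return _logsumexp(alpha)


def crf_loss(emission: Matrix, transition: Matrix, tags: Sequence[int]) -> float:
    """Negative log-likelihood of the given path: log Z minus its score."""
    return log_partition(emission, transition) - path_score(emission, transition, tags)


emission = [[1.0, 0.0], [0.5, 1.5], [0.0, 2.0]]  # N = 3 positions, T = 2 labels
transition = [[0.5, -0.5], [-1.0, 1.0]]
gold = [0, 1, 1]

# Cross-check the forward algorithm against explicit enumeration of all paths.
brute = math.log(sum(math.exp(path_score(emission, transition, p))
                     for p in product(range(2), repeat=3)))
assert abs(brute - log_partition(emission, transition)) < 1e-9
print(crf_loss(emission, transition, gold))
```

The forward recursion avoids enumerating all T**N paths, which is why the enumeration appears only as a check on a tiny example.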
7. The control method according to claim 1, characterized in that the simplified annotations include at least: symptom, examination, treatment, drug, disease, negation, cause of disease, medical event, and time.
8. The control method according to claim 1, characterized in that the compound annotations include at least: orientation, anatomy/physiology, clinical manifestation, symptom modifier, examination item, examination result, drug modifier, treatment modifier, disease modifier, and medical history type description.
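The two label sets in claims 7 and 8 could be held as plain lists and expanded into a tag vocabulary for a sequence labeler. The sketch below assumes a BIO tagging scheme, which the claims do not specify; the English identifiers and the function name `build_tag_vocab` are hypothetical.

```python
from typing import Dict, List

SIMPLIFIED_LABELS: List[str] = [
    "symptom", "examination", "treatment", "drug", "disease",
    "negation", "cause", "medical_event", "time",
]
COMPOUND_LABELS: List[str] = [
    "orientation", "anatomy_physiology", "clinical_manifestation",
    "symptom_modifier", "examination_item", "examination_result",
    "drug_modifier", "treatment_modifier", "disease_modifier",
    "medical_history_type",
]


def build_tag_vocab(labels: List[str]) -> Dict[str, int]:
    """Expand each label into B-/I- tags (BIO scheme, an assumption) plus 'O'."""
    tags = ["O"]
    for label in labels:
        tags.append(f"B-{label}")
        tags.append(f"I-{label}")
    return {tag: idx for idx, tag in enumerate(tags)}


TAG_TO_ID = build_tag_vocab(SIMPLIFIED_LABELS + COMPOUND_LABELS)
print(len(TAG_TO_ID))
```

With such a vocabulary, the size of the tag set would play the role of T in a T×T transition matrix.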
9. The control method according to claim 1, characterized in that, after step c, it further comprises step d: outputting one or more annotated Chinese corpora in the form of the annotated Chinese corpus and the split points of the annotated Chinese corpus.
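The output form of step d, annotated text plus split points, can be sketched as a conversion from per-character tags to labeled segments and the character offsets at which each segment ends. The BIO tag format, the offset convention, and the function name `to_output` are assumptions made for this illustration.

```python
from typing import List, Tuple


def to_output(text: str, tags: List[str]) -> Tuple[List[Tuple[str, str]], List[int]]:
    """Group characters into labeled segments and record where each segment ends."""
    segments: List[Tuple[str, str]] = []
    cut_points: List[int] = []
    start = 0
    for i in range(1, len(text) + 1):
        # A segment closes at the end of the text or when the next tag does not continue it.
        if i == len(text) or not tags[i].startswith("I-"):
            first_tag = tags[start]
            label = first_tag.split("-", 1)[1] if "-" in first_tag else "O"
            segments.append((text[start:i], label))
            cut_points.append(i)
            start = i
    return segments, cut_points


segments, cut_points = to_output("头痛三天", ["B-symptom", "I-symptom", "B-time", "I-time"])
print(segments)
print(cut_points)
```

Keeping the split points as offsets into the original text lets a consumer recover the segmentation without re-parsing the tags.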
10. A control device for annotating and recognizing a Chinese corpus, in which the Chinese corpus is recognized by a corpus model, comprising:
a first determining device (1): determining a vectorized Chinese corpus based on a word vector dictionary and the Chinese corpus;
a second determining device (2): performing sequential correlation analysis on the vectorized Chinese corpus to determine a sequential Chinese corpus;
a third determining device (3): performing maximum-likelihood inference on the sequential Chinese corpus to determine an annotated Chinese corpus.
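In a CRF-style model, one common way for the third determining device to pick the annotation is Viterbi decoding, which returns the highest-scoring label path given emission and transition scores. This is offered as a sketch under that assumption; the patent text here does not name the decoding algorithm, and the function name and matrices are hypothetical.

```python
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def viterbi(emission: Matrix, transition: Matrix) -> List[int]:
    """Return the label path with the highest emission-plus-transition score."""
    num_tags = len(transition)
    best = list(emission[0])          # best score of any path ending in each label
    backpointers: List[List[int]] = []
    for row in emission[1:]:
        step_best: List[float] = []
        step_back: List[int] = []
        for j in range(num_tags):
            prev = max(range(num_tags), key=lambda i: best[i] + transition[i][j])
            step_back.append(prev)
            step_best.append(best[prev] + transition[prev][j] + row[j])
        best = step_best
        backpointers.append(step_back)
    # Trace back from the best final label to recover the whole path.
    path = [max(range(num_tags), key=lambda i: best[i])]
    for step_back in reversed(backpointers):
        path.append(step_back[path[-1]])
    path.reverse()
    return path


emission = [[1.0, 0.0], [0.5, 1.5], [0.0, 2.0]]  # N = 3 positions, T = 2 labels
transition = [[0.5, -0.5], [-1.0, 1.0]]
print(viterbi(emission, transition))
```

The returned indices would then be mapped back to label names to produce the annotated corpus.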
11. The control device according to claim 10, characterized by further comprising a fourth determining device (4): performing knowledge-base pre-analysis on the Chinese corpus to determine a first Chinese corpus that meets the annotation condition.
12. The control device according to claim 10, characterized in that the third determining device (3) comprises:
a first processing unit (31): calculating, sentence by sentence, a first value for one or more annotations of the sequential Chinese corpus;
a second processing unit (32): calculating, sentence by sentence, a second value for all annotations of the sequential Chinese corpus.
13. The control device according to claim 10, characterized by further comprising:
a third processing unit (5): outputting one or more annotated Chinese corpora in the form of the annotated Chinese corpus and the split points of the annotated Chinese corpus.
CN201910539314.1A 2019-06-20 2019-06-20 Control method and control device for annotating and recognizing a Chinese corpus Pending CN110298040A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910539314.1A CN110298040A (en) 2019-06-20 2019-06-20 Control method and control device for annotating and recognizing a Chinese corpus

Publications (1)

Publication Number Publication Date
CN110298040A true CN110298040A (en) 2019-10-01

Family

ID=68028502

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910539314.1A Pending CN110298040A (en) 2019-06-20 2019-06-20 Control method and control device for annotating and recognizing a Chinese corpus

Country Status (1)

Country Link
CN (1) CN110298040A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105894088A (en) * 2016-03-25 2016-08-24 苏州赫博特医疗信息科技有限公司 Medical information extraction system and method based on deep learning and distributed semantic features
CN107977361A (en) * 2017-12-06 2018-05-01 哈尔滨工业大学深圳研究生院 Chinese clinical medical entity recognition method based on deep semantic information representation
CN108897989A (en) * 2018-06-06 2018-11-27 大连理工大学 Biological event extraction method based on a candidate event element attention mechanism
CN109471895A (en) * 2018-10-29 2019-03-15 清华大学 Electronic medical record phenotype extraction and phenotype name normalization method and system
CN109670177A (en) * 2018-12-20 2019-04-23 翼健(上海)信息科技有限公司 Control method and control device for medical semantic normalization based on LSTM

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ROLAND ROLLER et al.: "A fine-grained corpus annotation schema of German nephrology records", PROCEEDINGS OF THE CLINICAL NATURAL LANGUAGE PROCESSING WORKSHOP, pages 69 *
YING QIN et al.: "Research of Clinical Named Entity Recognition Based on Bi-LSTM-CRF", JOURNAL OF SHANGHAI JIAOTONG UNIVERSITY (SCIENCE), pages 392 *

Similar Documents

Publication Publication Date Title
CN110032648B (en) Medical record structured analysis method based on medical field entity
CN112214995B (en) Hierarchical multitasking term embedded learning for synonym prediction
CN112597774B (en) Chinese medical named entity recognition method, system, storage medium and equipment
CN108628824A (en) A kind of entity recognition method based on Chinese electronic health record
CN110705293A (en) Electronic medical record text named entity recognition method based on pre-training language model
CN106844351B (en) Medical institution organization entity identification method and device oriented to multiple data sources
Alhumoud et al. Arabic sentiment analysis using recurrent neural networks: a review
CN111783466A (en) Named entity identification method for Chinese medical records
CN110688855A (en) Chinese medical entity identification method and system based on machine learning
Soysal et al. Design and evaluation of an ontology based information extraction system for radiological reports
CN112241457A (en) Event detection method for event of affair knowledge graph fused with extension features
CN105184053A (en) Automatic coding method and system for Chinese medical service project information
Miao et al. Extraction of BI-RADS findings from breast ultrasound reports in Chinese using deep learning approaches
CN106909783A (en) A kind of case history textual medical Methods of Knowledge Discovering Based based on timeline
CN111651991B (en) Medical named entity identification method utilizing multi-model fusion strategy
CN110069779A (en) The symptom entity recognition method and relevant apparatus of medical text
WO2021174695A1 (en) Machine learning-based medicine recognition method and related device
Gao et al. Named entity recognition method of Chinese EMR based on BERT-BiLSTM-CRF
CN110909122A (en) Information processing method and related equipment
CN114528411A (en) Automatic construction method, device and medium for Chinese medicine knowledge graph
CN107122582B (en) diagnosis and treatment entity identification method and device facing multiple data sources
Cing et al. Improving accuracy of part-of-speech (POS) tagging using hidden markov model and morphological analysis for Myanmar Language
Shin et al. Automatic classification of thyroid findings using static and contextualized ensemble natural language processing systems: development study
Wong et al. iSentenizer-μ: Multilingual sentence boundary detection model
CN111368532B (en) Topic word embedding disambiguation method and system based on LDA

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination