Summary of the invention
Embodiments of the present invention provide a multi-data-source-oriented method and device for recognizing symptom and sign entities, to solve the problem that current entity recognition schemes cannot accurately recognize symptom and sign entities.
To achieve the above objective, the present invention adopts the following technical solution:
A multi-data-source-oriented symptom and sign entity recognition method, comprising:
obtaining a sentence to be processed from raw data;
performing single-character segmentation on the sentence to be processed to determine each character in the sentence;
determining, according to a pre-trained CRF model, the entity tag of each character in the sentence to be processed, and determining the entity tag sequence of the sentence;
determining a first group of candidate entities of the sentence to be processed according to its entity tag sequence;
performing term segmentation on the sentence to be processed according to a preset symptom and sign term segmentation strategy, to determine a second group of candidate entities;
screening the candidate entities according to the end character of each candidate entity in the first and second groups of candidate entities, to form a first group of symptom and sign candidate entities and a second group of symptom and sign candidate entities respectively;
if the first group of symptom and sign candidate entities and the second group of symptom and sign candidate entities are not identical, determining the symptom and sign entity result from the two groups according to a preset determination strategy.
Specifically, determining the symptom and sign entity result from the first group of symptom and sign candidate entities and the second group of symptom and sign candidate entities according to the preset determination strategy comprises:
determining whether the sentence to be processed was segmented by preset segmentation rules during term segmentation;
if the sentence was segmented by the preset segmentation rules during term segmentation, selecting the candidate entities in the second group of symptom and sign candidate entities as the symptom and sign entity result;
if the sentence was not segmented by the preset segmentation rules during term segmentation, selecting the candidate entities in the first group of symptom and sign candidate entities as the symptom and sign entity result;
or, between the first and second groups of symptom and sign candidate entities derived from the same original character string of the sentence to be processed, selecting as the symptom and sign entity result the group that has fewer entities and whose entities contain more characters;
wherein the entity types in the symptom and sign entity result include symptom entities and sign entities;
when the entity types of corresponding entities in the first and second groups of symptom and sign candidate entities are inconsistent, selecting the entity type of the entity in the second group of candidate entities as the entity type of the corresponding entity.
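The determination strategy above can be sketched in Python as follows. This is a minimal illustration, not the claimed implementation: the `Entity` record, the function names, the treatment of "corresponding" entities as entities with the same character span, and the tie-breaking in the second strategy (entity count first, then total character count, ties going to the second group) are all assumptions made for the example.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Entity:
    text: str    # surface string of the entity
    start: int   # index of its first character in the sentence
    end: int     # index of its last character (inclusive)
    etype: str   # "symptom" or "sign"


def reconcile_types(chosen, term_group):
    """Where both groups contain an entity with the same span, use the type
    from the second (term-segmentation) group."""
    by_span = {(e.start, e.end): e.etype for e in term_group}
    return [replace(e, etype=by_span.get((e.start, e.end), e.etype)) for e in chosen]


def choose_by_rule(crf_group, term_group, rule_split_used):
    """Strategy 1: take the second group if the preset segmentation rules
    were applied during term segmentation, otherwise take the first group."""
    chosen = term_group if rule_split_used else crf_group
    return reconcile_types(chosen, term_group)


def choose_by_span(crf_group, term_group):
    """Strategy 2: take the group with fewer entities and more characters.
    Assumed ordering: entity count first, then total characters."""
    def key(group):
        return (len(group), -sum(len(e.text) for e in group))
    chosen = min((term_group, crf_group), key=key)  # ties go to term_group
    return reconcile_types(chosen, term_group)


crf_group = [Entity("压痛", 1, 2, "symptom"), Entity("反跳痛", 4, 6, "sign")]
term_group = [Entity("压痛", 1, 2, "sign")]
print(choose_by_rule(crf_group, term_group, rule_split_used=False))
```

In the usage lines, the first group is selected, but the type of the overlapping entity is taken from the second group, as the last clause of the strategy requires.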
Specifically, the raw data include electronic medical record data, settlement form data, clinical research data, medical knowledge base data, and journal literature data.
Specifically, determining the entity tag of each character in the sentence to be processed according to the pre-trained CRF model, and determining the entity tag sequence of the sentence, comprises:
extracting, from a preset corpus, the CRF statistical feature values of each character in the sentence to be processed; the preset corpus records each sentence in the raw data, the entities in each sentence, and the position and entity category of each entity in its sentence; the CRF statistical feature values include the word-segmentation feature value, part-of-speech feature value, character feature value, context feature value, and term-list feature value of each character in each sentence;
determining a training model according to the CRF statistical feature values of each character in each sentence; the training model is:
$$p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{j=1}^{n} \sum_{i=1}^{m} \lambda_i f_i(y_j, y_{j-1}, x) \right)$$
calculating the entity tag $y_j$ of each character in the sentence to be processed according to the training model;
combining the entity tags of the characters to form the entity tag sequence of the sentence to be processed; wherein $x$ denotes the sentence to be processed; $y_j$ denotes the entity tag of the character at position $j$ in the sentence; $f_i(y_j, y_{j-1}, x)$ denotes the value of word-segmentation feature $i$ in the sentence; $\lambda_i$ is a model parameter; $m$ is the number of word-segmentation features; $n$ is the number of characters in the sentence; $Z(x)$ is the normalization factor; and $p(y \mid x)$ is the tagging probability of the characters in the sentence to be processed.
Specifically, determining the first group of candidate entities of the sentence to be processed according to its entity tag sequence comprises:
determining the word-segmentation feature value corresponding to each character in the entity tag sequence, and determining the first group of candidate entities of the sentence to be processed according to the word-segmentation feature values.
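As an illustration of how candidate entities can be read off an entity tag sequence, the sketch below decodes a sequence over the tags B (begin), I (inside), E (end), and O (outside), which is the tag alphabet used in the example sequence "OBEOBIEOBIEOOOO" of this description. The function name and the handling of malformed sequences (an E without a preceding B is ignored) are assumptions; the example sentence is the assumed Chinese original of "no tenderness or rebound tenderness; bowel sounds 4 times/min".

```python
def decode_entities(sentence, tags):
    """Return the substrings of `sentence` covered by B...E tag runs."""
    if len(sentence) != len(tags):
        raise ValueError("sentence and tag sequence must have the same length")
    entities = []
    start = None  # index of the most recent unmatched B tag
    for i, tag in enumerate(tags):
        if tag == "B":
            start = i
        elif tag == "E" and start is not None:
            entities.append(sentence[start:i + 1])
            start = None
        elif tag == "O":
            start = None  # an O tag closes any unfinished run
        # tag "I" simply continues the current run
    return entities


print(decode_entities("无压痛及反跳痛,肠鸣音4次/分", "OBEOBIEOBIEOOOO"))
# → ['压痛', '反跳痛', '肠鸣音']
```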
Further, the multi-data-source-oriented symptom and sign entity recognition method also comprises:
when the sentence to be processed is not annotated in the preset corpus, determining the uncertainty value of each entity in the sentence to be processed according to a formula (not reproduced in this text); wherein $IE_k$ is the uncertainty value of the $k$-th entity; $k_{start}$ is the start position of the entity tags of the $k$-th entity; $k_{end}$ is the end position of the entity tags of the $k$-th entity; and the remaining term of the formula is the probability that the character at position $s$ in the sentence takes the $j$-th entity tag;
matching the entities in the sentence to be processed whose uncertainty value is 1 against a preset symptom and sign ontology library, and, if the match succeeds, saving the entity tags of the matched entities;
determining the prediction confidence of the sentence to be processed and its proportion of dictionary-matched entities;
adding to the corpus each sentence to be processed whose prediction confidence is greater than a preset confidence threshold and whose proportion of dictionary-matched entities is greater than a preset proportion threshold, so as to update the corpus;
wherein the prediction confidence is the product of the tagging probabilities of the characters in the sentence to be processed;
the proportion of dictionary-matched entities is $C / B$, wherein $C$ is the number of entities predicted in the sentence to be processed that appear in the preset dictionary, and $B$ is the total number of entities predicted in the sentence to be processed.
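The two corpus-update criteria can be sketched as follows. The function names and the default threshold values are hypothetical; the text only states that preset thresholds exist, not what they are.

```python
import math


def prediction_confidence(tag_probs):
    """Product of the tagging probabilities of all characters in the sentence."""
    return math.prod(tag_probs)


def dictionary_match_ratio(predicted_entities, dictionary):
    """C / B: predicted entities found in the dictionary over all predicted entities."""
    if not predicted_entities:
        return 0.0  # assumption: a sentence with no predicted entity never qualifies
    c = sum(1 for entity in predicted_entities if entity in dictionary)
    return c / len(predicted_entities)


def should_add_to_corpus(tag_probs, predicted_entities, dictionary,
                         conf_threshold=0.5, ratio_threshold=0.8):
    """Both values must exceed their (hypothetical) thresholds."""
    return (prediction_confidence(tag_probs) > conf_threshold
            and dictionary_match_ratio(predicted_entities, dictionary) > ratio_threshold)


print(should_add_to_corpus([0.9, 0.9], ["压痛", "反跳痛"], {"压痛", "反跳痛"}))
```

Because the confidence is a product over all characters, it shrinks quickly with sentence length, so a practical threshold would likely need to take sentence length into account.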
Specifically, performing term segmentation on the sentence to be processed according to the preset symptom and sign term segmentation strategy, to determine the second group of candidate entities, comprises:
converting the punctuation marks in the sentence to be processed to half-width, and unifying English letters to upper case;
calling a preset non-medical term table, checking whether the original character string of the sentence to be processed contains terms from the non-medical term table, and deleting any such terms from the sentence to form a preprocessed sentence to be processed;
matching the preprocessed sentence against a preset symptom and sign database using the reverse maximum matching principle, extracting as preliminary entities the character strings in the preprocessed sentence that match a standard term name or synonym in the symptom and sign database, and taking the term type corresponding to the standard term name or synonym as the entity type of the preliminary entity;
matching the original character string of the preprocessed sentence against a preset sentence pattern database;
if the original character string of the preprocessed sentence matches a sentence pattern format in the preset sentence pattern database, matching the original character string of the preprocessed sentence against a preset disease ontology database using the reverse maximum matching principle, extracting as preliminary entities the character strings that match a standard term name or synonym in the disease ontology database, and taking the term type corresponding to the standard term name or synonym as the entity type of the preliminary entity;
taking the preliminary entities in the preprocessed sentence as the second group of candidate entities.
Specifically, screening the candidate entities according to the end character of each candidate entity in the first and second groups of candidate entities, to form the first group of symptom and sign candidate entities and the second group of symptom and sign candidate entities respectively, comprises:
judging whether the end character of each candidate entity in the first and second groups of candidate entities is a preset non-symptom-and-sign term character;
if the end character of a candidate entity is a preset non-symptom-and-sign term character, discarding that candidate entity.
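The end-character screening amounts to a simple filter, sketched below. The set of non-symptom-and-sign end characters shown here is purely illustrative; the text does not list the actual preset characters.

```python
# Hypothetical end characters that mark a term as something other than a
# symptom or sign (here: endings typical of disease names).
NON_SYMPTOM_SIGN_END_CHARS = {"炎", "癌"}


def screen_by_end_character(candidates, excluded_end_chars=NON_SYMPTOM_SIGN_END_CHARS):
    """Keep only candidates whose last character is not in the excluded set."""
    return [c for c in candidates if c and c[-1] not in excluded_end_chars]


print(screen_by_end_character(["头痛", "肺炎", "压痛"]))
# → ['头痛', '压痛']
```

The same filter is applied to both candidate groups independently, yielding the two screened groups that are compared afterwards.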
A multi-data-source-oriented symptom and sign entity recognition device, comprising:
a to-be-processed sentence acquiring unit, configured to obtain a sentence to be processed from raw data;
a single-character segmentation unit, configured to perform single-character segmentation on the sentence to be processed and determine each character in the sentence;
an entity tag sequence determining unit, configured to determine, according to a pre-trained CRF model, the entity tag of each character in the sentence to be processed, and to determine the entity tag sequence of the sentence;
a first candidate entity group determining unit, configured to determine the first group of candidate entities of the sentence to be processed according to its entity tag sequence;
a second candidate entity group determining unit, configured to perform term segmentation on the sentence to be processed according to a preset symptom and sign term segmentation strategy and determine the second group of candidate entities;
a candidate entity screening unit, configured to screen the candidate entities according to the end character of each candidate entity in the first and second groups of candidate entities, to form a first group of symptom and sign candidate entities and a second group of symptom and sign candidate entities respectively;
a symptom and sign entity result determining unit, configured to, when the first group of symptom and sign candidate entities and the second group of symptom and sign candidate entities are not identical, determine the symptom and sign entity result from the two groups according to a preset determination strategy.
Specifically, the symptom and sign entity result determining unit comprises:
a term segmentation judging module, configured to determine whether the sentence to be processed was segmented by preset segmentation rules during term segmentation;
a symptom and sign entity result determining module, configured to select the candidate entities in the second group of symptom and sign candidate entities as the symptom and sign entity result when the sentence to be processed was segmented by the preset segmentation rules during term segmentation, and to select the candidate entities in the first group of symptom and sign candidate entities as the symptom and sign entity result when it was not;
the symptom and sign entity result determining module is further configured to select as the symptom and sign entity result, between the first and second groups of symptom and sign candidate entities derived from the same original character string of the sentence to be processed, the group that has fewer entities and whose entities contain more characters; the entity types in the symptom and sign entity result include symptom entities and sign entities;
an entity type determining module, configured to select, when the entity types of corresponding entities in the first and second groups of symptom and sign candidate entities are inconsistent, the entity type of the entity in the second group of candidate entities as the entity type of the corresponding entity.
Specifically, the raw data in the to-be-processed sentence acquiring unit include electronic medical record data, settlement form data, clinical research data, medical knowledge base data, and journal literature data.
Further, the entity tag sequence determining unit comprises:
a CRF statistical feature value extraction module, configured to extract, from a preset corpus, the CRF statistical feature values of each character in the sentence to be processed; the preset corpus records each sentence in the raw data, the entities in each sentence, and the position and entity category of each entity in its sentence; the CRF statistical feature values include the word-segmentation feature value, part-of-speech feature value, character feature value, context feature value, and term-list feature value of each character in each sentence;
a training model determining module, configured to determine a training model according to the CRF statistical feature values of each character in each sentence; the training model is:
$$p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{j=1}^{n} \sum_{i=1}^{m} \lambda_i f_i(y_j, y_{j-1}, x) \right)$$
an entity tag calculation module, configured to calculate the entity tag $y_j$ of each character in the sentence to be processed according to the training model;
an entity tag sequence determining module, configured to combine the entity tags of the characters to form the entity tag sequence of the sentence to be processed; wherein $x$ denotes the sentence to be processed; $y_j$ denotes the entity tag of the character at position $j$ in the sentence; $f_i(y_j, y_{j-1}, x)$ denotes the value of word-segmentation feature $i$ in the sentence; $\lambda_i$ is a model parameter; $m$ is the number of word-segmentation features; $n$ is the number of characters in the sentence; $Z(x)$ is the normalization factor; and $p(y \mid x)$ is the tagging probability of the characters in the sentence to be processed.
In addition, the first candidate entity group determining unit is specifically configured to:
determine the word-segmentation feature value corresponding to each character in the entity tag sequence, and determine the first group of candidate entities of the sentence to be processed according to the word-segmentation feature values.
Further, the multi-data-source-oriented symptom and sign entity recognition device also comprises a corpus updating unit, configured to:
when the sentence to be processed is not annotated in the preset corpus, determine the uncertainty value of each entity in the sentence to be processed according to a formula (not reproduced in this text); wherein $IE_k$ is the uncertainty value of the $k$-th entity; $k_{start}$ is the start position of the entity tags of the $k$-th entity; $k_{end}$ is the end position of the entity tags of the $k$-th entity; and the remaining term of the formula is the probability that the character at position $s$ in the sentence takes the $j$-th entity tag;
match the entities in the sentence to be processed whose uncertainty value is 1 against a preset symptom and sign ontology library, and, when the match succeeds, save the entity tags of the matched entities;
determine the prediction confidence of the sentence to be processed and its proportion of dictionary-matched entities;
add to the corpus each sentence to be processed whose prediction confidence is greater than a preset confidence threshold and whose proportion of dictionary-matched entities is greater than a preset proportion threshold, so as to update the corpus;
wherein the prediction confidence is the product of the tagging probabilities of the characters in the sentence to be processed;
the proportion of dictionary-matched entities is $C / B$, wherein $C$ is the number of entities predicted in the sentence to be processed that appear in the preset dictionary, and $B$ is the total number of entities predicted in the sentence to be processed.
In addition, the second candidate entity group determining unit comprises:
a preprocessing module, configured to convert the punctuation marks in the sentence to be processed to half-width and unify English letters to upper case; and to call a preset non-medical term table, check whether the original character string of the sentence to be processed contains terms from the non-medical term table, and delete any such terms from the sentence to form a preprocessed sentence to be processed;
a symptom and sign ontology library matching module, configured to match the preprocessed sentence against a preset symptom and sign database using the reverse maximum matching principle, extract as preliminary entities the character strings in the preprocessed sentence that match a standard term name or synonym in the symptom and sign database, and take the term type corresponding to the standard term name or synonym as the entity type of the preliminary entity; to match the original character string of the preprocessed sentence against a preset sentence pattern database; and, if the original character string of the preprocessed sentence matches a sentence pattern format in the preset sentence pattern database, to match the original character string against a preset disease ontology database using the reverse maximum matching principle, extract as preliminary entities the character strings that match a standard term name or synonym in the disease ontology database, and take the term type corresponding to the standard term name or synonym as the entity type of the preliminary entity;
a second candidate entity group determining module, configured to take the preliminary entities in the preprocessed sentence as the second group of candidate entities.
In addition, the candidate entity screening unit comprises:
a non-symptom-and-sign term character judging module, configured to judge whether the end character of each candidate entity in the first and second groups of candidate entities is a preset non-symptom-and-sign term character;
a candidate entity discarding module, configured to discard a candidate entity when its end character is a preset non-symptom-and-sign term character.
In the multi-data-source-oriented symptom and sign entity recognition method and device provided by the embodiments of the present invention, a sentence to be processed is first obtained from raw data; single-character segmentation is performed on the sentence to determine each character in it; the entity tag of each character is determined according to a pre-trained CRF model, and the entity tag sequence of the sentence is determined; the first group of candidate entities is determined according to the entity tag sequence. Then, term segmentation is performed on the sentence according to a preset symptom and sign term segmentation strategy to determine the second group of candidate entities; the candidate entities are screened according to the end character of each candidate entity in the two groups, forming a first group and a second group of symptom and sign candidate entities respectively; if the two groups are not identical, the symptom and sign entity result is determined from them according to a preset determination strategy. By combining the conditional random field (CRF) statistical machine learning method with the term segmentation method, the present invention can recognize symptom and sign entities automatically, overcoming the problems that current entity recognition relies on a relatively narrow range of data sources and is inaccurate.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.
As shown in Fig. 1, an embodiment of the present invention provides a multi-data-source-oriented symptom and sign entity recognition method, comprising:
Step 101, obtaining a sentence to be processed from raw data.
Step 102, performing single-character segmentation on the sentence to be processed to determine each character in the sentence.
Step 103, determining, according to a pre-trained CRF model, the entity tag of each character in the sentence to be processed, and determining the entity tag sequence of the sentence.
Step 104, determining the first group of candidate entities of the sentence to be processed according to its entity tag sequence.
Step 105, performing term segmentation on the sentence to be processed according to a preset symptom and sign term segmentation strategy, to determine the second group of candidate entities.
Step 106, screening the candidate entities according to the end character of each candidate entity in the first and second groups of candidate entities, to form a first group of symptom and sign candidate entities and a second group of symptom and sign candidate entities respectively.
Step 107, if the first group of symptom and sign candidate entities and the second group of symptom and sign candidate entities are not identical, determining the symptom and sign entity result from the two groups according to a preset determination strategy.
In the multi-data-source-oriented symptom and sign entity recognition method provided by this embodiment of the present invention, a sentence to be processed is first obtained from raw data; single-character segmentation is performed on the sentence to determine each character in it; the entity tag of each character is determined according to a pre-trained CRF model, and the entity tag sequence of the sentence is determined; the first group of candidate entities is determined according to the entity tag sequence. Then, term segmentation is performed on the sentence according to a preset symptom and sign term segmentation strategy to determine the second group of candidate entities; the candidate entities are screened according to the end character of each candidate entity in the two groups, forming a first group and a second group of symptom and sign candidate entities respectively; if the two groups are not identical, the symptom and sign entity result is determined from them according to a preset determination strategy. By combining the conditional random field (CRF) statistical machine learning method with the term segmentation method, the present invention can recognize symptom and sign entities automatically, overcoming the problems that current entity recognition relies on a relatively narrow range of data sources and is inaccurate.
To help those skilled in the art better understand the present invention, it is illustrated below with a specific example. As shown in Fig. 2 and Fig. 3 (Fig. 2 is part A and Fig. 3 is part B of a multi-data-source-oriented symptom and sign entity recognition method; the method is split into parts A and B only because this embodiment has many steps, not because of any substantive difference; parts A and B together form steps 201 to 220, with Fig. 2 showing steps 201 to 211 and Fig. 3 showing steps 212 to 220), an embodiment of the present invention provides a multi-data-source-oriented symptom and sign entity recognition method, comprising:
Step 201, obtaining a sentence to be processed from raw data.
Specifically, the raw data include symptom and sign clinical treatment data, symptom and sign research and development experimental data, symptom and sign sales data, symptom and sign scientific literature data, symptom and sign e-commerce data, and the like, but are not limited thereto.
Step 202, performing single-character segmentation on the sentence to be processed to determine each character in the sentence.
For example, if the sentence to be processed is "一周前头晕加剧,伴咳嗽" ("dizziness worsened one week ago, accompanied by cough"), then after single-character segmentation the characters are: "一" "周" "前" "头" "晕" "加" "剧" "," "伴" "咳" "嗽".
Step 203, extracting, from a preset corpus, the CRF statistical feature values of each character in the sentence to be processed.
The preset corpus records each sentence in the raw data, the entities in each sentence, and the position and entity category of each entity in its sentence; the CRF statistical feature values include the word-segmentation feature value, part-of-speech feature value, character feature value, context feature value, and term-list feature value of each character in each sentence.
The preset corpus can be annotated manually in advance. For example, for the sentences:
"主诉:一周前头晕加剧,伴咳嗽。" ("Chief complaint: dizziness worsened one week ago, accompanied by cough.")
"查体:无压痛及反跳痛,肠鸣音4次/分。" ("Physical examination: no tenderness or rebound tenderness; bowel sounds 4 times/min.")
the symptom and sign entities can be annotated respectively as:
c=头晕 (dizziness) P=1:6 1:7 t=symptom
c=咳嗽 (cough) P=1:12 1:13 t=symptom
c=压痛 (tenderness) P=2:4 2:5 t=sign
c=反跳痛 (rebound tenderness) P=2:7 2:9 t=sign
c=肠鸣音 (bowel sounds) P=2:11 2:13 t=sign
Here, c denotes the symptom and sign entity; P denotes the line number, in the corpus, of the sentence containing the entity, together with the character positions within that sentence; and t denotes the symptom and sign entity category (in the present invention, the symptom and sign entity categories include symptom entities and sign entities).
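To make the annotation format concrete, the sketch below parses one annotation record and checks it against the corpus text. It assumes, based on the example positions above, that line numbers are 1-based and character positions are 0-based and inclusive; the record layout without the parenthesized English gloss, the regular expression, and all names are illustrative assumptions.

```python
import re
from typing import NamedTuple


class Annotation(NamedTuple):
    text: str    # entity string (field c)
    line: int    # 1-based line number in the corpus (from field P)
    start: int   # 0-based position of the first character (from field P)
    end: int     # 0-based position of the last character, inclusive
    etype: str   # entity category (field t)


RECORD = re.compile(r"c=(\S+)\s+P=(\d+):(\d+)\s+(\d+):(\d+)\s+t=(\S+)")


def parse_annotation(record):
    """Parse a record such as 'c=头晕 P=1:6 1:7 t=symptom'."""
    m = RECORD.fullmatch(record.strip())
    if m is None:
        raise ValueError(f"malformed annotation record: {record!r}")
    text, line1, start, line2, end, etype = m.groups()
    if line1 != line2:
        raise ValueError("entity must start and end on the same line")
    return Annotation(text, int(line1), int(start), int(end), etype)


def matches_corpus(ann, corpus_lines):
    """True if the annotated span of the corpus line equals the entity text."""
    line = corpus_lines[ann.line - 1]
    return line[ann.start:ann.end + 1] == ann.text


corpus = ["主诉:一周前头晕加剧,伴咳嗽。", "查体:无压痛及反跳痛,肠鸣音4次/分。"]
print(matches_corpus(parse_annotation("c=反跳痛 P=2:7 2:9 t=sign"), corpus))
```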
As for the CRF statistical feature values, take the sentence "无压痛及反跳痛,肠鸣音4次/分" ("no tenderness or rebound tenderness; bowel sounds 4 times/min"), whose entity tag sequence is "OBEOBIEOBIEOOOO". For example, the CRF statistical features of the character "痛" ("pain") are described in Table 1 below:
Table 1:
Step 204, determining a training model according to the CRF statistical feature values of each character in each sentence.
The training model is:
$$p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{j=1}^{n} \sum_{i=1}^{m} \lambda_i f_i(y_j, y_{j-1}, x) \right)$$
Step 205, calculating the entity tag $y_j$ of each character in the sentence to be processed according to the training model.
Here, $x$ denotes the sentence to be processed; $y_j$ denotes the entity tag of the character at position $j$ in the sentence; $f_i(y_j, y_{j-1}, x)$ denotes the value of word-segmentation feature $i$ in the sentence; $\lambda_i$ is a model parameter, and the parameters obtained by training maximize the sum of $p(y \mid x)$ over the training sentences; $m$ is the number of word-segmentation features; $n$ is the number of characters in the sentence; $Z(x)$ is the normalization factor; and $p(y \mid x)$ is the tagging probability of the characters in the sentence to be processed.
As for $f_i(y_j, y_{j-1}, x)$: if $y_j$, $y_{j-1}$, and $x$ all appear together in the corpus, then $f_i(y_j, y_{j-1}, x) = 1$; otherwise it is 0.
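For intuition, the probability $p(y \mid x)$ of a linear-chain CRF of this form can be computed by brute force on a very short sentence, as sketched below. The two indicator features, their weights, the "START" tag used for $y_0$, and the explicit position argument `j` passed to each feature are assumptions made for the example; a practical system would use a CRF toolkit with dynamic programming rather than enumeration.

```python
import itertools
import math

TAGS = ("B", "I", "E", "O")

# Hypothetical indicator features f_i(y_j, y_{j-1}, x) evaluated at position j.
FEATURES = [
    lambda y, prev, x, j: 1.0 if x[j] == "痛" and y == "E" else 0.0,
    lambda y, prev, x, j: 1.0 if prev == "B" and y == "E" else 0.0,
]
WEIGHTS = [1.0, 1.0]  # hypothetical lambda_i values


def score(y, x, features=FEATURES, weights=WEIGHTS):
    """Sum over positions j and features i of lambda_i * f_i(y_j, y_{j-1}, x)."""
    total, prev = 0.0, "START"
    for j, tag in enumerate(y):
        total += sum(w * f(tag, prev, x, j) for f, w in zip(features, weights))
        prev = tag
    return total


def crf_probability(y, x, features=FEATURES, weights=WEIGHTS):
    """p(y | x) = exp(score(y, x)) / Z(x), with Z(x) summed over all tag sequences."""
    z = sum(math.exp(score(cand, x, features, weights))
            for cand in itertools.product(TAGS, repeat=len(x)))
    return math.exp(score(tuple(y), x, features, weights)) / z


print(crf_probability(("B", "E"), "压痛"))
```

Enumerating all tag sequences costs 4^n evaluations, so this is only usable for checking small cases by hand.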
Step 206, combining the entity tags of the characters to form the entity tag sequence of the sentence to be processed.
For example, for the sentence "无压痛及反跳痛,肠鸣音4次/分" ("no tenderness or rebound tenderness; bowel sounds 4 times/min"), the entity tag sequence is "OBEOBIEOBIEOOOO".
Step 207, determining the word-segmentation feature value corresponding to each character in the entity tag sequence, and determining the first group of candidate entities of the sentence to be processed according to the word-segmentation feature values.
For example, for "无压痛及反跳痛,肠鸣音4次/分", the entity tag sequence is "OBEOBIEOBIEOOOO", so the first group of candidate entities can be recognized as "压痛" (tenderness), "反跳痛" (rebound tenderness), and "肠鸣音" (bowel sounds).
Step 208, converting the punctuation marks in the sentence to be processed to half-width, and unifying English letters to upper case.
Step 209, calling a preset non-medical term table, checking whether the original character string of the sentence to be processed contains terms from the non-medical term table, and deleting any such terms from the sentence to form a preprocessed sentence to be processed.
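Steps 208 and 209 can be sketched as the following preprocessing functions. The conversion maps the Unicode full-width forms U+FF01 to U+FF5E (which include full-width letters and digits as well as punctuation) and the ideographic space to their ASCII counterparts; the non-medical term used in the usage line is a hypothetical example, since the contents of the preset table are not given.

```python
# Full-width forms U+FF01..U+FF5E sit at a fixed offset of 0xFEE0 above ASCII.
FULL_TO_HALF = {cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)}
FULL_TO_HALF[0x3000] = 0x20  # ideographic space -> ASCII space


def normalize_sentence(sentence):
    """Step 208: convert full-width characters to half-width, then upper-case letters."""
    return sentence.translate(FULL_TO_HALF).upper()


def remove_non_medical_terms(sentence, non_medical_terms):
    """Step 209: delete every listed non-medical term, longest terms first."""
    for term in sorted(non_medical_terms, key=len, reverse=True):
        sentence = sentence.replace(term, "")
    return sentence


def preprocess(sentence, non_medical_terms):
    """Apply step 208 followed by step 209."""
    return remove_non_medical_terms(normalize_sentence(sentence), non_medical_terms)


print(preprocess("一周前头晕加剧\uff0c伴咳嗽", {"一周前"}))
# → 头晕加剧,伴咳嗽
```

Deleting longer terms first avoids a short term removing part of a longer one before the longer one can match.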
Step 210, matching the preprocessed sentence to be processed against a preset symptom and sign database using the reverse maximum matching principle, extracting as preliminary entities the character strings in the preprocessed sentence that match a standard term name or synonym in the symptom and sign database, and taking the term type corresponding to the standard term name or synonym as the entity type of the preliminary entity.
Here, the preset symptom and sign database may include a symptom and sign data table as shown in Table 2 below. The table can be built by extending the international ICD-10 classification and authoritative medical reference books. It contains relations such as synonymy between words and concept classification relations between words, which appear in the table as standard term names, synonyms, hypernyms, and so on.
Table 2:
Standard term name | Synonym | Hypernym name | Term type
Pain               |         |               | Symptom
Headache           |         | Pain          | Symptom
Tenderness         |         |               | Sign
Blood pressure     |         |               | Sign
Heart rate         |         |               | Sign
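Step 210 can be sketched as follows, using the standard term names of Table 2 as the lookup. This is a minimal illustration of reverse (backward) maximum matching, not the embodiment's implementation: the names `TERMS` and `backward_max_match` are hypothetical, the terms are written in uppercase to match the output of step 208, and synonyms would simply be added as further keys.

```python
from typing import Dict, List, Tuple

# Hypothetical lookup: surface form (standard term name or synonym) -> term type.
TERMS: Dict[str, str] = {
    "PAIN": "Symptom",
    "HEADACHE": "Symptom",
    "TENDERNESS": "Sign",
    "BLOOD PRESSURE": "Sign",
    "HEART RATE": "Sign",
}


def backward_max_match(sentence: str, terms: Dict[str, str]) -> List[Tuple[str, str]]:
    """Scan from the end of the sentence, taking the longest term that ends at the cursor."""
    max_len = max(map(len, terms), default=0)
    found: List[Tuple[str, str]] = []
    end = len(sentence)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):   # try the longest window first
            piece = sentence[end - size:end]
            if piece in terms:
                found.append((piece, terms[piece]))
                end -= size
                break
        else:
            end -= 1            # no term ends here; move one character to the left
    found.reverse()             # restore left-to-right order
    return found
```

For example, `backward_max_match("HEADACHE, NORMAL HEART RATE", TERMS)` yields the preliminary entities HEADACHE (Symptom) and HEART RATE (Sign).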
Step 211: match the original character string of the preprocessed sentence to be processed against the preset sentence pattern
database.
Step 212: if the original character string of the preprocessed sentence to be processed matches a sentence pattern format in the
preset sentence pattern database, match that original character string against the preset disease ontology database using the
reverse maximum matching principle; extract each character string that matches a standard term name or synonym in the disease
ontology database as a preliminary entity, and take the term type corresponding to that standard term name or synonym as
the entity type of the preliminary entity.
Steps 211 and 212 are performed because the remaining character string may still contain entities that have not been extracted,
so a further check and extraction are needed.
The preset sentence pattern database here may include a sentence pattern data table as shown in Table 3 below:
Table 3:
Step 213: take the preliminary entities in the preprocessed sentence to be processed as the second group of candidate entities.
The specific rules of steps 210 and 212 above thus yield the final second group of candidate entities.
Step 214: judge whether the end character of each candidate entity in the first group of candidate entities and the second group of
candidate entities is a preset non-symptom-and-sign term character.
The preset non-symptom-and-sign term characters can be, for example, "medicine, operation, art, inspection", etc.
Step 215: if the end character of a candidate entity is a preset non-symptom-and-sign term character, discard that
candidate entity.
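Steps 214 and 215 amount to a suffix filter, sketched below. The suffix list is an assumption: it uses likely Chinese forms of the translated examples above (medicine, operation, the surgery suffix, examination), which the text itself does not give, and the name `drop_non_symptom` is hypothetical.

```python
from typing import List

# Assumed Chinese forms of the example end characters; not taken from the embodiment.
NON_SYMPTOM_SUFFIXES = ("药", "手术", "术", "检查")


def drop_non_symptom(candidates: List[str]) -> List[str]:
    """Keep only the candidates that do not end with a non-symptom-and-sign suffix."""
    return [c for c in candidates if not c.endswith(NON_SYMPTOM_SUFFIXES)]
```

The same function would be applied to the first group and to the second group separately.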
After step 215, step 216 or step 219 is executed.
Step 216: when the first group of symptom and sign class candidate entities and the second group of symptom and sign class candidate entities
are not identical, determine whether the sentence to be processed was cut by the preset segmentation rules during term cutting,
that is, whether it was processed by steps 211 and 212 above.
After step 216, step 217 or step 218 is executed.
Step 217: if the sentence to be processed was cut by the preset segmentation rules during term cutting,
select the candidate entities in the second group of symptom and sign class candidate entities as the symptom and sign class entity result.
Step 218: if the sentence to be processed was not cut by the preset segmentation rules during term cutting,
select the candidate entities in the first group of symptom and sign class candidate entities as the symptom and sign class entity result.
Step 219: when the first group of symptom and sign class candidate entities and the second group of symptom and sign class candidate entities
are not identical, compare the two groups derived from the original character string of the same sentence to be processed,
and take the group that has fewer entities and whose entities contain more characters as the symptom and sign class entity
result.
For example, the initial data is "showing as abdominal distention".
The first group of symptom and sign class candidate entities is "abdominal distention [symptom]";
the second group of symptom and sign class candidate entities is "bulge [symptom]";
the final result is then "abdominal distention [symptom]".
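The two selection strategies (steps 216 to 218, or step 219) can be combined in one hedged sketch. The function name `choose_group` and the `cut_by_rules` flag are hypothetical. The text does not say how to weigh entity count against character count when they disagree, so this sketch assumes that a smaller entity count takes priority and that the total number of characters breaks ties.

```python
from typing import List, Optional, Tuple


def choose_group(first: List[str], second: List[str],
                 cut_by_rules: Optional[bool] = None) -> List[str]:
    """Pick one candidate group when the two groups differ."""
    if first == second:
        return first
    if cut_by_rules is not None:
        # Steps 216-218: the term cutting result is used only if the preset rules fired.
        return second if cut_by_rules else first

    # Step 219: fewer entities first, then more characters in total (assumed ordering).
    def rank(group: List[str]) -> Tuple[int, int]:
        return (len(group), -sum(len(entity) for entity in group))

    return min(first, second, key=rank)
```

With the example above, `choose_group(["abdominal distention"], ["bulge"])` returns the first group, because both groups hold one entity and the first contains more characters.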
After step 217, step 218, or step 219, step 220 is executed.
Step 220: when corresponding entities in the first group of symptom and sign class candidate entities and the second group of symptom and sign
class candidate entities have inconsistent entity types, select the entity type of the entity in the second group of candidate entities as the
entity type of the corresponding entity.
For example, the initial data is "having peritonitis performance therewith".
The first group of symptom and sign class candidate entities is "peritonitis [disease]";
the second group of symptom and sign class candidate entities is "peritonitis [symptom]";
the final result is then "peritonitis [symptom]".
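A minimal sketch of this type reconciliation follows. It assumes that "corresponding" entities are those with the same surface string and that each group is held as a mapping from entity string to entity type; the function name `reconcile_types` is hypothetical.

```python
from typing import Dict


def reconcile_types(result: Dict[str, str], second: Dict[str, str]) -> Dict[str, str]:
    """For entities that also occur in the second group, take the second group's type."""
    return {name: second.get(name, etype) for name, etype in result.items()}
```

For the example above, `reconcile_types({"peritonitis": "disease"}, {"peritonitis": "symptom"})` gives `{"peritonitis": "symptom"}`.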
Through steps 201 to 220 above, the symptom and sign class entity recognition result is finally obtained.
In addition, to update the corpus, new sentence pattern features can be found through manual summarization, manually
annotated, and added to the corpus. Furthermore, when the sentence to be processed has not been annotated in the preset corpus,
the uncertainty value of each entity in the sentence to be processed can be determined according to a formula,
where IE_k is the uncertainty value of the k-th entity; k_start is the start position of the entity indicia of the k-th entity; k_end is
the end position of the entity indicia of the k-th entity; and the probability term in the formula is the probability of the j-th entity
indicia for the character at position s in the sentence to be processed.
For example, for "no tenderness and rebound tenderness, 4 beats/min of gurgling sound" the entity indicia sequence is "OBEOBIEOBIEOOOO" and
the position sequence is "0 1 2 3 4 5 6 7 8 9 10 11 12 13 14". The entity "tenderness" occupies positions 1 and 2; therefore
k_start is 1 and k_end is 2.
Match the entities in the sentence to be processed whose uncertainty value is 1 against the preset symptom and sign ontology library;
if the match succeeds, save the entity indicia of the matched entity.
Determine the prediction confidence of the sentence to be processed and the proportion of dictionary-matched entities.
Add to the corpus each sentence to be processed whose prediction confidence is greater than the preset confidence threshold and whose
proportion of dictionary-matched entities is greater than the preset proportion threshold, so as to update the corpus.
Here, the prediction confidence is the product of the labeling probabilities of the characters in the sentence to be processed.
The proportion of dictionary-matched entities is C/B, where C is the number of entities predicted in the sentence to be processed
that also appear in the preset dictionary, and B is the total number of entities predicted in the sentence to be processed.
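The corpus update decision can be sketched as below. The sketch assumes the proportion is C divided by B, as the definitions of C and B suggest; the function names and the threshold values 0.8 and 0.5 are placeholders, since the text does not state the preset thresholds.

```python
import math
from typing import Iterable, Sequence, Set


def prediction_confidence(char_probs: Iterable[float]) -> float:
    """Product of the labeling probabilities of all characters in the sentence."""
    return math.prod(char_probs)


def dictionary_ratio(predicted: Sequence[str], dictionary: Set[str]) -> float:
    """Share of the predicted entities (B) that also appear in the dictionary (C)."""
    if not predicted:
        return 0.0                      # no predicted entities, nothing to confirm
    matched = sum(1 for entity in predicted if entity in dictionary)
    return matched / len(predicted)


def should_add(char_probs: Iterable[float], predicted: Sequence[str], dictionary: Set[str],
               conf_threshold: float = 0.8, ratio_threshold: float = 0.5) -> bool:
    """Both values must exceed their thresholds (the defaults here are placeholders)."""
    return (prediction_confidence(char_probs) > conf_threshold
            and dictionary_ratio(predicted, dictionary) > ratio_threshold)
```

Because the confidence is a product over all characters, it shrinks as sentences get longer, so a real threshold would need to be chosen with sentence length in mind.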
As can be seen, updating the corpus in this way supplies the corpus data needed for entity recognition: a semi-supervised self-learning
method continually enriches the corpus, solving the problem of an insufficient and incomplete corpus.
The symptom and sign class entity recognition method for multiple data sources provided by the embodiment of the present invention first obtains the sentence to be processed from the initial data; performs individual character cutting on the sentence to be processed to determine each character in it; determines, according to the CRF training model trained in advance, the entity indicia of each character in the sentence to be processed, and determines the entity indicia sequence of the sentence to be processed; and determines the first group of candidate entities of the sentence to be processed according to that sequence. Then, according to the preset symptom and sign class term cutting strategy, it performs term cutting on the sentence to be processed to determine the second group of candidate entities. Each candidate entity in the first and second groups is screened according to its end character, forming the first group of symptom and sign class candidate entities and the second group of symptom and sign class candidate entities respectively. If the two groups are not identical, the symptom and sign class entity result is determined from them according to the preset determination strategy. By combining the conditional random field (CRF) statistical machine learning method with the term cutting method, the present invention can automatically recognize symptom and sign class entities, overcoming the problems that current entity recognition draws on a relatively single data source and recognizes entities inaccurately.
Corresponding to the method embodiments shown in Fig. 1, Fig. 2, and Fig. 3 above, and as shown in Fig. 4, an embodiment of the present invention provides a
symptom and sign class entity recognition device for multiple data sources, comprising:
A to-be-processed sentence acquiring unit 31, configured to obtain the sentence to be processed from the initial data.
An individual character cutting unit 32, configured to perform individual character cutting on the sentence to be processed and determine each character in the sentence to be processed.
An entity indicia sequence determination unit 33, configured to determine, according to the CRF training model trained in advance, the entity indicia of each character in the sentence to be processed, and to determine the entity indicia sequence of the sentence to be processed.
A first group candidate entity determination unit 34, configured to determine the first group of candidate entities of the sentence to be processed according to the entity indicia sequence of the sentence to be processed.
A second group candidate entity determination unit 35, configured to perform term cutting on the sentence to be processed according to the preset symptom and sign class term cutting strategy and determine the second group of candidate entities.
A candidate entity screening unit 36, configured to screen each candidate entity according to the end character of each candidate entity in the first group of candidate entities and the second group of candidate entities, forming the first group of symptom and sign class candidate entities and the second group of symptom and sign class candidate entities respectively.
A symptom and sign class entity result determination unit 37, configured to determine, when the first group of symptom and sign class candidate entities and the second group of symptom and sign class candidate entities are not identical, the symptom and sign class entity result from the two groups according to the preset determination strategy.
Specifically, as shown in Fig. 5, the symptom and sign class entity result determination unit 37 comprises:
A term cutting judgment module 371, configured to determine whether the sentence to be processed was cut by the preset segmentation rules during term cutting.
A symptom and sign class entity result determining module 372, configured to select the candidate entities in the second group of symptom and sign class candidate entities as the symptom and sign class entity result when the sentence to be processed was cut by the preset segmentation rules during term cutting, and to select the candidate entities in the first group of symptom and sign class candidate entities as the symptom and sign class entity result when the sentence to be processed was not cut by the preset segmentation rules during term cutting.
The symptom and sign class entity result determining module 372 is also configured to compare the first group of symptom and sign class candidate entities and the second group of symptom and sign class candidate entities derived from the original character string of the same sentence to be processed, and to take the group that has fewer entities and whose entities contain more characters as the symptom and sign class entity result. The entity types in the symptom and sign class entity result include symptom entities and sign entities.
An entity type determining module 373, configured to select, when corresponding entities in the first group of symptom and sign class candidate entities and the second group of symptom and sign class candidate entities have inconsistent entity types, the entity type of the entity in the second group of candidate entities as the entity type of the corresponding entity.
Specifically, the initial data in the to-be-processed sentence acquiring unit 31 includes electronic medical record data, settlement statement
data, clinical research data, medical knowledge base data, and periodical literature data.
Further, as shown in Fig. 5, the entity indicia sequence determination unit 33 comprises:
A CRF statistical feature value extraction module 331, configured to extract the CRF statistical feature values of each character in the sentence to be processed from the preset corpus. The preset corpus records each sentence in the initial data, the entities in each sentence, and the position and entity category of each entity in its sentence. The CRF statistical feature values include the word segmentation feature value, part-of-speech feature value, character feature value, context feature value, and nomenclature feature value of each character in each sentence.
A training model determining module 332, configured to determine a training model according to the CRF statistical feature values of each character in each sentence; the symbols of the training model are defined below.
An entity indicia computing module 333, configured to calculate the entity indicia y_j of each character in the sentence to be processed according to the training model.
An entity indicia sequence determining module 334, configured to combine the entity indicia of the characters to form the entity indicia sequence of the sentence to be processed.
In the training model, x denotes the sentence to be processed; y_j denotes the entity indicia of the character at position j in the sentence to be processed; f_i(y_j, y_{j-1}, x) denotes the function value of word segmentation feature i in the sentence to be processed; λ_i is a model parameter; m denotes the number of word segmentation features; n denotes the number of characters in the sentence to be processed; Z(x) denotes the normalization factor; and P(y | x) denotes the labeling probability of the characters in the sentence to be processed.
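The symbols defined above are those of the usual linear-chain conditional random field. As a hedged reconstruction only (the formula is not legible in this text, so this is an assumption based on the symbol definitions rather than the embodiment's own expression), a training model of that standard form would read as follows, with Z(x) summing the same exponential over every candidate indicia sequence y':

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{j=1}^{n} \sum_{i=1}^{m} \lambda_i \, f_i(y_j, y_{j-1}, x) \right),
\qquad
Z(x) = \sum_{y'} \exp\left( \sum_{j=1}^{n} \sum_{i=1}^{m} \lambda_i \, f_i(y'_j, y'_{j-1}, x) \right)
```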
In addition, the first group candidate entity determination unit 34 is specifically configured to:
determine the word segmentation feature value corresponding to each character in the entity indicia sequence, and determine the first group of candidate entities
of the sentence to be processed according to the word segmentation feature values.
Further, as shown in Fig. 5, the symptom and sign class entity recognition device for multiple data sources further includes a corpus updating unit 38, configured to:
when the sentence to be processed has not been annotated in the preset corpus, determine the uncertainty value of each entity in the sentence to be processed according to a formula, where IE_k is the uncertainty value of the k-th entity; k_start is the start position of the entity indicia of the k-th entity; k_end is the end position of the entity indicia of the k-th entity; and the probability term in the formula is the probability of the j-th entity indicia for the character at position s in the sentence to be processed;
match the entities in the sentence to be processed whose uncertainty value is 1 against the preset symptom and sign ontology library, and save the entity indicia of each matched entity when the match succeeds;
determine the prediction confidence of the sentence to be processed and the proportion of dictionary-matched entities;
add to the corpus each sentence to be processed whose prediction confidence is greater than the preset confidence threshold and whose proportion of dictionary-matched entities is greater than the preset proportion threshold, so as to update the corpus.
Here, the prediction confidence is the product of the labeling probabilities of the characters in the sentence to be processed.
The proportion of dictionary-matched entities is C/B, where C is the number of entities predicted in the sentence to be processed that also appear in the preset dictionary, and B is the total number of entities predicted in the sentence to be processed.
In addition, as shown in Fig. 5, the second group candidate entity determination unit 35 comprises:
A preprocessing module 351, configured to convert the punctuation marks in the sentence to be processed to half-width and unify the English letters to uppercase; and to call the preset non-medical term table, check whether the original character string of the sentence to be processed contains terms from the non-medical term table, and delete any such terms from the sentence, forming the preprocessed sentence to be processed.
A symptom and sign ontology library matching module 352, configured to match the preprocessed sentence to be processed against the preset symptom and sign database using the reverse maximum matching principle, extract each character string in the preprocessed sentence that matches a standard term name or synonym in the symptom and sign database as a preliminary entity, and take the term type corresponding to that standard term name or synonym as the entity type of the preliminary entity; to match the original character string of the preprocessed sentence to be processed against the preset sentence pattern database; and, if the original character string of the preprocessed sentence to be processed matches a sentence pattern format in the preset sentence pattern database, to match that original character string against the preset disease ontology database using the reverse maximum matching principle, extract each character string that matches a standard term name or synonym in the disease ontology database as a preliminary entity, and take the term type corresponding to that standard term name or synonym as the entity type of the preliminary entity.
A second group candidate entity determining module 353, configured to take the preliminary entities in the preprocessed sentence to be processed as the second group of candidate entities.
In addition, as shown in Fig. 5, the candidate entity screening unit 36 comprises:
A non-symptom-and-sign term character judgment module 361, configured to judge whether the end character of each candidate entity in the first group of candidate entities and the second group of candidate entities is a preset non-symptom-and-sign term character.
A candidate entity discarding module 362, configured to discard a candidate entity when its end character is a preset non-symptom-and-sign term character.
It is worth noting that, for the specific implementation of the symptom and sign class entity recognition device for multiple data sources provided by the embodiment of the present invention,
reference may be made to the method embodiments above; details are not repeated here.
The symptom and sign class entity recognition device for multiple data sources provided by the embodiment of the present invention first obtains the sentence to be processed from the initial data; performs individual character cutting on the sentence to be processed to determine each character in it; determines, according to the CRF training model trained in advance, the entity indicia of each character in the sentence to be processed, and determines the entity indicia sequence of the sentence to be processed; and determines the first group of candidate entities of the sentence to be processed according to that sequence. Then, according to the preset symptom and sign class term cutting strategy, it performs term cutting on the sentence to be processed to determine the second group of candidate entities. Each candidate entity in the first and second groups is screened according to its end character, forming the first group of symptom and sign class candidate entities and the second group of symptom and sign class candidate entities respectively. If the two groups are not identical, the symptom and sign class entity result is determined from them according to the preset determination strategy. By combining the conditional random field (CRF) statistical machine learning method with the term cutting method, the present invention can automatically recognize symptom and sign class entities, overcoming the problems that current entity recognition draws on a relatively single data source and recognizes entities inaccurately.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) that contain computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Specific embodiments are used herein to explain the principles and implementations of the present invention; the description of the above embodiments is only intended to help understand the method of the present invention and its core idea. At the same time, for those of ordinary skill in the art, there will be changes in the specific implementations and scope of application according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.