CN106897559B

CN106897559B - Symptom and sign entity recognition method and device for multiple data sources

Info

Publication number: CN106897559B
Application number: CN201710103706.4A
Authority: CN
Inventors: 李雪莉; 关毅; 黄玉丽
Original assignee: Heilongjiang Teshi Information Technology Co Ltd; Harbin Institute of Technology
Current assignee: Yi Bao Interconnected Medical Information Technology (Beijing) Co., Ltd.; Harbin Institute of Technology
Priority date: 2017-02-24
Filing date: 2017-02-24
Publication date: 2019-09-17
Anticipated expiration: 2037-02-24
Also published as: CN106897559A

Abstract

The present invention provides a symptom and sign entity recognition method and device for multiple data sources, and relates to the technical field of medical entity recognition. The method includes: obtaining a sentence to be processed from raw data; splitting the sentence into single characters; determining, according to a pre-trained CRF model, the entity label of each character in the sentence, and thereby the entity label sequence of the sentence; determining a first group of candidate entities from the entity label sequence; performing term segmentation on the sentence according to a preset symptom and sign term segmentation strategy to determine a second group of candidate entities; screening the candidate entities to form a first group and a second group of symptom and sign candidate entities; and determining the symptom and sign entity result according to a preset decision strategy.

Description

Symptom and sign entity recognition method and device for multiple data sources

Technical field

The present invention relates to the technical field of medical entity recognition, and in particular to a symptom and sign entity recognition method and device for multiple data sources.

Background art

At present, with the development of network and medical information technology, the gradual aging of the Chinese population and the gradual rise of Internet-based medical care, people's demands on the level of medical services keep growing. The contradiction between this demand and the relative shortage of medical resources is becoming increasingly apparent. Intelligent diagnosis and treatment of diseases depends on identifying, from medical big data, the correspondence between diseases and their symptoms and signs; this process is symptom and sign entity recognition.

In recent years, as an important step in the analysis of medical and health data, medical entity recognition (for example, symptom and sign entity recognition) has been used to extract the medical terms present in the relevant text, and it plays an important role in the performance of subsequent research. The entity recognition techniques in common use today are vocabulary-based medical entity recognition and medical entity recognition based on conditional random fields (Conditional Random Fields, CRF for short). Vocabulary-based recognition relies solely on matching against a term bank; it does not take context into account, and term-bank matching is quite limited. CRF-based recognition, in turn, makes no use of large-scale corpora or linguistic rules: its corpus consists only of manually annotated text, and it does not use methods such as semi-supervised learning to exploit the much larger volume of unlabeled data, so the model is incomplete. It also lacks rules based on linguistics and medical information and relies solely on the model, so it is poorly targeted to the data. Current entity recognition schemes therefore cannot recognize symptom and sign entities accurately.

Summary of the invention

Embodiments of the present invention provide a symptom and sign entity recognition method and device for multiple data sources, to solve the problem that current entity recognition schemes cannot recognize symptom and sign entities accurately.

In order to achieve the above objectives, the present invention adopts the following technical scheme:

A symptom and sign entity recognition method for multiple data sources, comprising:

obtaining a sentence to be processed from raw data;

splitting the sentence to be processed into single characters, and determining each character in the sentence;

determining, according to a pre-trained CRF model, the entity label of each character in the sentence to be processed, and determining the entity label sequence of the sentence;

determining a first group of candidate entities of the sentence according to its entity label sequence;

performing term segmentation on the sentence according to a preset symptom and sign term segmentation strategy, and determining a second group of candidate entities;

screening each candidate entity according to the end character of each candidate entity in the first and second groups of candidate entities, to form a first group of symptom and sign candidate entities and a second group of symptom and sign candidate entities, respectively; and

if the first group of symptom and sign candidate entities differs from the second group, determining the symptom and sign entity result from the two groups according to a preset decision strategy.

Specifically, determining the symptom and sign entity result from the first and second groups of symptom and sign candidate entities according to the preset decision strategy comprises:

determining whether the sentence to be processed was segmented by a preset segmentation rule during term segmentation;

if the sentence was segmented by a preset segmentation rule during term segmentation, selecting the candidate entities in the second group of symptom and sign candidate entities as the symptom and sign entity result;

if the sentence was not segmented by a preset segmentation rule during term segmentation, selecting the candidate entities in the first group of symptom and sign candidate entities as the symptom and sign entity result;

or, of the first and second groups of symptom and sign candidate entities derived from the same original character string of the sentence, selecting as the symptom and sign entity result the group that has fewer entities and whose entities contain more characters.

The entity types in the symptom and sign entity result include symptom entities and sign entities.

When the entity types of corresponding entities in the first and second groups of symptom and sign candidate entities are inconsistent, the entity type of the entity in the second group of candidate entities is selected as the entity type of that entity.
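The decision strategy above can be sketched in Python as follows. This is only an illustration under stated assumptions: the `Entity` class and all function names are hypothetical, "corresponding entities" are taken to be entities with identical text, and the "fewer entities, more characters" rule is read as comparing entity count first and total character count second, with ties going to the term segmentation group. The description itself does not fix these details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str
    etype: str  # "symptom" or "sign"


def choose_by_rule_flag(crf_group, term_group, used_segmentation_rule):
    """Second group if a preset segmentation rule was applied, else the first."""
    return term_group if used_segmentation_rule else crf_group


def choose_by_granularity(crf_group, term_group):
    """Alternative strategy: fewer entities first, then more characters in total.

    Assumption: on a tie, the term segmentation group (listed first) wins.
    """
    def key(group):
        return (len(group), -sum(len(e.text) for e in group))
    return min((term_group, crf_group), key=key)


def reconcile_types(chosen, term_group):
    """Where both groups contain the same text, take the second group's type."""
    term_types = {e.text: e.etype for e in term_group}
    return [Entity(e.text, term_types.get(e.text, e.etype)) for e in chosen]
```

For example, if the CRF group splits "cough" into two fragments while the term segmentation group keeps it whole, `choose_by_granularity` returns the term segmentation group.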

Specifically, the raw data includes electronic medical record data, settlement statement data, clinical research data, medical knowledge base data and journal literature data.

Specifically, determining the entity label of each character in the sentence to be processed according to the pre-trained CRF model, and determining the entity label sequence of the sentence, comprises:

extracting the CRF statistical feature values of each character in the sentence from a preset corpus, where the preset corpus records each sentence in the raw data, the entities in each sentence, and the position and entity class of each entity in its sentence, and the CRF statistical feature values include the segmentation feature value, part-of-speech feature value, character feature value, context feature value and term-list feature value of each character in each sentence;

determining a training model according to the CRF statistical feature values of each character in each sentence, the training model being:

$$p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{j=1}^{n} \sum_{i=1}^{m} \lambda_i f_i(y_j, y_{j-1}, x) \right)$$

calculating the entity label y_j of each character in the sentence according to the training model; and

combining the entity labels of the characters to form the entity label sequence of the sentence; where x denotes the sentence to be processed; y_j denotes the entity label of the character at position j in the sentence; f_i(y_j, y_{j-1}, x) denotes the function value of segmentation feature i in the sentence; λ_i is a model parameter; m denotes the number of segmentation features; n denotes the number of character positions in the sentence; Z(x) denotes the normalization factor; and p(y | x) denotes the labeling probability of the characters in the sentence.

Specifically, determining the first group of candidate entities of the sentence to be processed according to its entity label sequence comprises:

determining the segmentation feature value corresponding to each character in the entity label sequence, and determining the first group of candidate entities of the sentence according to those segmentation feature values.
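As a minimal sketch of how candidate entities might be read off a label sequence, the following Python function assumes the labels are the B (begin), I (inside), E (end) and O (outside) marks that appear in the example label sequence "OBEOBIEOBIEOOOO" in the description. The function name and the returned tuple layout are hypothetical.

```python
def decode_entities(chars, labels):
    """Collect (text, start, end) spans that open with "B" and close with "E".

    Assumes a B/I/E/O label set; "I" simply keeps the current span open.
    """
    entities = []
    start = None
    for i, label in enumerate(labels):
        if label == "B":
            start = i
        elif label == "E" and start is not None:
            entities.append(("".join(chars[start:i + 1]), start, i))
            start = None
        elif label == "O":
            start = None
    return entities
```

Applied to a 15-character sentence labeled "OBEOBIEOBIEOOOO", this yields three spans, at positions 1 to 2, 4 to 6 and 8 to 10.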

Further, the symptom and sign entity recognition method for multiple data sources also comprises:

when the sentence to be processed has not been annotated in the preset corpus, determining the uncertainty value of each entity in the sentence according to a formula in which IE_k is the uncertainty value of the k-th entity, k_start is the start position of the entity label of the k-th entity, k_end is the end position of the entity label of the k-th entity, and the remaining term is the probability that the character at position s in the sentence takes the j-th entity label;

matching the entities in the sentence whose uncertainty value is 1 against a preset symptom and sign ontology library, and, if the match succeeds, saving the entity labels of the successfully matched entities;

determining the prediction confidence of the sentence and the proportion of dictionary-matched entities;

adding to the corpus each sentence whose prediction confidence is greater than a preset confidence threshold and whose proportion of dictionary-matched entities is greater than a preset ratio threshold, so as to update the corpus;

where the prediction confidence is the product of the labeling probabilities of the characters in the sentence; and

the proportion of dictionary-matched entities is $C / B$, where C is the number of entities predicted in the sentence that appear in the preset dictionary, and B is the total number of entities predicted in the sentence.
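The two corpus-update criteria can be sketched as below. The function names and the default thresholds (0.8 and 0.5) are hypothetical placeholders; the description only states that preset thresholds exist, not their values. The uncertainty value IE_k is not covered here.

```python
import math


def prediction_confidence(label_probs):
    """Product of the per-character labeling probabilities."""
    return math.prod(label_probs)


def dictionary_match_ratio(predicted_entities, dictionary):
    """C / B: predicted entities found in the dictionary over all predicted."""
    if not predicted_entities:
        return 0.0  # assumption: a sentence with no entities never qualifies
    matched = sum(1 for e in predicted_entities if e in dictionary)
    return matched / len(predicted_entities)


def should_add_to_corpus(label_probs, predicted_entities, dictionary,
                         conf_threshold=0.8, ratio_threshold=0.5):
    """Both values must be strictly greater than their (hypothetical) thresholds."""
    return (prediction_confidence(label_probs) > conf_threshold
            and dictionary_match_ratio(predicted_entities, dictionary) > ratio_threshold)
```

Because the confidence is a product over all characters, long sentences tend to score lower than short ones, so a fixed threshold implicitly favors shorter sentences.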

Specifically, performing term segmentation on the sentence to be processed according to the preset symptom and sign term segmentation strategy, and determining the second group of candidate entities, comprises:

converting the punctuation marks in the sentence to half-width and converting all English letters to upper case;

calling a preset table of non-medical terms, checking whether the original character string of the sentence contains any term in that table, and deleting from the sentence any such term that is present, to form a preprocessed sentence;

matching the preprocessed sentence against a preset symptom and sign database using the reverse maximum matching principle, extracting each character string in the preprocessed sentence that matches a standard term name or synonym in the symptom and sign database as a preliminary entity, and taking the term type of that standard term name or synonym as the entity type of the preliminary entity;

matching the original character string of the preprocessed sentence against a preset sentence pattern database;

if the original character string of the preprocessed sentence matches a sentence pattern in the preset sentence pattern database, matching that character string against a preset disease ontology database using the reverse maximum matching principle, extracting each character string that matches a standard term name or synonym in the disease ontology database as a preliminary entity, and taking the term type of that standard term name or synonym as the entity type of the preliminary entity; and

taking the preliminary entities in the preprocessed sentence as the second group of candidate entities.
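The preprocessing and reverse maximum matching steps can be sketched as follows. This is an illustration under assumptions, not the patented implementation: the function names are hypothetical, the symptom and sign database is modeled as a plain dictionary from term to type, Unicode NFKC normalization stands in for the half-width conversion (it folds full-width letters and most full-width punctuation, but not every CJK punctuation mark), and the sentence pattern and disease ontology stage is omitted.

```python
import unicodedata


def normalize(sentence, non_medical_terms):
    """Fold full-width forms to half-width, upper-case, drop non-medical terms."""
    text = unicodedata.normalize("NFKC", sentence).upper()
    for term in non_medical_terms:
        text = text.replace(term.upper(), "")
    return text


def reverse_max_match(text, term_types):
    """Scan from the end of the text, always taking the longest matching term.

    term_types maps a term (standard name or synonym) to its entity type.
    Returns (term, entity_type, start_index) tuples in left-to-right order.
    """
    max_len = max((len(t) for t in term_types), default=0)
    found = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            if candidate in term_types:
                found.append((candidate, term_types[candidate], end - size))
                end -= size
                break
        else:
            end -= 1  # no term ends here; move one character to the left
    found.reverse()
    return found
```

Taking the longest match first means that "REBOUND TENDERNESS" is extracted as one entity even though its suffix "TENDERNESS" is also a term.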

Specifically, screening each candidate entity according to the end character of each candidate entity in the first and second groups of candidate entities, to form the first and second groups of symptom and sign candidate entities, respectively, comprises:

judging whether the end character of each candidate entity in the first and second groups of candidate entities is a preset non-symptom-and-sign term character; and

if the end character of a candidate entity is a preset non-symptom-and-sign term character, discarding that candidate entity.
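The end-character screening can be sketched in a few lines of Python. The function name is hypothetical, and the set of non-symptom-and-sign end characters used in the usage note is an invented placeholder; the description does not list the actual preset characters.

```python
def filter_by_end_char(candidates, non_symptom_end_chars):
    """Keep only candidates whose last character is not in the preset set."""
    return [c for c in candidates
            if c and c[-1] not in non_symptom_end_chars]
```

For instance, with the placeholder set {"炎", "癌"}, the candidate "肺炎" would be discarded while "咳嗽" and "头晕" would be kept. The same function is applied to each of the two candidate groups separately.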

A symptom and sign entity recognition device for multiple data sources, comprising:

a sentence acquiring unit, configured to obtain a sentence to be processed from raw data;

a single-character segmentation unit, configured to split the sentence to be processed into single characters and determine each character in the sentence;

an entity label sequence determination unit, configured to determine, according to a pre-trained CRF model, the entity label of each character in the sentence, and to determine the entity label sequence of the sentence;

a first candidate entity determination unit, configured to determine a first group of candidate entities of the sentence according to its entity label sequence;

a second candidate entity determination unit, configured to perform term segmentation on the sentence according to a preset symptom and sign term segmentation strategy and determine a second group of candidate entities;

a candidate entity screening unit, configured to screen each candidate entity according to the end character of each candidate entity in the first and second groups of candidate entities, to form a first group and a second group of symptom and sign candidate entities, respectively; and

a symptom and sign entity result determination unit, configured to determine, when the first group of symptom and sign candidate entities differs from the second group, the symptom and sign entity result from the two groups according to a preset decision strategy.

Specifically, the symptom and sign entity result determination unit comprises:

a term segmentation judgment module, configured to determine whether the sentence to be processed was segmented by a preset segmentation rule during term segmentation;

a symptom and sign entity result determining module, configured to select the candidate entities in the second group of symptom and sign candidate entities as the symptom and sign entity result when the sentence was segmented by a preset segmentation rule during term segmentation, and to select the candidate entities in the first group of symptom and sign candidate entities as the symptom and sign entity result when it was not;

the symptom and sign entity result determining module being further configured to select, of the first and second groups of symptom and sign candidate entities derived from the same original character string of the sentence, the group that has fewer entities and whose entities contain more characters as the symptom and sign entity result, the entity types in the symptom and sign entity result including symptom entities and sign entities; and

an entity type determining module, configured to select, when the entity types of corresponding entities in the first and second groups of symptom and sign candidate entities are inconsistent, the entity type of the entity in the second group of candidate entities as the entity type of that entity.

Specifically, the raw data handled by the sentence acquiring unit includes electronic medical record data, settlement statement data, clinical research data, medical knowledge base data and journal literature data.

Further, the entity label sequence determination unit comprises:

a CRF statistical feature value extraction module, configured to extract the CRF statistical feature values of each character in the sentence to be processed from a preset corpus, where the preset corpus records each sentence in the raw data, the entities in each sentence, and the position and entity class of each entity in its sentence, and the CRF statistical feature values include the segmentation feature value, part-of-speech feature value, character feature value, context feature value and term-list feature value of each character in each sentence;

a training model determining module, configured to determine a training model according to the CRF statistical feature values of each character in each sentence, the training model being:

$$p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{j=1}^{n} \sum_{i=1}^{m} \lambda_i f_i(y_j, y_{j-1}, x) \right)$$

an entity label computing module, configured to calculate the entity label y_j of each character in the sentence according to the training model; and

an entity label sequence determining module, configured to combine the entity labels of the characters to form the entity label sequence of the sentence; where x denotes the sentence to be processed; y_j denotes the entity label of the character at position j in the sentence; f_i(y_j, y_{j-1}, x) denotes the function value of segmentation feature i in the sentence; λ_i is a model parameter; m denotes the number of segmentation features; n denotes the number of character positions in the sentence; Z(x) denotes the normalization factor; and p(y | x) denotes the labeling probability of the characters in the sentence.

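In addition, the first candidate entity determination unit is specifically configured to determine the segmentation feature value corresponding to each character in the entity label sequence, and to determine the first group of candidate entities of the sentence to be processed according to those segmentation feature values.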

Further, the symptom and sign entity recognition device for multiple data sources also includes a corpus updating unit, configured to:

match the entities in the sentence to be processed whose uncertainty value is 1 against a preset symptom and sign ontology library and, when the match succeeds, save the entity labels of the successfully matched entities.

In addition, the second candidate entity determination unit comprises:

a preprocessing module, configured to convert the punctuation marks in the sentence to be processed to half-width and convert all English letters to upper case; and to call a preset table of non-medical terms, check whether the original character string of the sentence contains any term in that table, and delete from the sentence any such term that is present, to form a preprocessed sentence;

a symptom and sign ontology library matching module, configured to match the preprocessed sentence against a preset symptom and sign database using the reverse maximum matching principle, extract each character string in the preprocessed sentence that matches a standard term name or synonym in the symptom and sign database as a preliminary entity, and take the term type of that standard term name or synonym as the entity type of the preliminary entity; to match the original character string of the preprocessed sentence against a preset sentence pattern database; and, if that character string matches a sentence pattern in the preset sentence pattern database, to match it against a preset disease ontology database using the reverse maximum matching principle, extract each character string that matches a standard term name or synonym in the disease ontology database as a preliminary entity, and take the term type of that standard term name or synonym as the entity type of the preliminary entity; and

a second candidate entity determining module, configured to take the preliminary entities in the preprocessed sentence as the second group of candidate entities.

In addition, the candidate entity screening unit comprises:

a non-symptom-and-sign term character judgment module, configured to judge whether the end character of each candidate entity in the first and second groups of candidate entities is a preset non-symptom-and-sign term character; and

a candidate entity discarding module, configured to discard a candidate entity when its end character is a preset non-symptom-and-sign term character.

In the symptom and sign entity recognition method and device for multiple data sources provided by the embodiments of the present invention, a sentence to be processed is first obtained from raw data; the sentence is split into single characters; the entity label of each character is determined according to a pre-trained CRF model, giving the entity label sequence of the sentence; and a first group of candidate entities is determined from that sequence. Term segmentation is then performed on the sentence according to a preset symptom and sign term segmentation strategy to determine a second group of candidate entities. Each candidate entity in both groups is screened according to its end character, forming a first and a second group of symptom and sign candidate entities. If the two groups differ, the symptom and sign entity result is determined from them according to a preset decision strategy. By combining the conditional random field (CRF) statistical machine learning method with the term segmentation method, the present invention can recognize symptom and sign entities automatically, and overcomes the problems that current entity recognition draws on a rather narrow range of data sources and is inaccurate.

Brief description of the drawings

In order to explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a first flowchart of a symptom and sign entity recognition method for multiple data sources provided by an embodiment of the present invention;

Fig. 2 is part A of a second flowchart of a symptom and sign entity recognition method for multiple data sources provided by an embodiment of the present invention;

Fig. 3 is part B of the second flowchart of a symptom and sign entity recognition method for multiple data sources provided by an embodiment of the present invention;

Fig. 4 is a first structural diagram of a symptom and sign entity recognition device for multiple data sources provided by an embodiment of the present invention;

Fig. 5 is a second structural diagram of a symptom and sign entity recognition device for multiple data sources provided by an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

As shown in Fig. 1, an embodiment of the present invention provides a symptom and sign entity recognition method for multiple data sources, comprising:

Step 101: obtain a sentence to be processed from raw data.

Step 102: split the sentence to be processed into single characters, and determine each character in the sentence.

Step 103: determine, according to a pre-trained CRF model, the entity label of each character in the sentence, and determine the entity label sequence of the sentence.

Step 104: determine a first group of candidate entities of the sentence according to its entity label sequence.

Step 105: perform term segmentation on the sentence according to a preset symptom and sign term segmentation strategy, and determine a second group of candidate entities.

Step 106: screen each candidate entity according to the end character of each candidate entity in the first and second groups of candidate entities, to form a first group and a second group of symptom and sign candidate entities, respectively.

Step 107: if the first group of symptom and sign candidate entities differs from the second group, determine the symptom and sign entity result from the two groups according to a preset decision strategy.

In the symptom and sign entity recognition method for multiple data sources provided by the embodiment of the present invention, a sentence to be processed is first obtained from raw data; the sentence is split into single characters; the entity label of each character is determined according to a pre-trained CRF model, giving the entity label sequence of the sentence; and a first group of candidate entities is determined from that sequence. Term segmentation is then performed on the sentence according to a preset symptom and sign term segmentation strategy to determine a second group of candidate entities. Each candidate entity in both groups is screened according to its end character, forming a first and a second group of symptom and sign candidate entities. If the two groups differ, the symptom and sign entity result is determined from them according to a preset decision strategy. By combining the conditional random field (CRF) statistical machine learning method with the term segmentation method, the present invention can recognize symptom and sign entities automatically, and overcomes the problems that current entity recognition draws on a rather narrow range of data sources and is inaccurate.

To help those skilled in the art better understand the present invention, it is described below with reference to a specific example. As shown in Figs. 2 and 3, an embodiment of the present invention provides a symptom and sign entity recognition method for multiple data sources. (Fig. 2 shows part A of the method and Fig. 3 shows part B. The flowchart is divided into parts A and B only because the embodiment has many steps; the division has no practical significance. Together, parts A and B cover steps 201 to 220: Fig. 2 shows steps 201 to 211, and Fig. 3 shows steps 212 to 220.) The method comprises:

Step 201: obtain a sentence to be processed from the raw data.

Specifically, the raw data includes, but is not limited to, symptom and sign clinical treatment data, symptom and sign research and development experimental data, symptom and sign sales data, symptom and sign scientific literature data, and symptom and sign e-commerce data.

Step 202: split the sentence to be processed into single characters, and determine each character in the sentence.

For example, if the sentence to be processed is "一周前头晕加剧，伴咳嗽" ("dizziness worsened one week ago, accompanied by cough"), single-character segmentation yields the characters "一", "周", "前", "头", "晕", "加", "剧", "，", "伴", "咳", "嗽".
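In Python, single-character segmentation of this kind amounts to turning the string into a list of its characters. The sketch below assumes the example sentence reads "一周前头晕加剧，伴咳嗽" in the original Chinese, and the function name is hypothetical.

```python
def split_into_characters(sentence):
    """Return every character of the sentence, punctuation included."""
    return list(sentence)


characters = split_into_characters("一周前头晕加剧，伴咳嗽")
```

The punctuation mark is kept as its own item, so character positions in the list line up with positions in the original sentence.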

Step 203: extract the CRF statistical feature values of each character in the sentence to be processed from a preset corpus.

The preset corpus records each sentence in the raw data, the entities in each sentence, and the position and entity class of each entity in its sentence. The CRF statistical feature values include the segmentation feature value, part-of-speech feature value, character feature value, context feature value and term-list feature value of each character in each sentence.

Pre-set corpus can be marked in advance by artificially, such as sentence:

" main suit: dizzy aggravation before one week, with cough.

Physical examination: without tenderness and rebound tenderness, 4 beats/min of gurgling sound."

Then for symptom and sign class entity, can mark out respectively:

c=dizziness P=1:6 1:7 t=symptom

c=cough P=1:12 1:13 t=symptom

c=tenderness P=2:4 2:5 t=sign

c=rebound tenderness P=2:7 2:9 t=sign

c=gurgling sound P=2:11 2:13 t=sign

Here, c denotes the symptom and sign class entity; P denotes the line number, in the corpus, of the sentence containing the entity, together with the character positions within that sentence; and t denotes the symptom and sign entity class (in the present invention, the entity classes are symptom entity and sign entity).
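A minimal sketch of reading such annotation records follows. It assumes the three fields are separated by spaces, that line numbers start at 1, and that character positions start at 0 and are inclusive; the names `Annotation`, `parse_annotation` and `extract_span` are hypothetical, and the Chinese sentences are assumed reconstructions of the two example lines.

```python
import re
from typing import List, NamedTuple


class Annotation(NamedTuple):
    text: str         # the entity string (field c)
    line: int         # 1-based line number in the corpus (field P)
    start: int        # 0-based position of the first character
    end: int          # 0-based position of the last character (inclusive)
    entity_type: str  # "symptom" or "sign" (field t)


PATTERN = re.compile(r"c=(\S+)\s+P=(\d+):(\d+)\s+(\d+):(\d+)\s+t=(\S+)")


def parse_annotation(record: str) -> Annotation:
    match = PATTERN.fullmatch(record.strip())
    if match is None:
        raise ValueError(f"unrecognised annotation: {record!r}")
    text, line, start, end_line, end, entity_type = match.groups()
    if line != end_line:
        raise ValueError("an entity is assumed to stay within one line")
    return Annotation(text, int(line), int(start), int(end), entity_type)


def extract_span(corpus_lines: List[str], ann: Annotation) -> str:
    """Return the characters that the annotation points at."""
    return corpus_lines[ann.line - 1][ann.start:ann.end + 1]


# Assumed Chinese reconstruction of the two annotated example lines.
corpus_lines = ["主诉：一周前头晕加剧，伴咳嗽。", "体检：无压痛及反跳痛，肠鸣音4次/分。"]
records = [
    "c=头晕 P=1:6 1:7 t=symptom",
    "c=咳嗽 P=1:12 1:13 t=symptom",
    "c=压痛 P=2:4 2:5 t=sign",
    "c=反跳痛 P=2:7 2:9 t=sign",
    "c=肠鸣音 P=2:11 2:13 t=sign",
]
annotations = [parse_annotation(r) for r in records]
```

With these assumptions, every position pair in the example records points exactly at its entity string, which is a useful consistency check when building a corpus by hand.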

As for the CRF statistical feature values, take the sentence "no tenderness or rebound tenderness, gurgling sound 4 times/min", whose entity label sequence is "OBEOBIEOBIEOOOO". For the character "pain", for example, the CRF statistical features are described in Table 1 below:

Table 1:

Step 204: determine a training model according to the CRF statistical feature values of each character in each sentence.

Here, the training model is:

p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{j=1}^{n} \sum_{i=1}^{m} \lambda_i f_i(y_j, y_{j-1}, x) \right)

Step 205: calculate the entity label y_j of each character in the sentence to be processed according to the training model.

Here, x denotes the sentence to be processed; y_j denotes the entity label of the character at position j in the sentence to be processed; f_i(y_j, y_{j-1}, x) denotes the value of word segmentation feature i in the sentence to be processed; λ_i is a model parameter, and the parameters obtained by training maximize the sum of p(y|x) over the training sentences; m denotes the number of word segmentation features; n denotes the number of character positions in the sentence to be processed; Z(x) denotes the normalization factor; and p(y|x) denotes the labeling probability of the characters in the sentence to be processed.

For f_i(y_j, y_{j-1}, x): if y_j, y_{j-1} and x all appear together in the corpus, then f_i(y_j, y_{j-1}, x) = 1; otherwise it is 0.
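To make the roles of λ_i, f_i and Z(x) concrete, the toy sketch below evaluates a linear-chain CRF probability by brute-force enumeration of all label sequences. It is not the patent's trained model: the three indicator features, their weights, the `<start>` label and the extra position argument `j` passed to each feature are all assumptions chosen for illustration, and enumeration is only feasible for very short sentences.

```python
import itertools
import math
from typing import Callable, List, Sequence

# A feature receives the current label, the previous label, the sentence and
# the position j, and returns 1.0 or 0.0 (an indicator function).
Feature = Callable[[str, str, Sequence[str], int], float]

LABELS = ("B", "I", "E", "O")
START = "<start>"  # hypothetical label standing in for y_0


def score(y: Sequence[str], x: Sequence[str], features: List[Feature], weights: List[float]) -> float:
    """Sum over positions j and features i of lambda_i * f_i(y_j, y_{j-1}, x)."""
    total = 0.0
    for j in range(len(x)):
        previous = y[j - 1] if j > 0 else START
        for feature, weight in zip(features, weights):
            total += weight * feature(y[j], previous, x, j)
    return total


def probability(y: Sequence[str], x: Sequence[str], features: List[Feature], weights: List[float]) -> float:
    """p(y|x) = exp(score) / Z(x), with Z(x) computed by enumerating every labeling."""
    z = sum(math.exp(score(c, x, features, weights))
            for c in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(y, x, features, weights)) / z


# Illustrative features and weights (assumed, not learned from any corpus).
features: List[Feature] = [
    lambda cur, prev, x, j: float(cur == "O" and x[j] == "无"),
    lambda cur, prev, x, j: float(cur == "B" and x[j] == "压"),
    lambda cur, prev, x, j: float(prev == "B" and cur == "E"),
]
weights = [1.0, 2.0, 2.0]

x = list("无压痛")  # assumed Chinese for "no tenderness"
candidates = list(itertools.product(LABELS, repeat=len(x)))
best = max(candidates, key=lambda y: score(y, x, features, weights))
print(best, probability(best, x, features, weights))
```

A practical system would learn the weights from the annotated corpus and decode with the Viterbi algorithm rather than enumeration; the sketch only shows how the quantities in the formula fit together.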

Step 206: combine the entity labels of the characters to form the entity label sequence of the sentence to be processed.

For example, for the sentence "no tenderness or rebound tenderness, gurgling sound 4 times/min", the entity label sequence is "OBEOBIEOBIEOOOO".

Step 207: determine the word segmentation feature value corresponding to each character in the entity label sequence, and determine the first group of candidate entities of the sentence to be processed according to the word segmentation feature values.

For example, for "no tenderness or rebound tenderness, gurgling sound 4 times/min", the entity label sequence is "OBEOBIEOBIEOOOO"; therefore, the first group of candidate entities can be identified as "tenderness", "rebound tenderness", and "gurgling sound".
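The step from a label sequence to candidate entities can be sketched as follows, reading B as the first character of an entity, I as an inner character, E as the last character and O as outside any entity. That reading of the labels, the function name `decode_entities`, and the Chinese sentence (an assumed reconstruction of the 15-character example) are assumptions.

```python
from typing import List, Sequence, Tuple


def decode_entities(chars: Sequence[str], tags: Sequence[str]) -> List[Tuple[str, int, int]]:
    """Return (entity text, start position, end position) for every B...E span."""
    if len(chars) != len(tags):
        raise ValueError("one tag is required per character")
    entities = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            start = i                      # open a new span
        elif tag == "E" and start is not None:
            entities.append(("".join(chars[start:i + 1]), start, i))
            start = None                   # close the span
        elif tag == "O":
            start = None                   # an unfinished span is discarded
    return entities


sentence = "无压痛及反跳痛，肠鸣音4次/分"  # assumed Chinese form of the example
tags = "OBEOBIEOBIEOOOO"
first_group = decode_entities(list(sentence), tags)
print(first_group)
```

The start and end positions returned here are 0-based character positions, so "tenderness" comes out at positions 1 to 2.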

Step 208: convert the punctuation marks in the sentence to be processed to half-width, and convert all English letters to uppercase.

Step 209: call a preset non-medical term table, check whether the original character string of the sentence to be processed contains terms from the non-medical term table, and delete any such terms found in the sentence, forming the preprocessed sentence to be processed.

Step 210: match the preprocessed sentence to be processed against a preset symptom and sign database using reverse maximum matching; extract each character string in the preprocessed sentence that matches a standard term name or synonym in the symptom and sign database as a preliminary entity, and take the term type corresponding to that standard term name or synonym as the entity type of the preliminary entity.

Here, the preset symptom and sign database may include a symptom and sign data table as shown in Table 2 below. The table can be built by extending the international ICD-10 and authoritative medical reference books. It captures synonym relations between words and concept-category relations between words, which are embodied in the table as standard term names, synonyms, hypernyms, and so on.

Table 2:

Standard term name	Synonym	Hypernym name	Term type
Pain			Symptom
Headache		Pain	Symptom
Tenderness			Sign
Blood pressure			Sign
Heart rate			Sign
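Reverse maximum matching against such a term table can be sketched as follows: scanning starts at the end of the sentence, the longest dictionary entry ending at the current position is taken, and the scan moves left. The lexicon below uses assumed Chinese forms of the five terms in Table 2; the function name and the tuple layout are hypothetical.

```python
from typing import Dict, List, Tuple

# surface form -> (standard term name, term type); Chinese forms are assumed.
LEXICON: Dict[str, Tuple[str, str]] = {
    "疼痛": ("疼痛", "symptom"),   # pain
    "头痛": ("头痛", "symptom"),   # headache
    "压痛": ("压痛", "sign"),      # tenderness
    "血压": ("血压", "sign"),      # blood pressure
    "心率": ("心率", "sign"),      # heart rate
}


def reverse_maximum_match(sentence: str, lexicon: Dict[str, Tuple[str, str]]) -> List[Tuple[str, str, str, int]]:
    """Return (matched text, standard name, term type, start position), left to right."""
    max_len = max((len(term) for term in lexicon), default=0)
    found = []
    end = len(sentence)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):   # longest candidate first
            piece = sentence[end - size:end]
            if piece in lexicon:
                standard, term_type = lexicon[piece]
                found.append((piece, standard, term_type, end - size))
                end -= size
                break
        else:
            end -= 1                                    # no term ends here
    found.reverse()
    return found


preliminary = reverse_maximum_match("无压痛,血压正常", LEXICON)
print(preliminary)
```

Synonyms would be handled by adding each synonym as a further key that maps to the same standard term name and term type.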

Step 211: match the original character string of the preprocessed sentence to be processed against a preset sentence pattern database.

Step 212: if the original character string of the preprocessed sentence to be processed matches a sentence pattern format in the preset sentence pattern database, match the original character string of the preprocessed sentence against a preset disease ontology database using reverse maximum matching; extract each character string that matches a standard term name or synonym in the disease ontology database as a preliminary entity, and take the term type corresponding to that standard term name or synonym as the entity type of the preliminary entity.

Steps 211 and 212 are included because the remaining character strings may still contain entities that have not been extracted, so further judgment and extraction are needed.

The preset sentence pattern database here may include a sentence pattern data table as shown in Table 3 below:

Table 3:

Step 213: take each preliminary entity in the preprocessed sentence to be processed as the second group of candidate entities.

Through the specific rules of steps 210 and 212 above, the final second group of candidate entities is formed.

Step 214: judge whether the end character of each candidate entity in the first group of candidate entities and the second group of candidate entities is a preset non-symptom-and-sign term character.

The preset non-symptom-and-sign term characters may be, for example, "medicine", "operation", "surgery", "examination", and so on.

Step 215: if the end character of a candidate entity is a preset non-symptom-and-sign term character, discard that candidate entity.
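The screening in steps 214 and 215 amounts to a suffix test on each candidate. In the sketch below the suffix list is an assumed Chinese rendering of the four examples above, and the candidate strings are made up for illustration.

```python
from typing import List, Sequence, Tuple

Candidate = Tuple[str, str]  # (entity text, entity type)

# Assumed Chinese forms of "medicine", "operation", "surgery", "examination".
NON_SYMPTOM_SIGN_ENDINGS = ("药", "手术", "术", "检查")


def filter_candidates(candidates: Sequence[Candidate],
                      endings: Sequence[str] = NON_SYMPTOM_SIGN_ENDINGS) -> List[Candidate]:
    """Drop every candidate whose text ends with a non-symptom-and-sign ending."""
    return [c for c in candidates if not c[0].endswith(tuple(endings))]


candidates = [("头晕", "symptom"), ("阑尾切除术", "symptom"), ("血常规检查", "sign")]
print(filter_candidates(candidates))
```

The same function would be applied separately to the first group and to the second group of candidate entities.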

After step 215, step 216 or step 219 is executed.

Step 216: when the first group of symptom and sign class candidate entities and the second group of symptom and sign class candidate entities are not identical, determine whether the sentence to be processed was segmented by the preset segmentation rules during term segmentation.

That is, whether it was processed by steps 211 and 212 above.

After step 216, step 217 or step 218 is executed.

Step 217: if the sentence to be processed was segmented by the preset segmentation rules during term segmentation, select the candidate entities in the second group of symptom and sign class candidate entities as the symptom and sign class entity result.

Step 218: if the sentence to be processed was not segmented by the preset segmentation rules during term segmentation, select the candidate entities in the first group of symptom and sign class candidate entities as the symptom and sign class entity result.

Step 219: when the first group of symptom and sign class candidate entities and the second group of symptom and sign class candidate entities are not identical, compare the two groups derived from the original character string of the same sentence to be processed, and take the group that has fewer entities and whose entities contain more characters as the symptom and sign class entity result.

For example, the raw data is "presenting with abdominal distention".

The first group of symptom and sign class candidate entities is "abdominal distention [symptom]";

The second group of symptom and sign class candidate entities is "bulge [symptom]";

Then the final result is "abdominal distention [symptom]".
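The two alternative determination strategies can be sketched as below. How to rank a group that has fewer entities but also fewer characters is not stated, so the sketch assumes the entity count is compared first and the total character count second; the function names and the Chinese strings are likewise assumptions.

```python
from typing import List, Tuple

Candidate = Tuple[str, str]  # (entity text, entity type)


def select_by_rule_flag(first: List[Candidate], second: List[Candidate],
                        segmented_by_rules: bool) -> List[Candidate]:
    """Steps 216 to 218: prefer the second group only if the segmentation rules were applied."""
    if first == second:
        return first
    return second if segmented_by_rules else first


def select_by_size(first: List[Candidate], second: List[Candidate]) -> List[Candidate]:
    """Step 219: prefer the group with fewer entities, then the one with more characters."""
    if first == second:
        return first

    def key(group: List[Candidate]) -> Tuple[int, int]:
        return (len(group), -sum(len(text) for text, _ in group))

    return min(first, second, key=key)   # on a full tie the first group is kept


# Assumed Chinese forms of "abdominal distention" and "bulge".
first_group = [("腹部膨隆", "symptom")]
second_group = [("膨隆", "symptom")]
print(select_by_size(first_group, second_group))
```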

After step 217, step 218 or step 219, step 220 is executed.

Step 220: when the entity types of corresponding entities in the first group of symptom and sign class candidate entities and the second group of symptom and sign class candidate entities are inconsistent, select the entity type of the entity in the second group of candidate entities as the entity type of the corresponding entity.

For example, the raw data is "followed by signs of peritonitis".

The first group of symptom and sign class candidate entities is "peritonitis [disease]";

The second group of symptom and sign class candidate entities is "peritonitis [symptom]";

Then the final result is "peritonitis [symptom]".
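Step 220 can be sketched as a lookup that overrides the type of each result entity with the type found in the second group. Treating two entities as "corresponding" when their text is identical is an assumption, as is the function name.

```python
from typing import List, Tuple

Candidate = Tuple[str, str]  # (entity text, entity type)


def resolve_types(result: List[Candidate], second_group: List[Candidate]) -> List[Candidate]:
    """Where the same entity text has a type in the second group, that type wins."""
    second_types = dict(second_group)
    return [(text, second_types.get(text, entity_type)) for text, entity_type in result]


first_group = [("peritonitis", "disease")]
second_group = [("peritonitis", "symptom")]
print(resolve_types(first_group, second_group))
```

Entities that have no counterpart in the second group keep their original type.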

Through steps 201 to 220 above, the symptom and sign class entity recognition result is finally obtained.

In addition, to update the corpus, new sentence pattern features can be found through manual summarization, annotated manually, and added to the corpus. Furthermore, when the sentence to be processed is not annotated in the preset corpus, the uncertainty value of each entity in the sentence to be processed can be determined by a formula over the following quantities: IE_k is the uncertainty value of the k-th entity; k_start is the start position of the entity labels of the k-th entity; k_end is the end position of the entity labels of the k-th entity; and the probability term is the probability of the j-th entity label for the character at position s in the sentence to be processed.

For example, for "no tenderness or rebound tenderness, gurgling sound 4 times/min", the entity label sequence is "OBEOBIEOBIEOOOO" and the position sequence is "0 1 2 3 4 5 6 7 8 9 10 11 12 13 14". The entity "tenderness" occupies positions "1 2"; therefore, k_start is 1 and k_end is 2.

Entities in the sentence to be processed whose uncertainty value is 1 are matched against a preset symptom and sign ontology library; if the match succeeds, the entity labels of the matched entity are saved.

The prediction confidence of the sentence to be processed and the proportion of dictionary-matched entities are then determined.

A sentence to be processed whose prediction confidence is greater than a preset confidence threshold and whose proportion of dictionary-matched entities is greater than a preset proportion threshold is added to the corpus, so as to update the corpus.

Here, the prediction confidence is the product of the labeling probabilities of the characters in the sentence to be processed.

The proportion of dictionary-matched entities is C/B, where C is the number of entities, among all entities predicted in the sentence to be processed, that appear in the preset dictionary, and B is the total number of entities predicted in the sentence to be processed.

It can be seen that, by updating the corpus, the corpus data needed for entity recognition can be continually enriched through a semi-supervised self-learning method, which addresses the problem of insufficient and incomplete corpus data.
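The corpus update decision can be sketched as two measurements and a threshold test. The sketch reads the proportion of dictionary-matched entities as C divided by B, which follows from the definitions of C and B but is an assumption about the exact formula; the threshold values, function names and sample probabilities are hypothetical.

```python
import math
from typing import Collection, Sequence


def prediction_confidence(char_probabilities: Sequence[float]) -> float:
    """Product of the labeling probabilities of all characters in the sentence."""
    return math.prod(char_probabilities)


def dictionary_match_ratio(predicted_entities: Sequence[str], dictionary: Collection[str]) -> float:
    """C / B: share of predicted entities that also appear in the preset dictionary."""
    if not predicted_entities:
        return 0.0
    matched = sum(1 for entity in predicted_entities if entity in dictionary)
    return matched / len(predicted_entities)


def should_add_to_corpus(char_probabilities: Sequence[float],
                         predicted_entities: Sequence[str],
                         dictionary: Collection[str],
                         confidence_threshold: float,
                         ratio_threshold: float) -> bool:
    """Both values must exceed their thresholds for the sentence to be added."""
    return (prediction_confidence(char_probabilities) > confidence_threshold
            and dictionary_match_ratio(predicted_entities, dictionary) > ratio_threshold)


dictionary = {"压痛", "反跳痛", "肠鸣音"}   # assumed dictionary entries
decision = should_add_to_corpus([0.99, 0.98, 0.97], ["压痛", "反跳痛"], dictionary, 0.9, 0.8)
print(decision)
```

Because the confidence is a product over characters, it shrinks with sentence length, so a fixed threshold favours short sentences; whether to compensate for that is a design choice the text leaves open.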

Corresponding to the method embodiments shown in Fig. 1, Fig. 2 and Fig. 3 above, and as shown in Fig. 4, an embodiment of the present invention provides a symptom and sign class entity recognition device for multiple data sources, comprising:

A to-be-processed sentence acquiring unit 31, configured to obtain the sentence to be processed from the raw data.

A single-character segmentation unit 32, configured to perform single-character segmentation on the sentence to be processed and determine each character in the sentence to be processed.

An entity label sequence determination unit 33, configured to determine, according to a CRF training model trained in advance, the entity label of each character in the sentence to be processed, and to determine the entity label sequence of the sentence to be processed.

A first group candidate entity determination unit 34, configured to determine the first group of candidate entities of the sentence to be processed according to the entity label sequence of the sentence to be processed.

A second group candidate entity determination unit 35, configured to perform term segmentation on the sentence to be processed according to a preset symptom and sign class term segmentation strategy, and determine the second group of candidate entities.

A candidate entity screening unit 36, configured to screen each candidate entity according to the end character of each candidate entity in the first group of candidate entities and the second group of candidate entities, forming the first group of symptom and sign class candidate entities and the second group of symptom and sign class candidate entities respectively.

A symptom and sign class entity result determination unit 37, configured to, when the first group of symptom and sign class candidate entities and the second group of symptom and sign class candidate entities are not identical, determine the symptom and sign class entity result from the two groups according to a preset determination strategy.

Specifically, as shown in Fig. 5, the symptom and sign class entity result determination unit 37 comprises:

A term segmentation judgment module 371, configured to determine whether the sentence to be processed was segmented by the preset segmentation rules during term segmentation.

A symptom and sign class entity result determining module 372, configured to: when the sentence to be processed was segmented by the preset segmentation rules during term segmentation, select the candidate entities in the second group of symptom and sign class candidate entities as the symptom and sign class entity result; and when the sentence to be processed was not segmented by the preset segmentation rules during term segmentation, select the candidate entities in the first group of symptom and sign class candidate entities as the symptom and sign class entity result.

The symptom and sign class entity result determining module 372 is further configured to compare the first group of symptom and sign class candidate entities and the second group of symptom and sign class candidate entities derived from the original character string of the same sentence to be processed, and take the group that has fewer entities and whose entities contain more characters as the symptom and sign class entity result; the entity types in the symptom and sign class entity result include symptom entity and sign entity.

An entity type determining module 373, configured to, when the entity types of corresponding entities in the first group of symptom and sign class candidate entities and the second group of symptom and sign class candidate entities are inconsistent, select the entity type of the entity in the second group of candidate entities as the entity type of the corresponding entity.

Specifically, the raw data in the to-be-processed sentence acquiring unit 31 includes electronic medical record data, settlement statement data, clinical research data, medical knowledge base data, and journal literature data.

Further, as shown in Fig. 5, the entity label sequence determination unit 33 comprises:

A CRF statistical feature value extraction module 331, configured to extract the CRF statistical feature values of each character in the sentence to be processed from a preset corpus. The preset corpus records each sentence in the raw data, the entities in each sentence, and the position and entity class of each entity within its sentence. The CRF statistical feature values include, for each character in each sentence, the word segmentation feature value, part-of-speech feature value, character feature value, context feature value, and nomenclature feature value.

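A training model determining module 332, configured to determine a training model according to the CRF statistical feature values of each character in each sentence; the training model is:

p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{j=1}^{n} \sum_{i=1}^{m} \lambda_i f_i(y_j, y_{j-1}, x) \right)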

An entity label computing module 333, configured to calculate the entity label y_j of each character in the sentence to be processed according to the training model.

An entity label sequence determining module 334, configured to combine the entity labels of the characters to form the entity label sequence of the sentence to be processed. Here, x denotes the sentence to be processed; y_j denotes the entity label of the character at position j in the sentence to be processed; f_i(y_j, y_{j-1}, x) denotes the value of word segmentation feature i in the sentence to be processed; λ_i is a model parameter; m denotes the number of word segmentation features; n denotes the number of character positions in the sentence to be processed; Z(x) denotes the normalization factor; and p(y|x) denotes the labeling probability of the characters in the sentence to be processed.

In addition, the first group candidate entity determination unit 34 is specifically configured to:

Further, as shown in Fig. 5, the symptom and sign class entity recognition device for multiple data sources further comprises a corpus updating unit 38, configured to:

Determine the uncertainty value of each entity in the sentence to be processed, where IE_k is the uncertainty value of the k-th entity; k_start is the start position of the entity labels of the k-th entity; k_end is the end position of the entity labels of the k-th entity; and the probability term is the probability of the j-th entity label for the character at position s in the sentence to be processed.

Match entities in the sentence to be processed whose uncertainty value is 1 against a preset symptom and sign ontology library and, when the match succeeds, save the entity labels of the matched entity.

In addition, as shown in Fig. 5, the second group candidate entity determination unit 35 comprises:

A preprocessing module 351, configured to convert the punctuation marks in the sentence to be processed to half-width and convert all English letters to uppercase; and to call a preset non-medical term table, check whether the original character string of the sentence to be processed contains terms from the non-medical term table, and delete any such terms found in the sentence, forming the preprocessed sentence to be processed.

A symptom and sign ontology library matching module 352, configured to: match the preprocessed sentence to be processed against a preset symptom and sign database using reverse maximum matching, extract each character string in the preprocessed sentence that matches a standard term name or synonym in the symptom and sign database as a preliminary entity, and take the term type corresponding to that standard term name or synonym as the entity type of the preliminary entity; match the original character string of the preprocessed sentence to be processed against a preset sentence pattern database; and, if the original character string of the preprocessed sentence matches a sentence pattern format in the preset sentence pattern database, match the original character string of the preprocessed sentence against a preset disease ontology database using reverse maximum matching, extract each character string that matches a standard term name or synonym in the disease ontology database as a preliminary entity, and take the term type corresponding to that standard term name or synonym as the entity type of the preliminary entity.

A second group candidate entity determining module 353, configured to take each preliminary entity in the preprocessed sentence to be processed as the second group of candidate entities.

In addition, as shown in Fig. 5, the candidate entity screening unit 36 comprises:

A non-symptom-and-sign term character judgment module 361, configured to judge whether the end character of each candidate entity in the first group of candidate entities and the second group of candidate entities is a preset non-symptom-and-sign term character.

A candidate entity discarding module 362, configured to discard a candidate entity when its end character is a preset non-symptom-and-sign term character.

It is worth noting that, for the specific implementation of the symptom and sign class entity recognition device for multiple data sources provided by the embodiment of the present invention, reference may be made to the method embodiments above; details are not repeated here.

The symptom and sign class entity recognition device for multiple data sources provided by the embodiment of the present invention first obtains the sentence to be processed from the raw data; performs single-character segmentation on the sentence to be processed and determines each character in it; determines, according to a CRF training model trained in advance, the entity label of each character in the sentence to be processed and the entity label sequence of the sentence; and determines the first group of candidate entities of the sentence according to its entity label sequence. Then, according to a preset symptom and sign class term segmentation strategy, it performs term segmentation on the sentence to be processed and determines the second group of candidate entities. Each candidate entity is screened according to the end character of each candidate entity in the first and second groups of candidate entities, forming the first group of symptom and sign class candidate entities and the second group of symptom and sign class candidate entities respectively. If the two groups are not identical, the symptom and sign class entity result is determined from them according to a preset determination strategy. By combining the conditional random field (CRF) statistical machine learning method with the term segmentation method, the present invention can automatically identify symptom and sign class entities, overcoming the problems that current entity recognition relies on relatively homogeneous data sources and is inaccurate.

Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

The present invention is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

Specific embodiments are used herein to describe the principles and implementation of the present invention. The description of the above embodiments is only intended to help understand the method of the present invention and its core idea. Meanwhile, for those of ordinary skill in the art, changes may be made to the specific implementation and the scope of application in accordance with the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims

1. A symptom and sign class entity recognition method for multiple data sources, characterized by comprising:

obtaining the sentence to be processed from the raw data;

determining, according to a CRF training model trained in advance, the entity label of each character in the sentence to be processed, and determining the entity label sequence of the sentence to be processed;

performing term segmentation on the sentence to be processed according to a preset symptom and sign class term segmentation strategy, and determining the second group of candidate entities;

screening each candidate entity according to the end character of each candidate entity in the first group of candidate entities and the second group of candidate entities, forming the first group of symptom and sign class candidate entities and the second group of symptom and sign class candidate entities respectively;

if the first group of symptom and sign class candidate entities and the second group of symptom and sign class candidate entities are not identical, determining the symptom and sign class entity result from the first group of symptom and sign class candidate entities and the second group of symptom and sign class candidate entities according to a preset determination strategy;

wherein determining the symptom and sign class entity result from the first group of symptom and sign class candidate entities and the second group of symptom and sign class candidate entities according to the preset determination strategy comprises:

if the sentence to be processed was segmented by the preset segmentation rules during term segmentation, selecting the candidate entities in the second group of symptom and sign class candidate entities as the symptom and sign class entity result;

if the sentence to be processed was not segmented by the preset segmentation rules during term segmentation, selecting the candidate entities in the first group of symptom and sign class candidate entities as the symptom and sign class entity result;

or, comparing the first group of symptom and sign class candidate entities and the second group of symptom and sign class candidate entities derived from the original character string of the same sentence to be processed, and taking the group that has fewer entities and whose entities contain more characters as the symptom and sign class entity result;

when the entity types of corresponding entities in the first group of symptom and sign class candidate entities and the second group of symptom and sign class candidate entities are inconsistent, selecting the entity type of the entity in the second group of candidate entities as the entity type of the corresponding entity;

wherein performing term segmentation on the sentence to be processed according to the preset symptom and sign class term segmentation strategy and determining the second group of candidate entities comprises:

calling a preset non-medical term table, checking whether the original character string of the sentence to be processed contains terms from the non-medical term table, and deleting any such terms found in the sentence to be processed, forming the preprocessed sentence to be processed;

matching the preprocessed sentence to be processed against a preset symptom and sign database using reverse maximum matching, extracting each character string in the preprocessed sentence to be processed that matches a standard term name or synonym in the symptom and sign database as a preliminary entity, and taking the term type corresponding to that standard term name or synonym as the entity type of the preliminary entity;

if the original character string of the preprocessed sentence to be processed matches a sentence pattern format in a preset sentence pattern database, matching the original character string of the preprocessed sentence to be processed against a preset disease ontology database using reverse maximum matching, extracting each character string that matches a standard term name or synonym in the disease ontology database as a preliminary entity, and taking the term type corresponding to that standard term name or synonym as the entity type of the preliminary entity;

2. The symptom and sign class entity recognition method for multiple data sources according to claim 1, characterized in that the raw data includes electronic medical record data, settlement statement data, clinical research data, medical knowledge base data, and journal literature data.

3. The symptom and sign class entity recognition method for multiple data sources according to claim 2, characterized in that determining, according to the CRF training model trained in advance, the entity label of each character in the sentence to be processed, and determining the entity label sequence of the sentence to be processed, comprises:

extracting the CRF statistical feature values of each character in the sentence to be processed from a preset corpus; the preset corpus records each sentence in the raw data, the entities in each sentence, and the position and entity class of each entity within its sentence; the CRF statistical feature values include, for each character in each sentence, the word segmentation feature value, part-of-speech feature value, character feature value, context feature value, and nomenclature feature value;

combining the entity labels of the characters to form the entity label sequence of the sentence to be processed; wherein x denotes the sentence to be processed; y_j denotes the entity label of the character at position j in the sentence to be processed; f_i(y_j, y_{j-1}, x) denotes the value of word segmentation feature i in the sentence to be processed; λ_i is a model parameter; m denotes the number of word segmentation features; n denotes the number of character positions in the sentence to be processed; Z(x) denotes the normalization factor; and p(y|x) denotes the labeling probability of the characters in the sentence to be processed.

4. The symptom and sign class entity recognition method for multiple data sources according to claim 3, characterized in that determining the first group of candidate entities of the sentence to be processed according to the entity label sequence of the sentence to be processed comprises:

determining the word segmentation feature value corresponding to each character in the entity label sequence, and determining the first group of candidate entities of the sentence to be processed according to the word segmentation feature values.

5. The symptom and sign entity recognition method for multiple data sources according to claim 4, characterized by further comprising:

Matching the entities that were not recognized in the sentence to be processed against a preset symptom and sign ontology library and, if the match succeeds, saving the entity label of each successfully matched entity;

Adding to the corpus each sentence to be processed whose prediction confidence is greater than a preset confidence threshold and whose proportion of dictionary-matched entities is greater than a preset proportion threshold, so as to update the corpus;

The proportion of dictionary-matched entities is: C / B, where C is the number of entities predicted in the sentence to be processed that appear in the preset dictionary, and B is the total number of entities predicted in the sentence to be processed.
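A minimal sketch of the corpus-update filter follows, assuming the proportion is the ratio C / B that the definitions of C and B suggest. The function names and the default threshold values (0.9 and 0.5) are hypothetical; the claim only says that both thresholds are preset.

```python
from typing import Iterable, Set


def dictionary_match_ratio(predicted: Iterable[str], dictionary: Set[str]) -> float:
    """Return C / B: predicted entities found in the dictionary over all predicted entities."""
    predicted = list(predicted)
    if not predicted:  # no predicted entities, so there is nothing to measure
        return 0.0
    c = sum(1 for entity in predicted if entity in dictionary)
    return c / len(predicted)


def should_add_to_corpus(confidence: float,
                         predicted: Iterable[str],
                         dictionary: Set[str],
                         conf_threshold: float = 0.9,     # hypothetical value
                         ratio_threshold: float = 0.5) -> bool:  # hypothetical value
    """A sentence is added only when both quantities strictly exceed their thresholds."""
    ratio = dictionary_match_ratio(predicted, dictionary)
    return confidence > conf_threshold and ratio > ratio_threshold
```

Both comparisons are strict because the claim says "greater than"; a sentence sitting exactly on a threshold is not added.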

6. The symptom and sign entity recognition method for multiple data sources according to claim 5, characterized in that screening each candidate entity according to the end character of each candidate entity in the first group of candidate entities and the second group of candidate entities, so as to form the first group of symptom and sign candidate entities and the second group of symptom and sign candidate entities respectively, comprises:

Judging whether the end character of each candidate entity in the first group of candidate entities and the second group of candidate entities is a preset non-symptom-and-sign term character;

If the end character of a candidate entity is a preset non-symptom-and-sign term character, discarding that candidate entity.
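The screening step can be sketched as a filter on the last character of each candidate. The example end character 术 (as in the name of a surgical procedure) is a hypothetical entry; the claim does not list the preset characters.

```python
from typing import Iterable, List, Set


def screen_candidates(candidates: Iterable[str], non_symptom_end_chars: Set[str]) -> List[str]:
    """Keep only candidates whose last character is not a preset non-symptom-and-sign character."""
    return [c for c in candidates if c and c[-1] not in non_symptom_end_chars]


# Hypothetical preset: terms ending in 术 usually name procedures, not symptoms or signs.
kept = screen_candidates(["咳嗽", "阑尾切除术", "发热"], {"术"})
```

The same function would be applied once to the first group and once to the second group of candidate entities.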

7. A symptom and sign entity recognition device for multiple data sources, characterized by comprising:

A single-character segmentation unit, configured to perform single-character segmentation on the sentence to be processed and determine each character in the sentence to be processed;

An entity label sequence determination unit, configured to determine, according to the pre-trained CRF model, the entity label of each character in the sentence to be processed, and to determine the entity label sequence of the sentence to be processed;

A first group candidate entity determination unit, configured to determine the first group of candidate entities of the sentence to be processed according to the entity label sequence of the sentence to be processed;

A second group candidate entity determination unit, configured to perform term segmentation on the sentence to be processed according to a preset symptom and sign term segmentation strategy and determine the second group of candidate entities;

A candidate entity screening unit, configured to screen each candidate entity according to the end character of each candidate entity in the first group of candidate entities and the second group of candidate entities, so as to form the first group of symptom and sign candidate entities and the second group of symptom and sign candidate entities respectively;

A symptom and sign entity result determination unit, configured to, when the first group of symptom and sign candidate entities and the second group of symptom and sign candidate entities are not identical, determine the symptom and sign entity result from the first group of symptom and sign candidate entities and the second group of symptom and sign candidate entities according to a preset determination strategy;

The symptom and sign entity result determination unit comprises:

A term segmentation judgment module, configured to determine whether the sentence to be processed was segmented according to preset segmentation rules during term segmentation;

A symptom and sign entity result determination module, configured to: when the sentence to be processed was segmented according to the preset segmentation rules during term segmentation, select the candidate entities in the second group of symptom and sign candidate entities as the symptom and sign entity result; and when the sentence to be processed was not segmented according to the preset segmentation rules during term segmentation, select the candidate entities in the first group of symptom and sign candidate entities as the symptom and sign entity result; or, alternatively, configured to determine, between the first group of symptom and sign candidate entities and the second group of symptom and sign candidate entities derived from the original character string of the same sentence to be processed, the group that has fewer entities and whose entities contain more characters as the symptom and sign entity result; the entity types in the symptom and sign entity result include symptom entities and sign entities;

An entity type determination module, configured to, when the entity types of a corresponding entity in the first group of symptom and sign candidate entities and in the second group of symptom and sign candidate entities are inconsistent, select the entity type of the entity in the second group of candidate entities as the entity type of the corresponding entity;
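The three modules of the result determination unit can be sketched as follows. The `(text, type)` tuple representation, the function names, and the tie-break in `select_by_granularity` are assumptions; in particular, the claim does not say how "fewer entities" and "more characters" are weighed against each other, so this sketch compares the entity count first and the total character count second.

```python
from typing import List, Tuple

Entity = Tuple[str, str]  # (entity text, entity type); a hypothetical representation


def select_by_rule(first: List[Entity], second: List[Entity], cut_by_rule: bool) -> List[Entity]:
    """Prefer the second (dictionary-based) group when a preset segmentation rule was applied."""
    if first == second:
        return first
    return second if cut_by_rule else first


def select_by_granularity(first: List[Entity], second: List[Entity]) -> List[Entity]:
    """Alternative strategy: fewer entities first, then more characters in total."""
    def key(group: List[Entity]):
        return (len(group), -sum(len(text) for text, _ in group))
    # min() returns the first group on a full tie; that tie-break is an assumption.
    return min([first, second], key=key)


def resolve_types(selected: List[Entity], second: List[Entity]) -> List[Entity]:
    """When an entity also appears in the second group, use the second group's type."""
    second_types = dict(second)
    return [(text, second_types.get(text, etype)) for text, etype in selected]


first = [("上腹", "SIGN"), ("疼痛", "SYMPTOM")]
second = [("上腹疼痛", "SYMPTOM")]
result = select_by_granularity(first, second)
```

In this toy case the second group wins under the alternative strategy because it covers the same four characters with a single entity.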

The second group candidate entity determination unit comprises:

A preprocessing module, configured to convert the punctuation marks in the sentence to be processed to half-width and unify English letters to uppercase; and to call a preset non-medical term table, check whether the original character string of the sentence to be processed contains terms from the non-medical term table, and delete any such terms present in the sentence to be processed, so as to form the preprocessed sentence to be processed;

A symptom and sign ontology library matching module, configured to: match the preprocessed sentence to be processed against a preset symptom and sign database using the reverse maximum matching principle, extract as preliminary entities the character strings in the preprocessed sentence to be processed that match a standard term name or synonym in the symptom and sign database, and take the term type corresponding to that standard term name or synonym as the entity type of the preliminary entity; match the original character string of the preprocessed sentence to be processed against a preset sentence pattern database; and, if the original character string of the preprocessed sentence to be processed matches a sentence pattern format in the preset sentence pattern database, match the original character string of the preprocessed sentence to be processed against a preset disease ontology database using the reverse maximum matching principle, extract as preliminary entities the character strings that match a standard term name or synonym in the disease ontology database, and take the term type corresponding to that standard term name or synonym as the entity type of the preliminary entity;

A second group candidate entity determination module, configured to take each preliminary entity in the preprocessed sentence to be processed as the second group of candidate entities.
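The preprocessing and dictionary-matching modules of the second group candidate entity determination unit can be sketched as below. The function names, the toy dictionary, and the non-medical term 患者自述 are hypothetical; the sentence-pattern check and the disease ontology lookup are not shown. The full-width conversion here covers the Unicode range U+FF01 to U+FF5E plus the ideographic space, which also converts full-width letters and digits; that is a simplification, not something the claim specifies.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Full-width forms U+FF01..U+FF5E map to ASCII by subtracting 0xFEE0.
_FULL_TO_HALF = {code: code - 0xFEE0 for code in range(0xFF01, 0xFF5F)}
_FULL_TO_HALF[0x3000] = 0x20  # ideographic space -> ASCII space


def preprocess(sentence: str, non_medical_terms: Iterable[str]) -> str:
    """Convert to half-width, uppercase English letters, then delete non-medical terms."""
    text = sentence.translate(_FULL_TO_HALF).upper()
    # Delete longer terms first so a short term cannot break up a longer one.
    for term in sorted(non_medical_terms, key=len, reverse=True):
        text = text.replace(term, "")
    return text


def reverse_maximum_match(sentence: str,
                          term_types: Dict[str, str],
                          max_len: Optional[int] = None) -> List[Tuple[str, str, int]]:
    """Scan right to left, always trying the longest window that ends at the cursor."""
    if max_len is None:
        max_len = max((len(t) for t in term_types), default=0)
    found = []
    end = len(sentence)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            piece = sentence[end - size:end]
            if piece in term_types:
                found.append((piece, term_types[piece], end - size))
                end -= size
                break
        else:
            end -= 1  # no term ends here; move the cursor one character left
    found.reverse()  # restore left-to-right order
    return found


terms = {"咳嗽": "SYMPTOM", "发热": "SYMPTOM"}  # hypothetical symptom and sign dictionary
clean = preprocess("患者自述咳嗽，伴发热ｂｐ", ["患者自述"])
second_group = reverse_maximum_match(clean, terms)
```

In a fuller version, `term_types` would hold both standard term names and their synonyms as keys, each mapped to the term type of its standard term.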

8. The symptom and sign entity recognition device for multiple data sources according to claim 7, characterized in that the original data in the to-be-processed sentence acquisition unit includes electronic medical record data, settlement form data, clinical research data, medical knowledge base data, and journal literature data.

9. The symptom and sign entity recognition device for multiple data sources according to claim 8, characterized in that the entity label sequence determination unit comprises:

A CRF statistical feature value extraction module, configured to extract the CRF statistical feature values of each character in the sentence to be processed from a preset corpus; the preset corpus records each sentence in the original data, the entities in each sentence, and the position and entity category of each entity within its sentence; the CRF statistical feature values include the word-segmentation feature value, part-of-speech feature value, character feature value, context feature value, and term-list feature value of each character in each sentence;

A training model determination module, configured to determine a training model according to the CRF statistical feature values of each character in each sentence; the training model is:

An entity label calculation module, configured to calculate the entity label y_j of each character in the sentence to be processed according to the training model;

An entity label sequence determination module, configured to combine the entity labels of the characters to form the entity label sequence of the sentence to be processed; wherein x denotes the sentence to be processed; y_j denotes the entity label of the character at position j in the sentence to be processed; f_i(y_j, y_{j-1}, x) denotes the value of word-segmentation feature i in the sentence to be processed; λ_i is a model parameter; m denotes the number of word-segmentation features; n denotes the number of character positions in the sentence to be processed; Z(x) denotes the normalization factor; and P(y | x) denotes the labeling probability of the characters in the sentence to be processed.
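To make the roles of λ_i, f_i, Z(x) and P(y | x) concrete, the sketch below evaluates a linear-chain CRF by brute force on a two-character sentence. It assumes the training model has the standard linear-chain form, which the symbol definitions suggest but this text does not show. The two feature functions, the weights, the label set, and the explicit position argument `j` are all hypothetical, and enumerating every label sequence is only practical for toy inputs.

```python
import math
from itertools import product
from typing import Callable, List, Sequence

Feature = Callable[[str, str, str, int], float]  # f(y_j, y_{j-1}, x, j)


def sequence_score(x: str, y: Sequence[str], features: List[Feature],
                   weights: List[float], start: str = "<START>") -> float:
    """Sum of lambda_i * f_i(y_j, y_{j-1}, x) over all positions j and features i."""
    total, prev = 0.0, start
    for j, cur in enumerate(y):
        total += sum(w * f(cur, prev, x, j) for f, w in zip(features, weights))
        prev = cur
    return total


def crf_probability(x: str, y: Sequence[str], labels: Sequence[str],
                    features: List[Feature], weights: List[float]) -> float:
    """P(y | x) = exp(score(y)) / Z(x), with Z(x) summed over every label sequence."""
    z = sum(math.exp(sequence_score(x, seq, features, weights))
            for seq in product(labels, repeat=len(x)))
    return math.exp(sequence_score(x, y, features, weights)) / z


# Two toy feature functions (hypothetical): one on the observed character, one on the transition.
def f_emit(cur, prev, x, j):
    return 1.0 if cur == "B" and x[j] == "咳" else 0.0

def f_trans(cur, prev, x, j):
    return 1.0 if prev == "B" and cur == "I" else 0.0


labels = ["B", "I", "O"]
features, weights = [f_emit, f_trans], [2.0, 2.0]
sentence = "咳嗽"
best = max(product(labels, repeat=len(sentence)),
           key=lambda seq: crf_probability(sentence, seq, labels, features, weights))
```

A practical implementation would replace the enumeration with the forward algorithm for Z(x) and Viterbi decoding for the best label sequence.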

10. The symptom and sign entity recognition device for multiple data sources according to claim 9, characterized in that the first group candidate entity determination unit is specifically configured to:

11. The symptom and sign entity recognition device for multiple data sources according to claim 10, characterized by further comprising a corpus updating unit, configured to:

Match the entities that were not recognized in the sentence to be processed against a preset symptom and sign ontology library and, when the match succeeds, save the entity label of each successfully matched entity;

12. The symptom and sign entity recognition device for multiple data sources according to claim 7, characterized in that the candidate entity screening unit comprises:

A non-symptom-and-sign term character judgment module, configured to judge whether the end character of each candidate entity in the first group of candidate entities and the second group of candidate entities is a preset non-symptom-and-sign term character;

A candidate entity discarding module, configured to discard a candidate entity when its end character is a preset non-symptom-and-sign term character.