CN106897559B - Symptom and sign entity recognition method and device for multiple data sources - Google Patents

Symptom and sign entity recognition method and device for multiple data sources

Info

Publication number
CN106897559B
Authority
CN
China
Prior art keywords
entity
sentence
processed
symptom
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710103706.4A
Other languages
Chinese (zh)
Other versions
CN106897559A (en)
Inventor
李雪莉
关毅
黄玉丽
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yi Bao Interconnected Medical Information Technology (Beijing) Co., Ltd.
Harbin Institute of Technology
Original Assignee
Heilongjiang Teshi Information Technology Co Ltd
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Heilongjiang Teshi Information Technology Co Ltd, Harbin Institute of Technology
Priority to CN201710103706.4A
Publication of CN106897559A
Application granted
Publication of CN106897559B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H50/00 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
    • G16H50/20 ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for computer-aided diagnosis, e.g. based on medical expert systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Public Health (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Pathology (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Epidemiology (AREA)
  • Primary Health Care (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides a symptom and sign entity recognition method and device for multiple data sources, and relates to the technical field of medical entity recognition. The method includes: obtaining a sentence to be processed from raw data; segmenting the sentence to be processed into individual characters; determining, according to a pre-trained CRF model, the entity tag of each character in the sentence to be processed, and determining the entity tag sequence of the sentence to be processed; determining a first group of candidate entities of the sentence to be processed according to its entity tag sequence; performing term segmentation on the sentence to be processed according to a preset symptom and sign term segmentation strategy, to determine a second group of candidate entities; screening the candidate entities to form a first group of symptom and sign candidate entities and a second group of symptom and sign candidate entities, respectively; and determining the symptom and sign entity result according to a preset decision strategy.

Description

Symptom and sign entity recognition method and device for multiple data sources
Technical field
The present invention relates to the technical field of medical entity recognition, and in particular to a symptom and sign entity recognition method and device for multiple data sources.
Background art
At present, with the development of network and medical information technology, the Chinese population is gradually aging, internet-based medical care is gradually emerging, and people's demands on the level of medical services keep rising. The contradiction between this demand and the relative shortage of medical resources is becoming increasingly apparent. Intelligent diagnosis and treatment of disease depends on identifying the correspondence between diseases and their symptoms and signs from medical big data; this process is symptom and sign entity recognition.
In recent years, as an important step in the analysis of medical and health data, medical entity recognition (such as symptom and sign entity recognition) has been used to extract the medical terms present in relevant text, and it plays an important role in the performance of subsequent research. The entity recognition techniques in common use today are vocabulary-based medical entity recognition and medical entity recognition based on conditional random fields (Conditional Random Fields, CRF). Vocabulary-based recognition relies solely on matching against a term library and does not take context into account, so term library matching is quite limited. CRF-based recognition makes little use of large-scale corpora and linguistic rules: its corpus is manually annotated, and it does not use methods such as semi-supervised learning to exploit the far larger volume of unlabeled data, so the model is incomplete; it also lacks rules based on linguistics and medical information and relies solely on the model, so it is poorly targeted to the data. Current entity recognition schemes therefore cannot accurately recognize symptom and sign entities.
Summary of the invention
Embodiments of the present invention provide a symptom and sign entity recognition method and device for multiple data sources, to solve the problem that current entity recognition schemes cannot accurately recognize symptom and sign entities.
To achieve the above object, the present invention adopts the following technical solutions:
A symptom and sign entity recognition method for multiple data sources, comprising:
obtaining a sentence to be processed from raw data;
segmenting the sentence to be processed into individual characters, to determine each character in the sentence to be processed;
determining, according to a pre-trained CRF model, the entity tag of each character in the sentence to be processed, and determining the entity tag sequence of the sentence to be processed;
determining a first group of candidate entities of the sentence to be processed according to the entity tag sequence of the sentence to be processed;
performing term segmentation on the sentence to be processed according to a preset symptom and sign term segmentation strategy, to determine a second group of candidate entities;
screening the candidate entities according to the end character of each candidate entity in the first group of candidate entities and the second group of candidate entities, to form a first group of symptom and sign candidate entities and a second group of symptom and sign candidate entities, respectively;
if the first group of symptom and sign candidate entities and the second group of symptom and sign candidate entities are not identical, determining the symptom and sign entity result from the first group of symptom and sign candidate entities and the second group of symptom and sign candidate entities according to a preset decision strategy.
Specifically, determining the symptom and sign entity result from the first group of symptom and sign candidate entities and the second group of symptom and sign candidate entities according to the preset decision strategy comprises:
determining whether the sentence to be processed was segmented by the preset segmentation rules during term segmentation;
if the sentence to be processed was segmented by the preset segmentation rules during term segmentation, selecting the candidate entities in the second group of symptom and sign candidate entities as the symptom and sign entity result;
if the sentence to be processed was not segmented by the preset segmentation rules during term segmentation, selecting the candidate entities in the first group of symptom and sign candidate entities as the symptom and sign entity result;
or, from the first group of symptom and sign candidate entities and the second group of symptom and sign candidate entities derived from the same original character string of the sentence to be processed, selecting as the symptom and sign entity result the group that has fewer entities and whose entities contain more characters;
the entity types in the symptom and sign entity result include symptom entities and sign entities;
when the entity types of corresponding entities in the first group of symptom and sign candidate entities and the second group of symptom and sign candidate entities are inconsistent, selecting the entity type of the entity in the second group of candidate entities as the entity type of the corresponding entity.
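The decision strategy described above can be illustrated with a minimal Python sketch. The `Entity` record, the function names, and the rule that "corresponding" entities share the same text and start offset are assumptions made for illustration; the source does not say how the two conditions of the alternative strategy (fewer entities, more characters) are combined, so the sketch simply orders by entity count first and total character count second, preferring the term-segmentation group on a tie.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """Hypothetical candidate-entity record (not defined in the source)."""
    text: str
    start: int
    etype: str  # "symptom" or "sign"


def choose_by_rule_flag(crf_group, term_group, rule_split_applied):
    """Pick the term-segmentation group if the preset rules segmented the sentence."""
    return term_group if rule_split_applied else crf_group


def choose_by_coverage(crf_group, term_group):
    """Alternative strategy: fewer entities first, then more characters in total."""
    def key(group):
        return (len(group), -sum(len(e.text) for e in group))
    # min() returns the first minimal item, so the term group wins ties (assumption).
    return min([term_group, crf_group], key=key)


def reconcile_types(chosen, term_group):
    """Where types disagree, take the type from the term-segmentation group."""
    term_types = {(e.text, e.start): e.etype for e in term_group}
    return [
        Entity(e.text, e.start, term_types.get((e.text, e.start), e.etype))
        for e in chosen
    ]
```

For example, `reconcile_types(choose_by_rule_flag(crf_group, term_group, False), term_group)` keeps the CRF entities but overrides any entity type that conflicts with the term-segmentation group.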
Specifically, the raw data includes electronic medical record data, settlement bill data, clinical research data, medical knowledge base data, and journal literature data.
Specifically, determining, according to the pre-trained CRF model, the entity tag of each character in the sentence to be processed, and determining the entity tag sequence of the sentence to be processed, comprises:
extracting, from a preset corpus, the CRF statistical feature values of each character in the sentence to be processed; the preset corpus records each sentence in the raw data, the entities in each sentence, and the position and entity category of each entity within its sentence; the CRF statistical feature values include the word-segmentation feature value, part-of-speech feature value, character feature value, context feature value, and term-table feature value of each character in each sentence;
determining a trained model according to the CRF statistical feature values of each character in each sentence; the trained model is:
$$P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{j=1}^{n} \sum_{i=1}^{m} \lambda_i f_i(y_j, y_{j-1}, x) \right)$$
calculating, according to the trained model, the entity tag $y_j$ of each character in the sentence to be processed;
combining the entity tags of the characters to form the entity tag sequence of the sentence to be processed; where $x$ denotes the sentence to be processed; $y_j$ denotes the entity tag of the character at position $j$ in the sentence to be processed; $f_i(y_j, y_{j-1}, x)$ denotes the value of word-segmentation feature $i$ in the sentence to be processed; $\lambda_i$ is a model parameter; $m$ denotes the number of word-segmentation features; $n$ denotes the number of character positions in the sentence to be processed; $Z(x)$ denotes the normalization factor; and $P(y \mid x)$ denotes the labeling probability of the characters in the sentence to be processed.
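To make the roles of λ_i, f_i, Z(x) and P(y | x) concrete, the following Python sketch evaluates a toy model, assuming the standard linear-chain CRF form in which P(y | x) is the exponential of the weighted feature sum over all positions, divided by Z(x). The two feature functions, their weights, and the B/I/O label set are hypothetical, the feature functions take the position j as an extra argument for convenience, and Z(x) is computed by brute-force enumeration, which is only practical for very short sentences.

```python
import itertools
import math


def sequence_score(labels, sentence, features, weights, start="<s>"):
    """Sum of weights[i] * features[i](y_j, y_{j-1}, x, j) over positions and features."""
    score = 0.0
    prev = start
    for j, label in enumerate(labels):
        for feature, weight in zip(features, weights):
            score += weight * feature(label, prev, sentence, j)
        prev = label
    return score


def crf_probability(labels, sentence, features, weights, label_set):
    """P(y | x) = exp(score(y, x)) / Z(x), with Z(x) summed over every labeling."""
    z = sum(
        math.exp(sequence_score(candidate, sentence, features, weights))
        for candidate in itertools.product(label_set, repeat=len(sentence))
    )
    return math.exp(sequence_score(labels, sentence, features, weights)) / z


# Hypothetical feature functions, for illustration only.
def in_term(label, prev, sentence, j):
    return 1.0 if sentence[j] in "头痛" and label != "O" else 0.0


def begin_then_inside(label, prev, sentence, j):
    return 1.0 if prev == "B" and label == "I" else 0.0


FEATURES = [in_term, begin_then_inside]
WEIGHTS = [2.0, 1.5]
LABELS = ("B", "I", "O")
```

A production CRF would replace the enumeration with the forward algorithm and Viterbi decoding; the brute-force version is used here only because it mirrors the definition directly.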
Specifically, determining the first group of candidate entities of the sentence to be processed according to the entity tag sequence of the sentence to be processed comprises:
determining the word-segmentation feature value corresponding to each character in the entity tag sequence, and determining the first group of candidate entities of the sentence to be processed according to the word-segmentation feature values.
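One common way to turn a per-character tag sequence into candidate entities is to group consecutive tags into spans. The source does not specify the tag scheme, so the sketch below assumes BIO-style tags such as `B-symptom`, `I-symptom` and `O`; the function name and the tuple layout `(text, start, type)` are likewise illustrative.

```python
def decode_entities(chars, tags):
    """Group BIO-style tags (assumed scheme) into (text, start, type) spans."""
    entities = []
    start, etype = None, None
    # A trailing "O" sentinel closes any entity that runs to the end of the sentence.
    for i, tag in enumerate(list(tags) + ["O"]):
        span_ends = tag == "O" or tag.startswith("B-") or tag[2:] != etype
        if start is not None and span_ends:
            entities.append(("".join(chars[start:i]), start, etype))
            start, etype = None, None
        # Open a new span on "B-", or on a stray "I-" with no open span.
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, etype = i, tag[2:]
    return entities
```

With the characters of "患者头痛伴发热" and the tags `O O B-symptom I-symptom O B-symptom I-symptom`, the function yields the spans "头痛" at offset 2 and "发热" at offset 5.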
Further, the symptom and sign entity recognition method for multiple data sources also comprises:
when the sentence to be processed is not annotated in the preset corpus, determining the uncertainty value of each entity in the sentence to be processed according to a formula (not reproduced in this text),
where $IE_k$ is the uncertainty value of the $k$-th entity; $k_{start}$ is the start position of the entity tag of the $k$-th entity; $k_{end}$ is the end position of the entity tag of the $k$-th entity; and the probability term of the formula (its symbol is lost in this text) is the probability of the $j$-th entity tag for the character at position $s$ in the sentence to be processed;
matching the entities in the sentence to be processed whose uncertainty value is 1 against a preset symptom and sign ontology library, and, if the match succeeds, saving the entity tag of the successfully matched entity;
determining the prediction confidence of the sentence to be processed and the proportion of dictionary-matched entities;
adding to the corpus each sentence to be processed whose prediction confidence is greater than a preset confidence threshold and whose proportion of dictionary-matched entities is greater than a preset ratio threshold, to update the corpus;
where the prediction confidence is the product of the labeling probabilities of the characters in the sentence to be processed;
the proportion of dictionary-matched entities is $C/B$, where $C$ is the number of entities predicted in the sentence to be processed that appear in the preset dictionary, and $B$ is the total number of entities predicted in the sentence to be processed.
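The two corpus-update criteria, the prediction confidence (the product of per-character labeling probabilities) and the proportion of predicted entities found in the dictionary, can be sketched as follows. The function names and the threshold values 0.8 and 0.5 are hypothetical; the source only states that both thresholds are preset. The uncertainty value of each entity is not computed here, because its formula is not available in this text.

```python
import math


def prediction_confidence(char_probs):
    """Product of the labeling probabilities of all characters in the sentence."""
    return math.prod(char_probs)


def dictionary_ratio(predicted_entities, dictionary):
    """C / B: predicted entities found in the dictionary over all predicted entities."""
    if not predicted_entities:
        return 0.0
    matched = sum(1 for entity in predicted_entities if entity in dictionary)
    return matched / len(predicted_entities)


def should_add_to_corpus(char_probs, predicted_entities, dictionary,
                         conf_threshold=0.8, ratio_threshold=0.5):
    """Both criteria must strictly exceed their (hypothetical) thresholds."""
    return (prediction_confidence(char_probs) > conf_threshold
            and dictionary_ratio(predicted_entities, dictionary) > ratio_threshold)
```

Because the confidence is a product over characters, long sentences are penalized relative to short ones; a fixed threshold therefore admits short sentences more readily.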
Specifically, performing term segmentation on the sentence to be processed according to the preset symptom and sign term segmentation strategy, to determine the second group of candidate entities, comprises:
converting the punctuation marks in the sentence to be processed to half-width form, and converting all English letters to uppercase;
calling a preset non-medical term table, checking whether the original character string of the sentence to be processed contains terms from the non-medical term table, and deleting any such terms from the sentence to be processed, to form a preprocessed sentence to be processed;
matching the preprocessed sentence to be processed against a preset symptom and sign database using the backward maximum matching principle, extracting as a preliminary entity each character string in the preprocessed sentence that matches a standard term name or synonym in the symptom and sign database, and taking the term type corresponding to that standard term name or synonym as the entity type of the preliminary entity;
matching the original character string of the preprocessed sentence to be processed against a preset sentence pattern database;
if the original character string of the preprocessed sentence to be processed matches a sentence pattern format in the preset sentence pattern database, matching the original character string of the preprocessed sentence against a preset disease ontology database using the backward maximum matching principle, extracting as a preliminary entity each character string that matches a standard term name or synonym in the disease ontology database, and taking the term type corresponding to that standard term name or synonym as the entity type of the preliminary entity;
taking the preliminary entities in the preprocessed sentence to be processed as the second group of candidate entities.
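The preprocessing and backward maximum matching steps above can be sketched in Python. The function names, the tuple layout `(text, start, type)`, and the tiny lexicon used in the usage note are hypothetical. Half-width conversion is implemented by shifting the Unicode full-width block U+FF01 to U+FF5E down by 0xFEE0, which also converts full-width letters and digits; this is one possible reading of the step, not necessarily the one used in the patent.

```python
# Map full-width forms (U+FF01..U+FF5E) and the ideographic space to half-width.
_FULL_TO_HALF = {cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)}
_FULL_TO_HALF[0x3000] = 0x20


def normalize(sentence, non_medical_terms):
    """Half-width conversion, uppercasing, then deletion of non-medical terms."""
    text = sentence.translate(_FULL_TO_HALF).upper()
    for term in non_medical_terms:
        text = text.replace(term, "")
    return text


def backward_maximum_match(text, lexicon, max_len=None):
    """Scan from the end of the text, always trying the longest window first.

    `lexicon` maps a standard term name or synonym to its term type.
    """
    max_len = max_len or max(map(len, lexicon), default=0)
    found = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            if piece in lexicon:
                found.append((piece, end - size, lexicon[piece]))
                end -= size
                break
        else:
            end -= 1  # no term ends here; skip one character
    found.reverse()  # report entities in reading order
    return found
```

With a lexicon containing "头痛" and "发热", `backward_maximum_match(normalize("患者，头痛伴发热。", ["患者"]), lexicon)` extracts both terms. The second pass against a disease ontology database could reuse the same function with a different lexicon, once the sentence pattern check has succeeded.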
Specifically, screening the candidate entities according to the end character of each candidate entity in the first group of candidate entities and the second group of candidate entities, to form the first group of symptom and sign candidate entities and the second group of symptom and sign candidate entities, respectively, comprises:
judging whether the end character of each candidate entity in the first group of candidate entities and the second group of candidate entities is a preset non-symptom-and-sign term character;
if the end character of a candidate entity is a preset non-symptom-and-sign term character, discarding that candidate entity.
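The end-character screening can be sketched as a simple filter. The candidate layout `(text, start, type)` and the example ending characters are assumptions; the source does not list the preset non-symptom-and-sign characters.

```python
def filter_by_end_character(candidates, non_symptom_endings):
    """Keep candidates whose last character is not a preset non-symptom ending.

    Candidates with empty text are dropped as well.
    """
    return [
        candidate for candidate in candidates
        if candidate[0] and candidate[0][-1] not in non_symptom_endings
    ]


# Hypothetical endings that typically mark diseases or procedures.
EXAMPLE_ENDINGS = {"炎", "癌", "术"}
```

The same filter is applied to both candidate groups, which yields the first and second groups of symptom and sign candidate entities.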
A symptom and sign entity recognition device for multiple data sources, comprising:
a to-be-processed sentence acquiring unit, configured to obtain a sentence to be processed from raw data;
a character segmentation unit, configured to segment the sentence to be processed into individual characters, to determine each character in the sentence to be processed;
an entity tag sequence determination unit, configured to determine, according to a pre-trained CRF model, the entity tag of each character in the sentence to be processed, and to determine the entity tag sequence of the sentence to be processed;
a first-group candidate entity determination unit, configured to determine a first group of candidate entities of the sentence to be processed according to the entity tag sequence of the sentence to be processed;
a second-group candidate entity determination unit, configured to perform term segmentation on the sentence to be processed according to a preset symptom and sign term segmentation strategy, to determine a second group of candidate entities;
a candidate entity screening unit, configured to screen the candidate entities according to the end character of each candidate entity in the first group of candidate entities and the second group of candidate entities, to form a first group of symptom and sign candidate entities and a second group of symptom and sign candidate entities, respectively;
a symptom and sign entity result determination unit, configured to determine, when the first group of symptom and sign candidate entities and the second group of symptom and sign candidate entities are not identical, the symptom and sign entity result from the first group of symptom and sign candidate entities and the second group of symptom and sign candidate entities according to a preset decision strategy.
Specifically, the symptom and sign entity result determination unit comprises:
a term segmentation judgment module, configured to determine whether the sentence to be processed was segmented by the preset segmentation rules during term segmentation;
a symptom and sign entity result determining module, configured to select the candidate entities in the second group of symptom and sign candidate entities as the symptom and sign entity result when the sentence to be processed was segmented by the preset segmentation rules during term segmentation, and to select the candidate entities in the first group of symptom and sign candidate entities as the symptom and sign entity result when the sentence to be processed was not segmented by the preset segmentation rules during term segmentation;
the symptom and sign entity result determining module is further configured to select as the symptom and sign entity result, from the first group of symptom and sign candidate entities and the second group of symptom and sign candidate entities derived from the same original character string of the sentence to be processed, the group that has fewer entities and whose entities contain more characters; the entity types in the symptom and sign entity result include symptom entities and sign entities;
an entity type determining module, configured to select, when the entity types of corresponding entities in the first group of symptom and sign candidate entities and the second group of symptom and sign candidate entities are inconsistent, the entity type of the entity in the second group of candidate entities as the entity type of the corresponding entity.
Specifically, the raw data in the to-be-processed sentence acquiring unit includes electronic medical record data, settlement bill data, clinical research data, medical knowledge base data, and journal literature data.
Further, the entity tag sequence determination unit comprises:
a CRF statistical feature value extraction module, configured to extract, from a preset corpus, the CRF statistical feature values of each character in the sentence to be processed; the preset corpus records each sentence in the raw data, the entities in each sentence, and the position and entity category of each entity within its sentence; the CRF statistical feature values include the word-segmentation feature value, part-of-speech feature value, character feature value, context feature value, and term-table feature value of each character in each sentence;
a trained model determining module, configured to determine a trained model according to the CRF statistical feature values of each character in each sentence; the trained model is:
$$P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{j=1}^{n} \sum_{i=1}^{m} \lambda_i f_i(y_j, y_{j-1}, x) \right)$$
an entity tag computing module, configured to calculate, according to the trained model, the entity tag $y_j$ of each character in the sentence to be processed;
an entity tag sequence determining module, configured to combine the entity tags of the characters to form the entity tag sequence of the sentence to be processed; where $x$ denotes the sentence to be processed; $y_j$ denotes the entity tag of the character at position $j$ in the sentence to be processed; $f_i(y_j, y_{j-1}, x)$ denotes the value of word-segmentation feature $i$ in the sentence to be processed; $\lambda_i$ is a model parameter; $m$ denotes the number of word-segmentation features; $n$ denotes the number of character positions in the sentence to be processed; $Z(x)$ denotes the normalization factor; and $P(y \mid x)$ denotes the labeling probability of the characters in the sentence to be processed.
In addition, the first-group candidate entity determination unit is specifically configured to:
determine the word-segmentation feature value corresponding to each character in the entity tag sequence, and determine the first group of candidate entities of the sentence to be processed according to the word-segmentation feature values.
Further, the symptom and sign entity recognition device for multiple data sources also comprises a corpus updating unit, configured to:
when the sentence to be processed is not annotated in the preset corpus, determine the uncertainty value of each entity in the sentence to be processed according to a formula (not reproduced in this text),
where $IE_k$ is the uncertainty value of the $k$-th entity; $k_{start}$ is the start position of the entity tag of the $k$-th entity; $k_{end}$ is the end position of the entity tag of the $k$-th entity; and the probability term of the formula (its symbol is lost in this text) is the probability of the $j$-th entity tag for the character at position $s$ in the sentence to be processed;
match the entities in the sentence to be processed whose uncertainty value is 1 against a preset symptom and sign ontology library, and, when the match succeeds, save the entity tag of the successfully matched entity;
determine the prediction confidence of the sentence to be processed and the proportion of dictionary-matched entities;
add to the corpus each sentence to be processed whose prediction confidence is greater than a preset confidence threshold and whose proportion of dictionary-matched entities is greater than a preset ratio threshold, to update the corpus;
where the prediction confidence is the product of the labeling probabilities of the characters in the sentence to be processed;
the proportion of dictionary-matched entities is $C/B$, where $C$ is the number of entities predicted in the sentence to be processed that appear in the preset dictionary, and $B$ is the total number of entities predicted in the sentence to be processed.
In addition, the second-group candidate entity determination unit comprises:
a preprocessing module, configured to convert the punctuation marks in the sentence to be processed to half-width form and convert all English letters to uppercase; and to call a preset non-medical term table, check whether the original character string of the sentence to be processed contains terms from the non-medical term table, and delete any such terms from the sentence to be processed, to form a preprocessed sentence to be processed;
a symptom and sign ontology library matching module, configured to match the preprocessed sentence to be processed against a preset symptom and sign database using the backward maximum matching principle, extract as a preliminary entity each character string in the preprocessed sentence that matches a standard term name or synonym in the symptom and sign database, and take the term type corresponding to that standard term name or synonym as the entity type of the preliminary entity; to match the original character string of the preprocessed sentence to be processed against a preset sentence pattern database; and, if the original character string of the preprocessed sentence matches a sentence pattern format in the preset sentence pattern database, to match the original character string of the preprocessed sentence against a preset disease ontology database using the backward maximum matching principle, extract as a preliminary entity each character string that matches a standard term name or synonym in the disease ontology database, and take the term type corresponding to that standard term name or synonym as the entity type of the preliminary entity;
a second-group candidate entity determining module, configured to take the preliminary entities in the preprocessed sentence to be processed as the second group of candidate entities.
In addition, the candidate entity screening unit comprises:
a non-symptom-and-sign term character judgment module, configured to judge whether the end character of each candidate entity in the first group of candidate entities and the second group of candidate entities is a preset non-symptom-and-sign term character;
a candidate entity discarding module, configured to discard a candidate entity when its end character is a preset non-symptom-and-sign term character.
In the symptom and sign entity recognition method and device for multiple data sources provided by embodiments of the present invention, a sentence to be processed is first obtained from raw data; the sentence to be processed is segmented into individual characters, to determine each character in the sentence; the entity tag of each character in the sentence is determined according to a pre-trained CRF model, and the entity tag sequence of the sentence is determined; a first group of candidate entities of the sentence is determined according to the entity tag sequence. Then, term segmentation is performed on the sentence according to a preset symptom and sign term segmentation strategy, to determine a second group of candidate entities; the candidate entities are screened according to the end character of each candidate entity in the first and second groups, to form a first group of symptom and sign candidate entities and a second group of symptom and sign candidate entities, respectively; if the two groups of symptom and sign candidate entities are not identical, the symptom and sign entity result is determined from them according to a preset decision strategy. By combining the conditional random field (CRF) statistical machine learning method with the term segmentation method, the present invention can automatically recognize symptom and sign entities, and overcomes the problems that current entity recognition draws on a relatively narrow range of data sources and recognizes entities inaccurately.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is flowchart one of a symptom and sign entity recognition method for multiple data sources provided by an embodiment of the present invention;
Fig. 2 is part A of flowchart two of a symptom and sign entity recognition method for multiple data sources provided by an embodiment of the present invention;
Fig. 3 is part B of flowchart two of a symptom and sign entity recognition method for multiple data sources provided by an embodiment of the present invention;
Fig. 4 is structural diagram one of a symptom and sign entity recognition device for multiple data sources provided by an embodiment of the present invention;
Fig. 5 is structural diagram two of a symptom and sign entity recognition device for multiple data sources provided by an embodiment of the present invention.
Detailed description of embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention, without creative effort, fall within the protection scope of the present invention.
As shown in Figure 1, the embodiment of the present invention provides a kind of symptom and sign class entity recognition method towards multi-data source, packet It includes:
Step 101: obtain a sentence to be processed from the original data.
Step 102: perform single-character segmentation on the sentence to be processed, and determine each character in the sentence to be processed.
Step 103: according to a CRF model trained in advance, determine the entity tag of each character of the sentence to be processed within that sentence, and determine the entity tag sequence of the sentence to be processed.
Step 104: according to the entity tag sequence of the sentence to be processed, determine a first group of candidate entities of the sentence to be processed.
Step 105: according to a preset symptom and sign term segmentation strategy, perform term segmentation on the sentence to be processed, and determine a second group of candidate entities.
Step 106: screen each candidate entity according to the end character of each candidate entity in the first group and the second group of candidate entities, forming a first group of symptom and sign candidate entities and a second group of symptom and sign candidate entities respectively.
Step 107: if the first group and the second group of symptom and sign candidate entities are not identical, determine the symptom and sign entity result from the two groups according to a preset determination strategy.
In the multi-data-source-oriented symptom and sign entity recognition method provided by the embodiment of the present invention, a sentence to be processed is first obtained from the original data; single-character segmentation is performed on the sentence to determine each of its characters; according to a CRF model trained in advance, the entity tag of each character within the sentence is determined, and the entity tag sequence of the sentence is determined; the first group of candidate entities is determined from the entity tag sequence. Then, according to a preset symptom and sign term segmentation strategy, term segmentation is performed on the sentence and the second group of candidate entities is determined. Each candidate entity in the two groups is screened according to its end character, forming the first and second groups of symptom and sign candidate entities respectively. If the two groups are not identical, the symptom and sign entity result is determined from them according to a preset determination strategy. By combining the conditional random field (CRF) statistical machine learning method with the term segmentation method, the present invention can recognize symptom and sign entities automatically, and overcomes the problems that current entity recognition draws on relatively homogeneous data sources and recognizes entities inaccurately.
To help those skilled in the art better understand the present invention, it is illustrated below with a specific example. As shown in Fig. 2 and Fig. 3 (Fig. 2 is part A and Fig. 3 is part B of the multi-data-source-oriented symptom and sign entity recognition method; the flow is divided into parts A and B only because the embodiment has many steps, and the division carries no practical significance; parts A and B together form steps 201 to 220, with Fig. 2 showing steps 201 to 211 and Fig. 3 showing steps 212 to 220), an embodiment of the present invention provides a multi-data-source-oriented symptom and sign entity recognition method, comprising:
Step 201: obtain a sentence to be processed from the original data.
Specifically, the original data includes, but is not limited to, symptom and sign clinical treatment data, symptom and sign research and development experimental data, symptom and sign sales data, symptom and sign scientific literature data, and symptom and sign e-commerce data.
Step 202: perform single-character segmentation on the sentence to be processed, and determine each character in the sentence to be processed.
For example, if the sentence to be processed is "dizziness worsened one week ago, accompanied by cough", single-character segmentation of the original (Chinese) sentence yields eleven characters, glossed one by one as: "one", "week", "ago", "head", "dizzy", "add", "severe", ",", "accompany", "cough", "cough".
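Single-character segmentation as described in step 202 can be sketched in a few lines of Python. This is only an illustration: the function name is hypothetical, and the Chinese example string is an assumed back-translation of the example sentence, which the text gives only in translated form.

```python
from typing import List


def split_into_characters(sentence: str) -> List[str]:
    """Split a sentence into individual characters, punctuation included."""
    return list(sentence)


# Assumed back-translation of the example sentence (not given in the text).
example = "一周前头晕加剧，伴咳嗽"
characters = split_into_characters(example)
print(characters)
```

Each character, including the comma, becomes one element, which is the unit the CRF model later assigns an entity tag to.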
Step 203: extract, from a preset corpus, the CRF statistical feature values of each character in the sentence to be processed.
The preset corpus records each sentence in the original data, the entities in each sentence, and the position and entity category of each entity within its sentence. The CRF statistical feature values include, for each character in each sentence, a word segmentation feature value, a part-of-speech feature value, a character feature value, a context feature value, and a term list feature value.
The preset corpus may be annotated manually in advance, for example for the sentences:
"Chief complaint: dizziness worsened one week ago, accompanied by cough.
Physical examination: no tenderness and rebound tenderness, 4 beats/min of gurgling sound."
For symptom and sign entities, the following can then be annotated respectively:
c=dizziness P=1:6 1:7 t=symptom
c=cough P=1:12 1:13 t=symptom
c=tenderness P=2:4 2:5 t=sign
c=rebound tenderness P=2:7 2:9 t=sign
c=gurgling sound P=2:11 2:13 t=sign
Here, c denotes the symptom and sign entity; P denotes the line number, in the corpus, of the sentence containing the entity together with the character positions of the entity within that sentence (start and end); t denotes the entity category (in the present invention, symptom and sign entity categories comprise symptom entities and sign entities).
As for the CRF statistical feature values, take the sentence "no tenderness and rebound tenderness, 4 beats/min of gurgling sound", whose entity tag sequence is "OBEOBIEOBIEOOOO". For the character "pain", for example, the CRF statistical features are described in Table 1 below:
Table 1:
Step 204: determine a training model according to the CRF statistical feature values of each character in each sentence.
The training model is:
$$p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{j=1}^{n} \sum_{i=1}^{m} \lambda_i f_i(y_j, y_{j-1}, x) \right)$$
Step 205: according to the training model, calculate the entity tag $y_j$ of each character in the sentence to be processed.
Here, $x$ denotes the sentence to be processed; $y_j$ denotes the entity tag of the character at position $j$ of the sentence to be processed; $f_i(y_j, y_{j-1}, x)$ denotes the value of the $i$-th word segmentation feature function on the sentence to be processed; $\lambda_i$ is a model parameter, and the parameters obtained by training maximize the sum of $p(y \mid x)$ over the training sentences; $m$ denotes the number of word segmentation features; $n$ denotes the number of character positions in the sentence to be processed; $Z(x)$ denotes the normalization factor; $p(y \mid x)$ denotes the tagging probability of the characters in the sentence to be processed.
For $f_i(y_j, y_{j-1}, x)$: if $y_j$, $y_{j-1}$ and $x$ all occur together in the corpus, then $f_i(y_j, y_{j-1}, x) = 1$; otherwise it is 0.
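The symbols defined above ($\lambda_i$, $f_i$, $Z(x)$, $p(y \mid x)$) are those of a linear-chain CRF, and a toy version can be written out to show how they fit together. The sketch below is not the patented model: the tag set, the two kinds of indicator features (tag transition, and character with tag), the weight dictionary, and the brute-force normalization are all assumptions chosen for illustration. A practical implementation would compute $Z(x)$ by dynamic programming.

```python
import itertools
import math

TAGS = ["B", "I", "E", "O"]  # assumed tag set, taken from the example tag sequence


def score(tags, sentence, weights):
    """Sum of lambda_i * f_i(y_j, y_{j-1}, x) over positions j and features i."""
    total = 0.0
    previous = "<START>"  # hypothetical marker for the tag before position 0
    for j, tag in enumerate(tags):
        # Indicator feature on the tag pair (y_{j-1}, y_j).
        total += weights.get(("transition", previous, tag), 0.0)
        # Indicator feature on the character at position j and its tag y_j.
        total += weights.get(("emission", sentence[j], tag), 0.0)
        previous = tag
    return total


def probability(tags, sentence, weights):
    """p(y | x) = exp(score) / Z(x), with Z(x) summed over every tag sequence."""
    z = sum(
        math.exp(score(candidate, sentence, weights))
        for candidate in itertools.product(TAGS, repeat=len(sentence))
    )
    return math.exp(score(tags, sentence, weights)) / z


# Hypothetical weights: only features that fire contribute to the score.
toy_weights = {("emission", "a", "B"): 2.0, ("transition", "B", "E"): 1.0}
print(probability(["B", "E"], "ab", toy_weights))
```

Features absent from the weight dictionary contribute zero, which mirrors the indicator behaviour described for $f_i$.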
Step 206: combine the entity tags of the characters to form the entity tag sequence of the sentence to be processed.
For example, for the sentence "no tenderness and rebound tenderness, 4 beats/min of gurgling sound", the entity tag sequence is "OBEOBIEOBIEOOOO".
Step 207: determine the word segmentation feature value corresponding to each character in the entity tag sequence, and determine the first group of candidate entities of the sentence to be processed according to these values.
For example, for "no tenderness and rebound tenderness, 4 beats/min of gurgling sound" the entity tag sequence is "OBEOBIEOBIEOOOO", so the first group of candidate entities can be identified as "tenderness", "rebound tenderness" and "gurgling sound".
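Reading candidate entities off a tag sequence such as "OBEOBIEOBIEOOOO" can be sketched as follows. The sketch assumes that B, I, E and O mark the beginning, inside and end of an entity and a non-entity character respectively, which is inferred from the example rather than stated in the text. The function name is hypothetical, and the Chinese sentence is an assumed back-translation of the example.

```python
def decode_entities(characters, tags):
    """Collect spans that open at a B tag and close at the next E tag."""
    entities = []
    start = None
    for position, tag in enumerate(tags):
        if tag == "B":
            start = position
        elif tag == "E" and start is not None:
            text = "".join(characters[start:position + 1])
            entities.append((start, position, text))
            start = None
        elif tag == "O":
            start = None  # an unfinished span is dropped
    return entities


# Assumed back-translation of the example sentence (15 characters).
sentence = "无压痛及反跳痛，肠鸣音4次/分"
tag_sequence = "OBEOBIEOBIEOOOO"
print(decode_entities(list(sentence), list(tag_sequence)))
```

Each result carries the start and end positions of the entity, the same positions used for the corpus annotations.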
Step 208: convert the punctuation marks in the sentence to be processed to half-width, and convert all English letters to uppercase.
Step 209: call a preset non-medical term table, check whether the original character string of the sentence to be processed contains terms from the non-medical term table, and delete any such terms from the sentence, forming the preprocessed sentence to be processed.
Step 210: match the preprocessed sentence to be processed against a preset symptom and sign database using the reverse maximum matching principle; extract each character string in the preprocessed sentence that matches a standard term name or a synonym in the symptom and sign database as a preliminary entity, and take the term type corresponding to that standard term name or synonym as the entity type of the preliminary entity.
Here, the preset symptom and sign database may include a symptom and sign data table as shown in Table 2 below. The table can be built by extending the international ICD-10 and authoritative medical reference books; it contains synonym relations between words and concept category relations between words, which are reflected in the table as standard term names, synonyms, hypernyms, and so on.
Table 2:
Standard term name | Synonym | Hypernym name | Term type
Pain | | | Symptom
Headache | | Pain | Symptom
Tenderness | | | Sign
Blood pressure | | | Sign
Heart rate | | | Sign
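Reverse maximum matching against a term dictionary, as used in step 210, can be sketched as follows. The dictionary here is a hypothetical stand-in for the symptom and sign database: it maps each surface form (standard term name or synonym) directly to a term type, and its Chinese entries are assumed for illustration.

```python
def reverse_maximum_match(text, term_types):
    """Scan from the end of the text, always taking the longest dictionary match."""
    max_len = max((len(term) for term in term_types), default=0)
    entities = []
    end = len(text)
    while end > 0:
        matched = False
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            if candidate in term_types:
                entities.append((candidate, term_types[candidate]))
                end -= size
                matched = True
                break
        if not matched:
            end -= 1  # no term ends here; move one character to the left
    entities.reverse()  # restore left-to-right order
    return entities


# Hypothetical dictionary entries: surface form -> term type.
terms = {"痛": "symptom", "压痛": "sign", "反跳痛": "sign"}
print(reverse_maximum_match("无压痛及反跳痛", terms))
```

Because longer candidates are tried first, the single character for "pain" does not split "tenderness" or "rebound tenderness" in this example.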
Step 211: match the original character string of the preprocessed sentence to be processed against a preset sentence pattern database.
Step 212: if the original character string of the preprocessed sentence to be processed matches a sentence pattern format in the preset sentence pattern database, match that original character string against a preset disease ontology database using the reverse maximum matching principle; extract each character string that matches a standard term name or a synonym in the disease ontology database as a preliminary entity, and take the term type corresponding to that standard term name or synonym as the entity type of the preliminary entity.
Steps 211 and 212 are performed because the remaining character string may still contain entities that have not been extracted, so further judgment and extraction are needed.
The preset sentence pattern database here may include a sentence pattern data table as shown in Table 3 below:
Table 3:
Step 213: take the preliminary entities in the preprocessed sentence to be processed as the second group of candidate entities.
Through the specific rules of steps 210 and 212 above, the final second group of candidate entities is formed.
Step 214: judge whether the end character of each candidate entity in the first group and the second group of candidate entities is a preset non-symptom-and-sign term character.
The preset non-symptom-and-sign term characters may be, for example, "medicine", "operation", "surgery", "examination", and so on.
Step 215: if the end character of a candidate entity is a preset non-symptom-and-sign term character, discard that candidate entity.
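The end-character screening of steps 214 and 215 can be sketched as a simple suffix test. The Chinese suffixes below are assumed back-translations of the examples given in the text, and the candidate entities are hypothetical.

```python
# Assumed back-translations of the example non-symptom-and-sign endings.
NON_SYMPTOM_ENDINGS = ("药", "手术", "术", "检查")


def screen_candidates(candidates, endings=NON_SYMPTOM_ENDINGS):
    """Keep only candidates whose text does not end with a listed ending."""
    return [
        (text, entity_type)
        for text, entity_type in candidates
        if not text.endswith(tuple(endings))
    ]


# Hypothetical candidates: "dizziness" and "appendectomy".
group = [("头晕", "symptom"), ("阑尾切除术", "symptom")]
print(screen_candidates(group))
```

The same function would be applied to the first and the second group separately, giving the two groups of symptom and sign candidate entities.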
After step 215, step 216 or step 219 is executed.
Step 216: when the first group and the second group of symptom and sign candidate entities are not identical, determine whether the sentence to be processed was segmented by the preset segmentation rules during term segmentation.
That is, whether it was processed by steps 211 and 212 above.
After step 216, step 217 or step 218 is executed.
Step 217: if the sentence to be processed was segmented by the preset segmentation rules during term segmentation, select the candidate entities in the second group of symptom and sign candidate entities as the symptom and sign entity result.
Step 218: if the sentence to be processed was not segmented by the preset segmentation rules during term segmentation, select the candidate entities in the first group of symptom and sign candidate entities as the symptom and sign entity result.
Step 219: when the first group and the second group of symptom and sign candidate entities are not identical, compare the first and second groups of symptom and sign candidate entities derived from the original character string of the same sentence to be processed, and take the group that has fewer entities and whose entities contain more characters as the symptom and sign entity result.
For example, the original data is "presenting with abdominal distention".
The first group of symptom and sign candidate entities is "abdominal distention [symptom]";
the second group of symptom and sign candidate entities is "bulge [symptom]";
the final result is then "abdominal distention [symptom]".
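The selection rule of step 219 can be sketched as a comparison of two groups. The text does not say what happens when "fewer entities" and "more characters" point to different groups, so this sketch makes an assumption: entity count is compared first, total character count breaks ties, and the first group is kept when both are equal. The function name is hypothetical.

```python
def pick_group(first_group, second_group):
    """Prefer the group with fewer entities, then the one with more characters."""
    def rank(group):
        # Smaller tuples win: fewer entities first, then more total characters.
        return (len(group), -sum(len(entity) for entity in group))

    return second_group if rank(second_group) < rank(first_group) else first_group


print(pick_group(["abdominal distention"], ["bulge"]))
```

With the example above, both groups hold one entity, and the first group wins because its entity contains more characters.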
After steps 217, 218 and 219, step 220 is executed.
Step 220: when the entity types of corresponding entities in the first group and the second group of symptom and sign candidate entities are inconsistent, select the entity type of the entity in the second group of candidate entities as the entity type of the corresponding entity.
For example, the original data is "followed by manifestations of peritonitis".
The first group of symptom and sign candidate entities is "peritonitis [disease]";
the second group of symptom and sign candidate entities is "peritonitis [symptom]";
the final result is then "peritonitis [symptom]".
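The type reconciliation of step 220 can be sketched as a lookup in the second group. The function name and the pair representation of entities are hypothetical.

```python
def resolve_types(result, second_group):
    """Where an entity also appears in the second group, use that group's type."""
    second_types = dict(second_group)
    return [
        (text, second_types.get(text, entity_type))
        for text, entity_type in result
    ]


print(resolve_types([("peritonitis", "disease")], [("peritonitis", "symptom")]))
```

Entities that do not appear in the second group keep the type they already have.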
Through steps 201 to 220 above, the symptom and sign entity recognition result is finally obtained.
In addition, to update the corpus, new sentence pattern features can be found by manual summarization, annotated manually, and added to the corpus. Furthermore, when the sentence to be processed has not been annotated in the preset corpus, the uncertainty value of each entity in the sentence to be processed can be determined according to a formula in which $IE_k$ is the uncertainty value of the $k$-th entity; $k_{start}$ is the start position of the entity tag of the $k$-th entity; $k_{end}$ is the end position of the entity tag of the $k$-th entity; and the remaining term is the probability of the $j$-th entity tag for the character at position $s$ of the sentence to be processed.
For example, for "no tenderness and rebound tenderness, 4 beats/min of gurgling sound", the entity tag sequence is "OBEOBIEOBIEOOOO" and the position sequence is "0 1 2 3 4 5 6 7 8 9 10 11 12 13 14". The entity "tenderness" occupies positions "1 2", so $k_{start}$ is 1 and $k_{end}$ is 2.
Entities in the sentence to be processed whose uncertainty value is 1 are matched against the preset symptom and sign ontology library; if the match succeeds, the entity tag of the successfully matched entity is saved.
The prediction confidence of the sentence to be processed and the proportion of entities labelled by dictionary matching are determined.
Sentences to be processed whose prediction confidence is greater than a preset confidence threshold and whose proportion of dictionary-matched entities is greater than a preset ratio threshold are added to the corpus, so as to update the corpus.
Here, the prediction confidence is the product of the tagging probabilities of the characters in the sentence to be processed.
The proportion of entities labelled by dictionary matching is $\frac{C}{B}$, where $C$ is the number of entities, among all entities predicted in the sentence to be processed, that appear in the preset dictionary, and $B$ is the total number of entities predicted in the sentence to be processed.
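The two quantities that decide whether a sentence is added to the corpus can be sketched as follows. The function names and the default threshold values are hypothetical, and the behaviour for a sentence with no predicted entities (ratio 0) is an assumption, since the text does not cover that case.

```python
import math


def prediction_confidence(tag_probabilities):
    """Product of the tagging probabilities of the characters in a sentence."""
    return math.prod(tag_probabilities)


def dictionary_match_ratio(predicted_entities, dictionary):
    """C / B: share of predicted entities that appear in the dictionary."""
    if not predicted_entities:
        return 0.0  # assumption: no predicted entities means no evidence
    matched = sum(1 for entity in predicted_entities if entity in dictionary)
    return matched / len(predicted_entities)


def should_add_to_corpus(tag_probabilities, predicted_entities, dictionary,
                         confidence_threshold=0.8, ratio_threshold=0.5):
    """Both values must exceed their (hypothetical) thresholds."""
    return (
        prediction_confidence(tag_probabilities) > confidence_threshold
        and dictionary_match_ratio(predicted_entities, dictionary) > ratio_threshold
    )


print(should_add_to_corpus([0.99, 0.98], ["tenderness"], {"tenderness"}))
```

Because the confidence is a product over all characters, it shrinks with sentence length, so a threshold tuned on short sentences may reject long ones.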
It can be seen that, by updating the corpus in this way, a semi-supervised self-learning method continually enriches the corpus data needed for entity recognition, which addresses the problem of an insufficient and incomplete corpus.
Corresponding to the method embodiments shown in Fig. 1, Fig. 2 and Fig. 3 above, and as shown in Fig. 4, an embodiment of the present invention provides a multi-data-source-oriented symptom and sign entity recognition device, comprising:
A to-be-processed sentence acquiring unit 31, configured to obtain a sentence to be processed from the original data.
A single-character segmentation unit 32, configured to perform single-character segmentation on the sentence to be processed and determine each character in the sentence to be processed.
An entity tag sequence determination unit 33, configured to determine, according to a CRF model trained in advance, the entity tag of each character of the sentence to be processed within that sentence, and to determine the entity tag sequence of the sentence to be processed.
A first-group candidate entity determination unit 34, configured to determine the first group of candidate entities of the sentence to be processed according to its entity tag sequence.
A second-group candidate entity determination unit 35, configured to perform term segmentation on the sentence to be processed according to a preset symptom and sign term segmentation strategy and determine the second group of candidate entities.
A candidate entity screening unit 36, configured to screen each candidate entity according to the end character of each candidate entity in the first group and the second group of candidate entities, forming the first group and the second group of symptom and sign candidate entities respectively.
A symptom and sign entity result determination unit 37, configured to determine, when the first group and the second group of symptom and sign candidate entities are not identical, the symptom and sign entity result from the two groups according to a preset determination strategy.
Specifically, as shown in Fig. 5, the symptom and sign entity result determination unit 37 comprises:
A term segmentation judgment module 371, configured to determine whether the sentence to be processed was segmented by the preset segmentation rules during term segmentation.
A symptom and sign entity result determining module 372, configured to select the candidate entities in the second group of symptom and sign candidate entities as the symptom and sign entity result when the sentence to be processed was segmented by the preset segmentation rules during term segmentation, and to select the candidate entities in the first group of symptom and sign candidate entities as the symptom and sign entity result when it was not.
The symptom and sign entity result determining module 372 is further configured to compare the first and second groups of symptom and sign candidate entities derived from the original character string of the same sentence to be processed, and to take the group that has fewer entities and whose entities contain more characters as the symptom and sign entity result. The entity types in the symptom and sign entity result comprise symptom entities and sign entities.
An entity type determining module 373, configured to select, when the entity types of corresponding entities in the first group and the second group of symptom and sign candidate entities are inconsistent, the entity type of the entity in the second group of candidate entities as the entity type of the corresponding entity.
Specifically, the original data in the to-be-processed sentence acquiring unit 31 includes electronic medical record data, settlement statement data, clinical research data, medical knowledge base data, and journal literature data.
Further, as shown in Fig. 5, the entity tag sequence determination unit 33 comprises:
A CRF statistical feature value extraction module 331, configured to extract, from a preset corpus, the CRF statistical feature values of each character in the sentence to be processed. The preset corpus records each sentence in the original data, the entities in each sentence, and the position and entity category of each entity within its sentence. The CRF statistical feature values include, for each character in each sentence, a word segmentation feature value, a part-of-speech feature value, a character feature value, a context feature value, and a term list feature value.
A training model determining module 332, configured to determine a training model according to the CRF statistical feature values of each character in each sentence. The training model is:
$$p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{j=1}^{n} \sum_{i=1}^{m} \lambda_i f_i(y_j, y_{j-1}, x) \right)$$
An entity tag computing module 333, configured to calculate, according to the training model, the entity tag $y_j$ of each character in the sentence to be processed.
An entity tag sequence determining module 334, configured to combine the entity tags of the characters to form the entity tag sequence of the sentence to be processed. Here, $x$ denotes the sentence to be processed; $y_j$ denotes the entity tag of the character at position $j$ of the sentence to be processed; $f_i(y_j, y_{j-1}, x)$ denotes the value of the $i$-th word segmentation feature function on the sentence to be processed; $\lambda_i$ is a model parameter; $m$ denotes the number of word segmentation features; $n$ denotes the number of character positions in the sentence to be processed; $Z(x)$ denotes the normalization factor; $p(y \mid x)$ denotes the tagging probability of the characters in the sentence to be processed.
In addition, the first-group candidate entity determination unit 34 is specifically configured to:
determine the word segmentation feature value corresponding to each character in the entity tag sequence, and determine the first group of candidate entities of the sentence to be processed according to these values.
Further, as shown in Fig. 5, the multi-data-source-oriented symptom and sign entity recognition device further comprises a corpus updating unit 38, configured to:
when the sentence to be processed has not been annotated in the preset corpus, determine the uncertainty value of each entity in the sentence to be processed according to a formula
in which $IE_k$ is the uncertainty value of the $k$-th entity; $k_{start}$ is the start position of the entity tag of the $k$-th entity; $k_{end}$ is the end position of the entity tag of the $k$-th entity; and the remaining term is the probability of the $j$-th entity tag for the character at position $s$ of the sentence to be processed;
match the entities in the sentence to be processed whose uncertainty value is 1 against the preset symptom and sign ontology library, and, when the match succeeds, save the entity tag of the successfully matched entity;
determine the prediction confidence of the sentence to be processed and the proportion of entities labelled by dictionary matching;
add sentences to be processed whose prediction confidence is greater than a preset confidence threshold and whose proportion of dictionary-matched entities is greater than a preset ratio threshold to the corpus, so as to update the corpus.
Here, the prediction confidence is the product of the tagging probabilities of the characters in the sentence to be processed.
The proportion of entities labelled by dictionary matching is $\frac{C}{B}$, where $C$ is the number of entities, among all entities predicted in the sentence to be processed, that appear in the preset dictionary, and $B$ is the total number of entities predicted in the sentence to be processed.
In addition, as shown in Fig. 5, the second-group candidate entity determination unit 35 comprises:
A preprocessing module 351, configured to convert the punctuation marks in the sentence to be processed to half-width and convert all English letters to uppercase; and to call a preset non-medical term table, check whether the original character string of the sentence to be processed contains terms from the non-medical term table, and delete any such terms from the sentence, forming the preprocessed sentence to be processed.
A symptom and sign ontology library matching module 352, configured to match the preprocessed sentence to be processed against a preset symptom and sign database using the reverse maximum matching principle, extract each character string in the preprocessed sentence that matches a standard term name or a synonym in the symptom and sign database as a preliminary entity, and take the term type corresponding to that standard term name or synonym as the entity type of the preliminary entity; to match the original character string of the preprocessed sentence to be processed against a preset sentence pattern database; and, if that original character string matches a sentence pattern format in the preset sentence pattern database, to match it against a preset disease ontology database using the reverse maximum matching principle, extract each character string that matches a standard term name or a synonym in the disease ontology database as a preliminary entity, and take the term type corresponding to that standard term name or synonym as the entity type of the preliminary entity.
A second-group candidate entity determining module 353, configured to take the preliminary entities in the preprocessed sentence to be processed as the second group of candidate entities.
In addition, as shown in Fig. 5, the candidate entity screening unit 36 comprises:
A non-symptom-and-sign term character judgment module 361, configured to judge whether the end character of each candidate entity in the first group and the second group of candidate entities is a preset non-symptom-and-sign term character.
A candidate entity discarding module 362, configured to discard a candidate entity when its end character is a preset non-symptom-and-sign term character.
It is worth noting that, for the specific implementation of the multi-data-source-oriented symptom and sign entity recognition device provided by the embodiment of the present invention, reference may be made to the method embodiments above; details are not repeated here.
In the multi-data-source-oriented symptom and sign entity recognition device provided by the embodiment of the present invention, a sentence to be processed is first obtained from the original data; single-character segmentation is performed on the sentence to determine each of its characters; according to a CRF model trained in advance, the entity tag of each character within the sentence is determined, and the entity tag sequence of the sentence is determined; the first group of candidate entities is determined from the entity tag sequence. Then, according to a preset symptom and sign term segmentation strategy, term segmentation is performed on the sentence and the second group of candidate entities is determined. Each candidate entity in the two groups is screened according to its end character, forming the first and second groups of symptom and sign candidate entities respectively. If the two groups are not identical, the symptom and sign entity result is determined from them according to a preset determination strategy. By combining the conditional random field (CRF) statistical machine learning method with the term segmentation method, the present invention can recognize symptom and sign entities automatically, and overcomes the problems that current entity recognition draws on relatively homogeneous data sources and recognizes entities inaccurately.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, the device (system), and the computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce a device for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing equipment to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, the instruction device implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps are executed on the computer or other programmable equipment to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Specific embodiments have been used herein to describe the principle and implementation of the present invention; the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. At the same time, for those of ordinary skill in the art, changes may be made to the specific implementation and the scope of application in accordance with the idea of the present invention. In conclusion, the content of this specification should not be construed as limiting the present invention.

Claims (12)

1. A multi-data-source-oriented method for recognizing symptom and sign entities, characterized by comprising:
Obtaining a sentence to be processed from raw data;
Segmenting the sentence to be processed into single characters, thereby determining each character in the sentence to be processed;
Determining, according to a pre-trained CRF model, the entity label of each character in the sentence to be processed, and determining the entity label sequence of the sentence to be processed;
Determining a first group of candidate entities of the sentence to be processed according to the entity label sequence of the sentence to be processed;
Performing term segmentation on the sentence to be processed according to a preset symptom-and-sign term segmentation strategy, thereby determining a second group of candidate entities;
Screening each candidate entity according to the end character of each candidate entity in the first group of candidate entities and the second group of candidate entities, thereby forming a first group of symptom-and-sign candidate entities and a second group of symptom-and-sign candidate entities, respectively;
If the first group of symptom-and-sign candidate entities and the second group of symptom-and-sign candidate entities are not identical, determining a symptom-and-sign entity result from the two groups according to a preset determination strategy;
Wherein determining the symptom-and-sign entity result from the first group of symptom-and-sign candidate entities and the second group of symptom-and-sign candidate entities according to the preset determination strategy comprises:
Determining whether the sentence to be processed was segmented by a preset segmentation rule during term segmentation;
If the sentence to be processed was segmented by a preset segmentation rule during term segmentation, selecting the candidate entities in the second group of symptom-and-sign candidate entities as the symptom-and-sign entity result;
If the sentence to be processed was not segmented by a preset segmentation rule during term segmentation, selecting the candidate entities in the first group of symptom-and-sign candidate entities as the symptom-and-sign entity result;
Alternatively, from the first group and the second group of symptom-and-sign candidate entities derived from the original character string of the same sentence to be processed, selecting as the symptom-and-sign entity result the group that has fewer entities and whose entities contain more characters;
The entity types in the symptom-and-sign entity result include symptom entities and sign entities;
When the entity types of corresponding entities in the first group of symptom-and-sign candidate entities and the second group of symptom-and-sign candidate entities are inconsistent, selecting the entity type of the entity in the second group of candidate entities as the entity type of the corresponding entity;
Wherein performing term segmentation on the sentence to be processed according to the preset symptom-and-sign term segmentation strategy, thereby determining the second group of candidate entities, comprises:
Converting the punctuation marks in the sentence to be processed to half-width form, and unifying English letters to uppercase;
Calling a preset non-medical term table, checking whether the original character string of the sentence to be processed contains terms from the non-medical term table, and deleting any such terms found in the sentence to be processed, thereby forming a preprocessed sentence to be processed;
Matching the preprocessed sentence to be processed against a preset symptom-and-sign database using the reverse maximum matching principle, extracting each character string in the preprocessed sentence that matches a standard term name or synonym in the symptom-and-sign database as a preliminary entity, and taking the term type corresponding to that standard term name or synonym as the entity type of the preliminary entity;
Matching the original character string of the preprocessed sentence to be processed against a preset sentence pattern database;
If the original character string of the preprocessed sentence to be processed matches a sentence pattern format in the preset sentence pattern database, matching that original character string against a preset disease ontology database using the reverse maximum matching principle, extracting each character string that matches a standard term name or synonym in the disease ontology database as a preliminary entity, and taking the term type corresponding to that standard term name or synonym as the entity type of the preliminary entity;
Taking the preliminary entities in the preprocessed sentence to be processed as the second group of candidate entities.
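The term segmentation steps of claim 1 can be sketched roughly as follows. This is a minimal illustration, not the patented implementation: the function names, the `max_len` window, the use of Unicode NFKC normalization for the half-width conversion, and the tiny dictionaries in the usage note are all assumptions made for the example.

```python
import unicodedata


def preprocess(sentence, non_medical_terms):
    """Normalise width and case, then delete non-medical terms (hypothetical helper)."""
    # NFKC folds full-width ASCII variants (e.g. U+FF0C) to their half-width forms.
    text = unicodedata.normalize("NFKC", sentence).upper()
    # Delete longer terms first so a short term cannot split a longer one.
    for term in sorted(non_medical_terms, key=len, reverse=True):
        text = text.replace(term, "")
    return text


def reverse_maximum_match(text, term_types, max_len=8):
    """Scan from the end of the text, taking the longest dictionary term each time.

    term_types maps a standard term name or synonym to its term type.
    Returns (matched string, entity type, start offset) tuples in text order.
    """
    entities = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            if candidate in term_types:
                entities.append((candidate, term_types[candidate], end - size))
                end -= size
                break
        else:
            end -= 1  # no dictionary term ends here; skip one character
    entities.reverse()
    return entities
```

For instance, with a hypothetical non-medical term table `{"患者自述"}` and a dictionary that contains "头痛", "恶心", "呕吐", and "恶心呕吐", the sentence "患者自述头痛，伴恶心呕吐" yields the two entities "头痛" and "恶心呕吐": scanning from the right prefers the longer term "恶心呕吐" over its two shorter parts.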
2. The multi-data-source-oriented method for recognizing symptom and sign entities according to claim 1, characterized in that the raw data includes electronic medical record data, settlement form data, clinical research data, medical knowledge base data, and journal literature data.
3. The multi-data-source-oriented method for recognizing symptom and sign entities according to claim 2, characterized in that determining, according to the pre-trained CRF model, the entity label of each character in the sentence to be processed, and determining the entity label sequence of the sentence to be processed, comprises:
Extracting the CRF statistical feature values of each character in the sentence to be processed from a preset corpus; the preset corpus records each sentence in the raw data, the entities in each sentence, and the position and entity category of each entity within its sentence; the CRF statistical feature values include the word segmentation feature value, part-of-speech feature value, character feature value, context feature value, and nomenclature feature value of each character in each sentence;
Determining a trained model according to the CRF statistical feature values of each character in each sentence; the trained model is:
Calculating the entity label y_j of each character in the sentence to be processed according to the trained model;
Combining the entity labels of the characters to form the entity label sequence of the sentence to be processed; wherein x denotes the sentence to be processed; y_j denotes the entity label of the character at position j in the sentence to be processed; f_i(y_j, y_{j-1}, x) denotes the function value of word segmentation feature i in the sentence to be processed; λ_i is a model parameter; m denotes the number of word segmentation features; n denotes the number of character positions in the sentence to be processed; Z(x) denotes the normalization factor; P(y | x) denotes the labeling probability of the characters in the sentence to be processed.
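The formula for the trained model is not preserved in this text. The symbols defined in claim 3 (x, y_j, f_i, λ_i, m, n, Z(x), P(y | x)) are those of a standard linear-chain conditional random field, which would read as below. This is a reconstruction under that assumption, including the usual definition of Z(x) as a sum over all candidate label sequences y'; it is not taken from the patent's original drawing of the formula.

```latex
P(y \mid x) = \frac{1}{Z(x)} \exp\!\left( \sum_{j=1}^{n} \sum_{i=1}^{m} \lambda_i \, f_i(y_j, y_{j-1}, x) \right),
\qquad
Z(x) = \sum_{y'} \exp\!\left( \sum_{j=1}^{n} \sum_{i=1}^{m} \lambda_i \, f_i(y'_j, y'_{j-1}, x) \right)
```

Under this form, the entity label sequence of a sentence is the sequence y that maximizes P(y | x).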
4. The multi-data-source-oriented method for recognizing symptom and sign entities according to claim 3, characterized in that determining the first group of candidate entities of the sentence to be processed according to the entity label sequence of the sentence to be processed comprises:
Determining the word segmentation feature value corresponding to each character in the entity label sequence, and determining the first group of candidate entities of the sentence to be processed according to the word segmentation feature values.
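How a per-character label sequence turns into candidate entities can be illustrated with a short sketch. The claim does not spell out the label scheme, so the BIO-style labels used here ("B-symptom", "I-symptom", "O") and the function name are assumptions for illustration only.

```python
def decode_entities(chars, labels):
    """Group consecutive B-/I- labels of the same type into (text, type, start) tuples.

    Assumes a hypothetical BIO scheme: "B-<type>" opens an entity,
    "I-<type>" continues it, and anything else closes it.
    """
    entities = []
    start, kind = None, None
    # A trailing "O" sentinel flushes an entity that runs to the end of the sentence.
    for pos, label in enumerate(list(labels) + ["O"]):
        if label.startswith("I-") and start is not None and label[2:] == kind:
            continue  # still inside the current entity
        if start is not None:
            entities.append(("".join(chars[start:pos]), kind, start))
            start, kind = None, None
        if label.startswith("B-"):
            start, kind = pos, label[2:]
    return entities
```

An "I-" label that does not follow a matching "B-" label is simply ignored in this sketch; a production decoder might treat it differently.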
5. The multi-data-source-oriented method for recognizing symptom and sign entities according to claim 4, characterized by further comprising:
When the sentence to be processed is not annotated in the preset corpus, according to the formula:
Determining the uncertainty value of each entity in the sentence to be processed; wherein IE_k is the uncertainty value of the k-th entity; k_start is the start position of the entity label of the k-th entity; k_end is the end position of the entity label of the k-th entity; and the probability term is the probability that the character at position s in the sentence to be processed takes the j-th entity label;
Matching the entities in the sentence to be processed whose uncertainty value is 1 against a preset symptom-and-sign ontology library and, if the match succeeds, saving the entity labels of the successfully matched entities;
Determining the prediction confidence of the sentence to be processed and the proportion of dictionary-matched entities;
Adding to the corpus each sentence to be processed whose prediction confidence is greater than a preset confidence threshold and whose proportion of dictionary-matched entities is greater than a preset proportion threshold, so as to update the corpus;
Wherein the prediction confidence is the product of the labeling probabilities of the characters in the sentence to be processed;
The proportion of dictionary-matched entities is C/B; wherein C is the number of entities, among all entities predicted in the sentence to be processed, that appear in a preset dictionary; B is the total number of entities predicted in the sentence to be processed.
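The corpus update test in claim 5 combines two quantities: the product of the per-character labeling probabilities and the ratio of predicted entities that appear in a dictionary. A minimal sketch follows; the function names, the default thresholds (0.8 and 0.5), and the choice to return a ratio of 0 when no entity was predicted are assumptions, since the claim only says the thresholds are preset.

```python
import math


def prediction_confidence(char_probs):
    """Product of the labeling probabilities of all characters in the sentence."""
    return math.prod(char_probs)


def dictionary_match_ratio(predicted_entities, dictionary):
    """C / B: predicted entities found in the dictionary over all predicted entities."""
    if not predicted_entities:
        return 0.0  # assumption: a sentence with no predicted entity never qualifies
    matched = sum(1 for entity in predicted_entities if entity in dictionary)
    return matched / len(predicted_entities)


def should_add_to_corpus(char_probs, predicted_entities, dictionary,
                         confidence_threshold=0.8, ratio_threshold=0.5):
    """Both values must exceed their (hypothetical) preset thresholds."""
    return (prediction_confidence(char_probs) > confidence_threshold
            and dictionary_match_ratio(predicted_entities, dictionary) > ratio_threshold)
```

Because the confidence is a product, it shrinks as sentences get longer, so in practice a single threshold favors short sentences; the claim does not say whether any length normalization is applied.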
6. The multi-data-source-oriented method for recognizing symptom and sign entities according to claim 5, characterized in that screening each candidate entity according to the end character of each candidate entity in the first group of candidate entities and the second group of candidate entities, thereby forming the first group of symptom-and-sign candidate entities and the second group of symptom-and-sign candidate entities, respectively, comprises:
Judging whether the end character of each candidate entity in the first group of candidate entities and the second group of candidate entities is a preset non-symptom-and-sign term character;
If the end character of a candidate entity is a preset non-symptom-and-sign term character, discarding that candidate entity.
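The end-character screening of claim 6 amounts to a simple filter. In the sketch below the function name and the example end characters are hypothetical; the patent only states that the set of non-symptom-and-sign term characters is preset.

```python
def screen_by_end_character(candidates, non_symptom_endings):
    """Keep only candidates whose last character is not a preset non-symptom-and-sign ending."""
    return [entity for entity in candidates
            if entity and entity[-1] not in non_symptom_endings]
```

For example, with the hypothetical ending set `{"病", "炎"}`, the candidates "肺炎" and "糖尿病" are discarded while "头痛" is kept. The same filter is applied to both candidate groups independently.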
7. A multi-data-source-oriented device for recognizing symptom and sign entities, characterized by comprising:
A to-be-processed sentence obtaining unit, configured to obtain a sentence to be processed from raw data;
A single-character segmentation unit, configured to segment the sentence to be processed into single characters and determine each character in the sentence to be processed;
An entity label sequence determination unit, configured to determine, according to a pre-trained CRF model, the entity label of each character in the sentence to be processed, and to determine the entity label sequence of the sentence to be processed;
A first candidate entity group determination unit, configured to determine a first group of candidate entities of the sentence to be processed according to the entity label sequence of the sentence to be processed;
A second candidate entity group determination unit, configured to perform term segmentation on the sentence to be processed according to a preset symptom-and-sign term segmentation strategy and determine a second group of candidate entities;
A candidate entity screening unit, configured to screen each candidate entity according to the end character of each candidate entity in the first group of candidate entities and the second group of candidate entities, forming a first group of symptom-and-sign candidate entities and a second group of symptom-and-sign candidate entities, respectively;
A symptom-and-sign entity result determination unit, configured to determine, when the first group of symptom-and-sign candidate entities and the second group of symptom-and-sign candidate entities are not identical, a symptom-and-sign entity result from the two groups according to a preset determination strategy;
The symptom-and-sign entity result determination unit comprises:
A term segmentation judgment module, configured to determine whether the sentence to be processed was segmented by a preset segmentation rule during term segmentation;
A symptom-and-sign entity result determination module, configured to select the candidate entities in the second group of symptom-and-sign candidate entities as the symptom-and-sign entity result when the sentence to be processed was segmented by a preset segmentation rule during term segmentation; to select the candidate entities in the first group of symptom-and-sign candidate entities as the symptom-and-sign entity result when the sentence to be processed was not segmented by a preset segmentation rule during term segmentation; or, alternatively, to select, from the first group and the second group of symptom-and-sign candidate entities derived from the original character string of the same sentence to be processed, the group that has fewer entities and whose entities contain more characters as the symptom-and-sign entity result; the entity types in the symptom-and-sign entity result include symptom entities and sign entities;
An entity type determination module, configured to select, when the entity types of corresponding entities in the first group of symptom-and-sign candidate entities and the second group of symptom-and-sign candidate entities are inconsistent, the entity type of the entity in the second group of candidate entities as the entity type of the corresponding entity;
The second candidate entity group determination unit comprises:
A preprocessing module, configured to convert the punctuation marks in the sentence to be processed to half-width form and unify English letters to uppercase; and to call a preset non-medical term table, check whether the original character string of the sentence to be processed contains terms from the non-medical term table, and delete any such terms found in the sentence to be processed, thereby forming a preprocessed sentence to be processed;
A symptom-and-sign ontology library matching module, configured to match the preprocessed sentence to be processed against a preset symptom-and-sign database using the reverse maximum matching principle, extract each character string in the preprocessed sentence that matches a standard term name or synonym in the symptom-and-sign database as a preliminary entity, and take the term type corresponding to that standard term name or synonym as the entity type of the preliminary entity; to match the original character string of the preprocessed sentence to be processed against a preset sentence pattern database; and, if the original character string of the preprocessed sentence to be processed matches a sentence pattern format in the preset sentence pattern database, to match that original character string against a preset disease ontology database using the reverse maximum matching principle, extract each character string that matches a standard term name or synonym in the disease ontology database as a preliminary entity, and take the term type corresponding to that standard term name or synonym as the entity type of the preliminary entity;
A second candidate entity group determination module, configured to take the preliminary entities in the preprocessed sentence to be processed as the second group of candidate entities.
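The determination strategy applied when the two screened groups differ can be sketched as below. The claims give two alternatives (rule-based preference, or "fewer entities with more characters") without saying how to rank a group that wins on one criterion and loses on the other, so ordering first by entity count and then by total character count is an assumption of this sketch, as are the function names.

```python
def choose_entity_result(first_group, second_group, segmented_by_rule=None):
    """Pick the symptom-and-sign entity result from the CRF group and the dictionary group.

    segmented_by_rule: True/False if it is known whether a preset segmentation rule
    fired during term segmentation; None to use the alternative size-based strategy.
    """
    if first_group == second_group:
        return first_group
    if segmented_by_rule is not None:
        # Rule-based strategy: trust the dictionary group only when a rule fired.
        return second_group if segmented_by_rule else first_group

    def rank(group):
        # Fewer entities first, then more characters in total (assumed tie-break).
        return (len(group), -sum(len(entity) for entity in group))

    return min(first_group, second_group, key=rank)


def resolve_entity_type(first_type, second_type):
    """When the two groups disagree on a type, the second (dictionary) group wins."""
    return second_type if first_type != second_type else first_type
```

With `first_group = ["恶心", "呕吐"]` and `second_group = ["恶心呕吐"]`, the size-based strategy returns the second group, because it covers the same characters with a single, longer entity.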
8. The multi-data-source-oriented device for recognizing symptom and sign entities according to claim 7, characterized in that the raw data in the to-be-processed sentence obtaining unit includes electronic medical record data, settlement form data, clinical research data, medical knowledge base data, and journal literature data.
9. The multi-data-source-oriented device for recognizing symptom and sign entities according to claim 8, characterized in that the entity label sequence determination unit comprises:
A CRF statistical feature value extraction module, configured to extract the CRF statistical feature values of each character in the sentence to be processed from a preset corpus; the preset corpus records each sentence in the raw data, the entities in each sentence, and the position and entity category of each entity within its sentence; the CRF statistical feature values include the word segmentation feature value, part-of-speech feature value, character feature value, context feature value, and nomenclature feature value of each character in each sentence;
A trained model determination module, configured to determine a trained model according to the CRF statistical feature values of each character in each sentence; the trained model is:
An entity label calculation module, configured to calculate the entity label y_j of each character in the sentence to be processed according to the trained model;
An entity label sequence determination module, configured to combine the entity labels of the characters to form the entity label sequence of the sentence to be processed; wherein x denotes the sentence to be processed; y_j denotes the entity label of the character at position j in the sentence to be processed; f_i(y_j, y_{j-1}, x) denotes the function value of word segmentation feature i in the sentence to be processed; λ_i is a model parameter; m denotes the number of word segmentation features; n denotes the number of character positions in the sentence to be processed; Z(x) denotes the normalization factor; P(y | x) denotes the labeling probability of the characters in the sentence to be processed.
10. The multi-data-source-oriented device for recognizing symptom and sign entities according to claim 9, characterized in that the first candidate entity group determination unit is specifically configured to:
Determine the word segmentation feature value corresponding to each character in the entity label sequence, and determine the first group of candidate entities of the sentence to be processed according to the word segmentation feature values.
11. The multi-data-source-oriented device for recognizing symptom and sign entities according to claim 10, characterized by further comprising a corpus updating unit configured to:
When the sentence to be processed is not annotated in the preset corpus, according to the formula:
Determine the uncertainty value of each entity in the sentence to be processed; wherein IE_k is the uncertainty value of the k-th entity; k_start is the start position of the entity label of the k-th entity; k_end is the end position of the entity label of the k-th entity; and the probability term is the probability that the character at position s in the sentence to be processed takes the j-th entity label;
Match the entities in the sentence to be processed whose uncertainty value is 1 against a preset symptom-and-sign ontology library and, when the match succeeds, save the entity labels of the successfully matched entities;
Determine the prediction confidence of the sentence to be processed and the proportion of dictionary-matched entities;
Add to the corpus each sentence to be processed whose prediction confidence is greater than a preset confidence threshold and whose proportion of dictionary-matched entities is greater than a preset proportion threshold, so as to update the corpus;
Wherein the prediction confidence is the product of the labeling probabilities of the characters in the sentence to be processed;
The proportion of dictionary-matched entities is C/B; wherein C is the number of entities, among all entities predicted in the sentence to be processed, that appear in a preset dictionary; B is the total number of entities predicted in the sentence to be processed.
12. The multi-data-source-oriented device for recognizing symptom and sign entities according to claim 7, characterized in that the candidate entity screening unit comprises:
A non-symptom-and-sign term character judgment module, configured to judge whether the end character of each candidate entity in the first group of candidate entities and the second group of candidate entities is a preset non-symptom-and-sign term character;
A candidate entity discarding module, configured to discard a candidate entity when its end character is a preset non-symptom-and-sign term character.
CN201710103706.4A 2017-02-24 2017-02-24 A kind of symptom and sign class entity recognition method and device towards multi-data source Active CN106897559B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710103706.4A CN106897559B (en) 2017-02-24 2017-02-24 A kind of symptom and sign class entity recognition method and device towards multi-data source

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710103706.4A CN106897559B (en) 2017-02-24 2017-02-24 A kind of symptom and sign class entity recognition method and device towards multi-data source

Publications (2)

Publication Number Publication Date
CN106897559A CN106897559A (en) 2017-06-27
CN106897559B true CN106897559B (en) 2019-09-17

Family

ID=59184141

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710103706.4A Active CN106897559B (en) 2017-02-24 2017-02-24 A kind of symptom and sign class entity recognition method and device towards multi-data source

Country Status (1)

Country Link
CN (1) CN106897559B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107808124B (en) * 2017-10-09 2019-03-26 平安科技(深圳)有限公司 Electronic device, the recognition methods of medical text entities name and storage medium
CN107993724B (en) * 2017-11-09 2020-11-13 易保互联医疗信息科技(北京)有限公司 Medical intelligent question and answer data processing method and device
CN108170677B (en) * 2017-12-27 2022-01-04 北京嘉和海森健康科技有限公司 Medical term extraction method and device
CN108986908B (en) * 2018-05-31 2023-04-18 平安医疗科技有限公司 Method and device for processing inquiry data, computer equipment and storage medium
CN109524121B (en) * 2018-11-09 2020-11-10 贵州医渡云技术有限公司 Medical file processing method and device
CN111615697A (en) * 2018-12-24 2020-09-01 北京嘀嘀无限科技发展有限公司 Artificial intelligence medical symptom recognition system based on text segment search
CN110162782B (en) * 2019-04-17 2022-04-01 平安科技(深圳)有限公司 Entity extraction method, device and equipment based on medical dictionary and storage medium
CN110176315B (en) * 2019-06-05 2022-06-28 京东方科技集团股份有限公司 Medical question-answering method and system, electronic equipment and computer readable medium
CN110298036B (en) * 2019-06-06 2022-07-22 昆明理工大学 Online medical text symptom identification method based on part-of-speech incremental iteration
CN111079420B (en) * 2019-12-19 2023-04-07 天津新开心生活科技有限公司 Text recognition method and device, computer readable medium and electronic equipment
CN111627561B (en) * 2020-05-25 2023-05-12 讯飞医疗科技股份有限公司 Standard symptom extraction method, device, electronic equipment and storage medium
CN112507703B (en) * 2020-12-07 2022-11-08 医渡云(北京)技术有限公司 Medical entity identification method, device, medium and electronic equipment
CN112784590A (en) * 2021-02-01 2021-05-11 北京金山数字娱乐科技有限公司 Text processing method and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110035210A1 (en) * 2009-08-10 2011-02-10 Benjamin Rosenfeld Conditional random fields (crf)-based relation extraction system
CN102033950A (en) * 2010-12-23 2011-04-27 哈尔滨工业大学 Construction method and identification method of automatic electronic product named entity identification system
CN106202054B (en) * 2016-07-25 2018-12-14 哈尔滨工业大学 A kind of name entity recognition method towards medical field based on deep learning
CN106407183B (en) * 2016-09-28 2019-06-28 医渡云(北京)技术有限公司 Medical treatment name entity recognition system generation method and device

Also Published As

Publication number Publication date
CN106897559A (en) 2017-06-27

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20200123

Address after: 100027 Chaoyang District Xinyuan 16, Beijing 14 floor 2 12B06

Co-patentee after: HARBIN INSTITUTE OF TECHNOLOGY

Patentee after: Yi Bao Interconnected Medical Information Technology (Beijing) Co., Ltd.

Address before: 150000 Heilongjiang Province, Harbin City Economic Development Zone haping Road District Road No. 9 China Songhua Valley Software Park Building 1, room 214

Co-patentee before: HARBIN INSTITUTE OF TECHNOLOGY

Patentee before: Heilongjiang Teshi Information Technology Co. Ltd.