CN106383853A

CN106383853A - Realization method and system for electronic medical record post-structuring and auxiliary diagnosis

Info

Publication number: CN106383853A
Application number: CN201610787187.3A
Authority: CN
Inventors: 刘勇; 琚生根; 王俊峰; 苏翀
Original assignee: Individual
Current assignee: Individual
Priority date: 2016-08-30
Filing date: 2016-08-30
Publication date: 2017-02-08

Abstract

The invention relates to a method and system for post-structuring electronic medical records and for auxiliary diagnosis. Text similarity is measured with a combination of several distance metrics: the string edit distance is the minimum number of substitution, insertion and deletion operations needed to convert one character string into another; the Jaro-Winkler distance measures the similarity between two character strings and is used for duplicate-record detection; and the geometric mean of the Chinese character distance and the distances between Chinese input-method codes (pinyin and Wubi) is taken as the comprehensive similarity between feature texts. Feature ranking uses the TF-IDF method, which assesses the importance of a feature term to a document in a document set or corpus: the importance is directly proportional to the term's frequency in the document and inversely proportional to the number of corpus documents in which the term occurs. According to the generated feature terms, the documents are converted into the file format required for PU learning (learning from a positive example data set and an unlabeled data set); through PU learning, the system automatically recommends related diagnoses for clinical staff to consult.

Description

Implementation method and system for post-structuring and auxiliary diagnosis of electronic medical records

Technical field

The present invention relates to an electronic medical record structuring system and its implementation method, and in particular to an implementation method and system for post-structuring and auxiliary diagnosis of electronic medical records.

Background technology

Traditional electronic medical record data are recorded as free-text descriptions. Although the structure of a medical record follows certain standards, clinical medicine is complex: each specialty has its own content, and even the same content is described in different ways, so generating a well-structured electronic medical record is difficult. Extracting structured content from plain-text descriptions by natural language processing (NLP) is one approach. Another is to structure the medical record information at entry time through structured data entry. A fully structured electronic medical record system, however, sometimes cannot fully express what the clinician actually means, and it places very high demands on its users, even though full structuring does bring convenience to clinical data analysis and research. This approach also demands a high degree of standardization: structured content must be described with standard medical terms, yet the concepts in standard medical terminology coding systems are not divided so finely, and in practice the accuracy gained through standardization conflicts with flexibility of entry. Standards addressing this problem have been released internationally, such as SNOMED (the Systematized Nomenclature of Human and Veterinary Medicine), SNOMED CT and ICD-10 (the International Classification of Diseases), but they usually need substantial adjustment in practical use, and their Chinese localization lags behind. All of these factors affect the use of standardized medical terminology in structured entry, and they also affect the mining of medical data from electronic medical records. Because of these practical constraints, it is difficult for hospital clinical staff to use fully structured electronic medical record systems on a large scale. Structured electronic medical records have many advantages, but they are hard to implement and demanding for users. By comparison, free-text entry is much more flexible and is easy to deploy and use.

The mainstream domestic electronic medical record systems were also designed with structuring in mind, but because medicine is complex and variable, fully structured electronic medical records are hard to achieve. Some systems have been designed to support structuring and clinical decision support, but in actual operation data must be entered according to the electronic medical record specification, usually through the input units the system provides. Because the input is relatively standardized, later data extraction and use are convenient, provided that the structured templates are designed to meet the requirements of medical record structuring. Structured templates require the cooperation of domain professionals, and the workload is very large. For structured nursing records and operation records, for example, different templates must be made for different examination sites, anesthesia methods, nursing levels and so on. This requires the participation of expert-level medical professionals, whose competence and degree of involvement strongly affect the quality of the result, and templates can hardly cover every complex situation. In addition, the medical terminology standardization involved in structured electronic medical records still lacks a classification system and related standards that are complete, unified, easy to use and backed by extensive practical application.

Standard terminology sets such as SNOMED, SNOMED CT, ICD-10 and ICD-9-CM (International Classification of Diseases, Clinical Modification) exist, but because they are essentially translated from foreign languages, not every standard term is satisfactorily rendered in Chinese, which causes some inconvenience in practical work. Because of these shortcomings, structured medical records have not been implemented satisfactorily; in particular, canonical structures such as {body part} {common symptom} {number} {time unit} are far from ubiquitous. Reportedly, most electronic medical record systems currently used in hospitals were not designed with structuring in mind. Even so-called structured electronic medical records based on XML are only partially structured rather than structured in the true sense: sections such as the chief complaint, past medical history and laboratory examinations are mostly free-text entries. Yet the information in these sections is often the most valuable for reference, and the feature elements they contain are important guidance for clinical research.

Some existing articles discuss the structuring of electronic medical records and feature recognition on structured data, but they presuppose that the electronic medical record system was designed to a standardized structure from the outset, with customized templates that meet the requirements, entry carried out in a structured manner, and terms drawn from a relatively standard medical terminology set. Regrettably, many electronic medical record systems were not designed this way, or the design is not followed in use. Standardization and freedom of entry are inherently in conflict: standardization restricts freedom, while freedom produces a large amount of non-standard data. Such non-standard data must be analyzed in depth, using specific techniques to screen, refine and analyze feature terms; only when each intermediate step is handled well can meaningful guidance be provided for clinical data analysis. The shortcomings of structured electronic medical records have hindered their development in China, so many hospitals still use electronic medical record systems with free-text entry. Such a record is merely a transcription of paper data into electronic form and does not lend itself to in-depth data analysis. Many electronic medical record systems have no complete standard to follow during entry and no unified specification, which is a potential obstacle to later data exchange, data integration and data analysis. Standardizing all data in one step is unrealistic, so structuring and standardizing the existing unstructured, non-standard data is a meaningful task. Only after structuring can the required data be extracted from the structured information and analyzed, providing due help for the smooth development of clinical research.

(1) Named entity recognition in electronic medical records covers not only person names and place names but also disease names, symptom names, operation names, drug names and so on. Methods based on statistical learning are trained on an annotated corpus, so annotating the corpus does not require much domain knowledge. Such methods are now widely used in natural language processing. Common statistical learning models include the support vector machine (SVM), the hidden Markov model (HMM), the maximum entropy Markov model (MEMM) and the conditional random field (CRF). Hidden Markov models can be used for automatic word segmentation and part-of-speech tagging of Chinese. HMMs are also used in other Chinese word segmentation methods, one of which, character-based word formation (character tagging), has achieved good results. This method was proposed by N. Xue et al.; its main idea is to treat segmentation as a character classification problem. Conventional methods first build a dictionary and segment by dictionary lookup, whereas in character-based word formation each Chinese character occupies a word-formation position (lexeme) within the word it helps to form. These positions are generally described as word-initial (B), word-internal (I), word-final (E) and single-character word (S). The conditional random field model, proposed on the basis of the hidden Markov model and the maximum entropy model, is a conditional probability model for labeling and segmenting sequential data; it is a discriminative, undirected probabilistic graphical model. CRFs have been applied successfully in natural language processing (NLP), bioinformatics, network intelligence and other fields.

(2) Some of the feature terms obtained by entity recognition are similar or close in meaning, or even identical in meaning, differing only because operators entered non-standard terms. For example, "coronary stenting" and "coronary artery stent implantation" actually refer to the same thing. Non-standard input thus leads the system to extract two different feature terms. The features are therefore standardized by calculating the degree of similarity between feature terms.
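The word-position labels described in item (1) can be illustrated with a short sketch. It assumes the text has already been segmented into words; the function names `lexeme_tags` and `tag_sentence` are hypothetical and not taken from the patent.

```python
from typing import List, Tuple


def lexeme_tags(word: str) -> List[str]:
    """Return one position tag per character: B (initial), I (internal), E (final), S (single)."""
    if len(word) == 1:
        return ["S"]
    return ["B"] + ["I"] * (len(word) - 2) + ["E"]


def tag_sentence(words: List[str]) -> List[Tuple[str, str]]:
    """Flatten a segmented sentence into (character, tag) pairs."""
    return [(ch, tag) for w in words for ch, tag in zip(w, lexeme_tags(w))]


if __name__ == "__main__":
    # Hypothetical segmented input: a four-character term followed by a single-character word.
    for ch, tag in tag_sentence(["冠状动脉", "痛"]):
        print(ch, tag)
```

Pairs of this kind are what a character-based CRF tagger is trained to predict, one label per character.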

Content of the invention

To remedy the above deficiencies of the prior art, the object of the present invention is to provide an implementation method and system for post-structuring and auxiliary diagnosis of electronic medical records, realizing the structuring of medical record information by means of structured entry.

The object of the present invention is achieved by the following technical solution:

The present invention provides an implementation method for post-structuring and auxiliary diagnosis of electronic medical records, the improvement being that the implementation method comprises the following steps:

(1) Structuring of electronic medical record text:

S11: Set up the medical dictionary;

S12: Set up the medical corpus;

S13: Process medical feature terms;

(2) Auxiliary diagnosis management:

S21: Determine the feature term set and the feature term frequencies formed from the electronic medical record documents;

S22: Perform PU training on the feature term frequencies and carry out PU learning;

S23: Obtain the auxiliary diagnosis result.

Further, in step S11, the medical dictionary includes:

A standard medical dictionary, which takes as its standard the data of the globally used International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10); the International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM), for operations and procedures; and the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT);

A clinical medicine application dictionary, including an internal dictionary and a thesaurus, where the internal dictionary includes a clinical symptom dictionary and other related dictionaries of examination terms;

The thesaurus includes: the mapping of non-standardized feature terms to standardized feature terms, the mapping of misspelled words to standardized feature terms, and the mapping of a sole standard term to multiple standard terms.

Further, in step S12, setting up the medical corpus comprises the following steps:

S121: Extract electronic medical record documents from the electronic medical record database;

S122: Perform part-of-speech tagging and lexeme tagging on the electronic medical record documents;

S123: Integrate the data of the documents after part-of-speech tagging and lexeme tagging;

S124: Make feature templates and train with the CRF algorithm using the feature templates;

S125: Form the feature data and evaluate the effect of the CRF algorithm;

S126: Finally form the medical corpus.

Further, in step S122, part-of-speech tagging means preprocessing the extracted electronic medical record documents to obtain the part of speech of the text in them, combining the result with the lexeme tags, converting it into the conditional random field (CRF) format, and performing feature extraction with the CRF algorithm; the automatically tagged electronic medical record documents are then checked manually;

Lexeme tagging uses the standard medical dictionary to increase the hit rate on the text of electronic medical record documents. Using the reverse maximum matching algorithm, medical terms are matched and automatically tagged as word-initial (B), word-internal (I) and word-final (E). (The reverse maximum matching method scans from the end of the document being processed; each time it takes the last 2i single-byte characters, i.e. a string of i Chinese characters, as the matching field, and if matching fails it removes the first character of the matching field and continues matching.);
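The reverse maximum matching scan described above can be sketched as follows. This is a minimal illustration and not the patent's program: the function name `backward_max_match`, the `max_len` window and the tiny dictionary are assumptions made for the example.

```python
from typing import List, Set


def backward_max_match(text: str, dictionary: Set[str], max_len: int = 8) -> List[str]:
    """Segment text by scanning from its end and preferring the longest dictionary match."""
    words = []
    end = len(text)
    while end > 0:
        start = max(0, end - max_len)
        # Drop the first character of the candidate field until it matches
        # the dictionary or only a single character is left.
        while start < end - 1 and text[start:end] not in dictionary:
            start += 1
        words.append(text[start:end])
        end = start
    words.reverse()
    return words


if __name__ == "__main__":
    # Hypothetical mini dictionary of medical terms.
    terms = {"冠状动脉", "支架", "植入术", "冠状"}
    print(backward_max_match("冠状动脉支架植入术", terms))
```

Characters that match no dictionary entry fall out as single-character words; each matched term can then be labeled B, I and E character by character.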

CRF training is carried out in step S124, and the trained model is then applied with a command of the form: % crf_test -m model test.data > output.txt. The result is written to output.txt, where the labels to be predicted are compared with the predicted labels for evaluation;

The fields in the output.txt produced by the CRF algorithm are separated by TAB characters; all of these are replaced with ordinary spaces, because conlleval.pl recognizes the space as its separator;
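The separator conversion described above is a one-line transformation. The following sketch assumes the tagger output is available as a list of text lines; the function name `tabs_to_spaces` is hypothetical.

```python
from typing import Iterable, List


def tabs_to_spaces(lines: Iterable[str]) -> List[str]:
    """Replace TAB field separators with single spaces and strip trailing newlines."""
    return [line.rstrip("\n").replace("\t", " ") for line in lines]


if __name__ == "__main__":
    # Hypothetical tagger output: character, part of speech, gold tag, predicted tag.
    sample = ["咳\tv\tB\tB\n", "嗽\tv\tE\tE\n", "\n"]
    print("\n".join(tabs_to_spaces(sample)))
```

Blank lines, which separate sentences in this kind of file, are preserved as empty strings.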

In step S125, the effect of the CRF algorithm is evaluated with the following criteria:

TP (true positive): a positive sample predicted as positive by the model;

FP (false positive): a negative sample predicted as positive by the model;

FN (false negative): a positive sample predicted as negative by the model;

TN (true negative): a negative sample predicted as negative by the model;

Precision: P = TP / (TP + FP);

Recall: R = TP / (TP + FN), i.e. the true positive rate;

F1 (comprehensive classification rate): the harmonic mean of precision and recall, which lies closer to the smaller of P and R: F = 2 * P * R / (P + R).
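The three measures can be checked with a small worked example. The function name `precision_recall_f1` and the counts used below are illustrative assumptions, not results from the patent.

```python
from typing import Tuple


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute P = TP/(TP+FP), R = TP/(TP+FN) and F = 2PR/(P+R), returning 0.0 for empty denominators."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


if __name__ == "__main__":
    # Hypothetical counts: 8 true positives, 2 false positives, 8 false negatives.
    p, r, f = precision_recall_f1(8, 2, 8)
    print(f"P={p:.3f} R={r:.3f} F={f:.3f}")
```

With P = 0.8 and R = 0.5 the F value is about 0.615, below the arithmetic mean of 0.65, which shows how the harmonic mean is pulled toward the smaller of the two.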

Further, in step S13, processing of medical feature terms comprises the following steps:

S131: After the electronic medical record documents have been processed by the CRF algorithm, a text is obtained in which the position of each character of the test set data is tagged as word-initial (B), word-internal (I) or word-final (E). A corresponding program then derives the feature set. Some feature words in the feature set are words already in the dictionary; others are not in any related dictionary but are obtained after training on data with manually annotated CRF feature templates, i.e. so-called unregistered (out-of-vocabulary) words;

S132: Feature extraction yields a feature term set containing both standard and non-standard feature terms. Using the thesaurus that maps non-standardized feature terms to standardized feature terms, each non-standard feature term is compared for similarity with the non-standard feature terms in the thesaurus, and the resulting similarity values are ranked in descending order;

S133: The similarity threshold is preliminarily set to 50%, i.e. similarity greater than or equal to 50%. Non-standard feature terms that meet the threshold, together with their corresponding standard feature terms, are recommended to the operator as candidate feature terms for reference; the operator determines the standard feature term corresponding to each non-standard feature term, which becomes the final standard feature term. The threshold can be set freely by hand.
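The candidate recommendation of step S133 amounts to scoring, filtering by threshold and sorting. The sketch below takes the similarity measure as a parameter; `char_overlap` is only a stand-in (Jaccard overlap of character sets) so that the example runs, and is not the similarity measure of the patent. All names are hypothetical.

```python
from typing import Callable, Iterable, List, Tuple


def char_overlap(a: str, b: str) -> float:
    """Stand-in similarity: Jaccard overlap of the two character sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


def recommend_candidates(
    term: str,
    known_terms: Iterable[str],
    similarity: Callable[[str, str], float],
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return known terms whose similarity to `term` meets the threshold, highest first."""
    scored = [(known, similarity(term, known)) for known in known_terms]
    kept = [(known, score) for known, score in scored if score >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    print(recommend_candidates("abcd", ["abcf", "xyz", "abcd"], char_overlap))
```

The final choice among the returned candidates is left to the operator, as the step describes; the threshold is an ordinary parameter so that it can be adjusted by hand.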

Further, in step S132, the weights of each feature term in all electronic medical record documents (calculated with the TF-IDF method) are summed, the mean weight of each feature term over all electronic medical record documents is obtained, and the terms are ranked in descending order;
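The ranking described above can be sketched as follows. The text does not give the exact TF-IDF variant, so this example assumes term frequency normalized by document length and IDF = log(N / document frequency); the function name `rank_terms_by_mean_tfidf` and the toy documents are hypothetical.

```python
import math
from collections import Counter
from typing import List, Tuple


def rank_terms_by_mean_tfidf(docs: List[List[str]]) -> List[Tuple[str, float]]:
    """Sum each term's TF-IDF weight over all documents, average it, and rank in descending order."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    totals = Counter()
    for doc in docs:
        for term, count in Counter(doc).items():
            tf = count / len(doc)                    # assumed: length-normalized term frequency
            idf = math.log(n_docs / doc_freq[term])  # assumed: plain logarithmic IDF
            totals[term] += tf * idf
    means = {term: total / n_docs for term, total in totals.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical documents, each already reduced to a list of feature terms.
    docs = [["cough", "fever"], ["cough", "dyspnea"], ["cough", "fever"]]
    for term, weight in rank_terms_by_mean_tfidf(docs):
        print(term, round(weight, 4))
```

A term that occurs in every document, such as "cough" here, receives an IDF of zero under this variant and therefore ranks last.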

In step S133, the similarity of feature terms within the corresponding feature term set is calculated with a feature text similarity measure; the comprehensive similarity is finally computed as the geometric mean of the Chinese character distance, the pinyin distance and the Wubi (five-stroke input code) distance;

The Chinese character distance, the pinyin distance and the Wubi distance are each calculated with two measures, the Jaro-Winkler string similarity distance and the string edit distance, and the mean of the two is taken as the distance metric.
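The combined measure can be sketched under stated assumptions: the edit distance is turned into a similarity as 1 - d / max(len), the Jaro-Winkler prefix weight is the customary 0.1, and the pinyin and Wubi encoders are passed in as hypothetical callables (`to_pinyin`, `to_wubi`) because the text does not specify a code table. None of the function names come from the patent.

```python
from typing import Callable


def jaro(s: str, t: str) -> float:
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_m = [c for c, hit in zip(s, s_hit) if hit]
    t_m = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_m, t_m)) // 2
    return (matches / len(s) + matches / len(t) + (matches - transpositions) / matches) / 3


def jaro_winkler(s: str, t: str, p: float = 0.1) -> float:
    base = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):  # common prefix, capped at four characters
        if a != b:
            break
        prefix += 1
    return base + prefix * p * (1 - base)


def edit_distance(s: str, t: str) -> int:
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def pair_similarity(s: str, t: str) -> float:
    """Mean of Jaro-Winkler similarity and normalized edit similarity (normalization is an assumption)."""
    longest = max(len(s), len(t))
    edit_sim = 1.0 if longest == 0 else 1 - edit_distance(s, t) / longest
    return (jaro_winkler(s, t) + edit_sim) / 2


def combined_similarity(a: str, b: str,
                        to_pinyin: Callable[[str], str],
                        to_wubi: Callable[[str], str]) -> float:
    """Geometric mean of the similarities over characters, pinyin codes and Wubi codes."""
    scores = [pair_similarity(a, b),
              pair_similarity(to_pinyin(a), to_pinyin(b)),
              pair_similarity(to_wubi(a), to_wubi(b))]
    return (scores[0] * scores[1] * scores[2]) ** (1 / 3)
```

In use, `to_pinyin` and `to_wubi` would map a Chinese term to its pinyin string and its Wubi key sequence; the geometric mean makes a low score in any one representation pull the combined value down.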

Further, step S22 includes: the feature term frequencies form a test data set consisting of a positive example document data set and an unlabeled document data set. Learning from the positive example document data set and the unlabeled document data set, a framework that learns from the sets P and U is used to distinguish positive example documents from negative example documents in the test data set; this is PU learning, where P denotes the set of positive example documents and U denotes the unlabeled data set, which contains the negative example documents. Without any labeling of negative example documents, a classifier is learned, and the classifier is used to label the unlabeled document data set to obtain the required documents.

Further, in step S22, medical record data with a confirmed diagnosis of a given disease are marked to form the positive example document data set, which is combined with the unlabeled medical record data (the unlabeled document data set) to form the training set for learning; the classifier obtained with the PU learning framework is used to label future electronic medical record documents, achieving the purpose of auxiliary diagnosis.
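The passage above does not name a specific PU algorithm, so the following is only one possible illustration: a simplified centroid-based (Rocchio-like) two-step scheme, which first treats U as provisionally negative to pick reliable negatives and then classifies against the resulting prototypes. Documents are assumed to be sparse term-weight dictionaries, P and U are assumed non-empty, and all names are hypothetical.

```python
import math
from collections import defaultdict
from typing import Callable, Dict, List

Vector = Dict[str, float]


def centroid(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of sparse term-weight vectors."""
    total = defaultdict(float)
    for vec in vectors:
        for term, weight in vec.items():
            total[term] += weight
    return {term: weight / len(vectors) for term, weight in total.items()}


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def train_pu_classifier(positive: List[Vector], unlabeled: List[Vector]) -> Callable[[Vector], bool]:
    """Step 1: pick reliable negatives from U. Step 2: classify against P and negative prototypes."""
    pos_proto = centroid(positive)
    unl_proto = centroid(unlabeled)
    reliable_neg = [d for d in unlabeled if cosine(d, unl_proto) > cosine(d, pos_proto)]
    neg_proto = centroid(reliable_neg) if reliable_neg else unl_proto

    def classify(doc: Vector) -> bool:
        # True means the document is labeled as a positive example.
        return cosine(doc, pos_proto) > cosine(doc, neg_proto)

    return classify


if __name__ == "__main__":
    # Hypothetical term weights for confirmed-diagnosis (P) and unlabeled (U) records.
    P = [{"chest_pain": 1.0, "stent": 1.0}, {"chest_pain": 1.0}]
    U = [{"cough": 1.0, "fever": 1.0}, {"cough": 1.0}, {"chest_pain": 1.0, "stent": 0.5}]
    classify = train_pu_classifier(P, U)
    print([classify(doc) for doc in U])
```

No negative example is labeled by hand at any point; the unlabeled record that resembles the positive set is not taken as a reliable negative and ends up classified as positive.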

The present invention also includes a system for post-structuring and auxiliary diagnosis of electronic medical records, the improvement being that the system includes:

Medical dictionary management module: used for standard dictionary management and clinical medicine application dictionary management. The clinical medicine application dictionary includes an internal dictionary and a thesaurus; the internal dictionary includes a clinical symptom dictionary and other related dictionaries of examination terms; the thesaurus includes the mapping of non-standardized feature terms to standardized feature terms, the mapping of misspelled words to standardized feature terms, and the mapping of a sole standard term to multiple standard terms;

Medical corpus management module: used for extracting electronic medical record document data, part-of-speech tagging and lexeme tagging, and for making feature templates, feature tagging and feature extraction;

Medical feature term processing module: used for the standardized management of feature terms;

Auxiliary diagnosis management module: used for PU learning framework management, PU learning training and test management, and PU learning auxiliary diagnosis management.

To give a basic understanding of some aspects of the disclosed embodiments, a brief summary is given below. This summary is not an extensive overview, nor is it intended to identify key or critical elements or to delineate the scope of protection of these embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the detailed description that follows.

Compared with the closest prior art, the technical solution provided by the present invention has the following advantageous effects:

The present invention standardizes features by calculating the degree of similarity between feature terms. Feature text similarity is measured with a combination of several distance metrics. The Jaro-Winkler distance measures the similarity between two strings; it is a variant of the Jaro distance metric and is used to detect duplicate records. The string edit distance is the minimum number of substitution, insertion and deletion operations needed to change one string into another. The geometric mean of the Chinese character distance, the pinyin distance and the Wubi distance is used as the final comprehensive similarity measure. Feature ranking uses TF-IDF (term frequency-inverse document frequency), a statistical method for assessing how important a feature term is to one document in a document set or corpus: the importance of a feature term is directly proportional to the number of times it occurs in the document and inversely proportional to the frequency with which it occurs in the corpus. According to the generated feature terms, the documents are converted into the file format for PU learning (learning from a positive example data set and an unlabeled data set); through PU learning, the system automatically recommends related diagnoses for clinical staff to consult.

To the accomplishment of the foregoing and related ends, one or more embodiments include the features described in detail below and particularly pointed out in the claims. The following description and the accompanying drawings set forth certain illustrative aspects in detail, which indicate only some of the various ways in which the principles of the embodiments may be employed. Other benefits and novel features will become apparent from the following detailed description considered in conjunction with the drawings, and the disclosed embodiments are intended to include all such aspects and their equivalents.

Brief description of the drawings

Fig. 1 is a structural block diagram of the system for post-structuring and auxiliary diagnosis of electronic medical records according to the first preferred technical solution provided by the present invention;

Fig. 2 is a structure diagram of the medical dictionary provided by the present invention;

Fig. 3 is a schematic diagram of multi-standard term synthesis provided by the present invention;

Fig. 4 is a schematic diagram of the process for establishing the medical domain corpus provided by the present invention;

Fig. 5 is a schematic diagram of the conditional random field (CRF) algorithm format provided by the present invention;

Fig. 6 is a flow chart of corpus lexeme tagging provided by the present invention;

Fig. 7 is a schematic diagram of the CRF training file format for character-based word formation provided by the present invention;

Fig. 8 is a schematic diagram of feature template 1 and feature template 2 provided by the present invention;

Fig. 9 is a schematic diagram of the feature term processing flow provided by the present invention;

Fig. 10 is a schematic diagram of non-standard feature term tagging provided by the present invention;

Fig. 11 is the auxiliary diagnosis flow chart provided by the present invention;

Fig. 12 is a schematic diagram of PU learning without class labels according to the second preferred technical solution provided by the present invention;

Fig. 13 is a schematic diagram of PU learning with class labels according to the second preferred technical solution provided by the present invention;

Fig. 14 is a schematic diagram of the positive example document recall according to the second preferred technical solution provided by the present invention;

Fig. 15 is a schematic diagram of the positive example document precision according to the second preferred technical solution provided by the present invention;

Fig. 16 is a schematic diagram of the F-value according to the second preferred technical solution provided by the present invention;

Fig. 17 is a schematic diagram of the overall accuracy according to the second preferred technical solution provided by the present invention;

Fig. 18 is a schematic diagram of the recall and precision of the comprehensive similarity provided by the present invention.

Specific embodiment

Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

The following description and drawings sufficiently illustrate specific embodiments of the present invention to enable those skilled in the art to practice them. Other embodiments may include structural, logical, electrical, process and other changes. The embodiments represent only possible variations. Unless explicitly required, individual components and functions are optional, and the order of operations may vary. Portions and features of some embodiments may be included in or substituted for those of other embodiments. The scope of the embodiments of the present invention includes the full scope of the claims and all available equivalents of the claims. Herein, these embodiments of the invention may be referred to individually or collectively by the term "invention" merely for convenience, and, if more than one invention is in fact disclosed, without intending to automatically limit the scope of this application to any single invention or inventive concept.

First preferred technical solution:

Fig. 1 is a structural block diagram of the system for post-structuring and auxiliary diagnosis of electronic medical records according to the first preferred technical solution provided by the present invention. The present invention provides an implementation method for post-structuring and auxiliary diagnosis of electronic medical records, comprising the following steps:

(1) Structuring of electronic medical record text, including:

S11: Establishment of the medical dictionary:

Because word segmentation tools are generally not aimed at medical specialties, their built-in dictionaries cannot contain most specialized medical terms. To set up the relevant dictionaries quickly, the present invention uses part of the data of ICD-10, ICD-9-CM and SNOMED CT as the standard and combines it with the hospital clinical application dictionary to form the medical dictionary, as shown in Fig. 2.

The medical dictionary includes:

1. The hospital clinical application dictionary, including an internal dictionary and a thesaurus, where the internal dictionary includes a clinical symptom dictionary and other related dictionaries of examination terms;

(1) Internal dictionary:

Clinical symptom dictionary:

For example: chills, fever, shivering, cough, expectoration, headache, head pain, dizziness, nasal congestion, runny nose, chest tightness, asthma, abdominal pain, abdominal distension, frequent urination, urgent urination, muscle soreness, malaise, weakness, dyspnea, hemoptysis, etc.

Other related dictionaries:

Various examination terms, such as full chest X-ray, chest CT, etc.

(2) Thesaurus, including: the mapping of non-standardized feature terms to standardized feature terms, the mapping of misspelled words to standardized feature terms, and the mapping of a sole standard term to multiple standard terms.

When electronic medical records are written, clinicians differ in medical background and in their command of relevant medical knowledge, so the degree to which they master standard medical terminology also differs. Expecting every doctor to know all standard clinical terms accurately is unrealistic, and clerical errors also occur during entry. The thesaurus should therefore comprise the following three parts, which can be merged into one dictionary:

The mapping of non-standardized feature terms to standardized feature terms, as shown in Table 1.

Table 1: Mapping of non-standardized feature terms to standardized feature terms

Non-standardized feature term	Standardized feature term
The sick Crohndisease of clone	Crohn disease (regional ileitis)
Kernig's sign	Kernig sign
Hemoptysis, cough up phlegm	Spitting of blood, expectoration
Antibiotic	Antibiotic
Anti-inflammatory treatment	Anti-infective therapy
Cranial nerve	Cranial nerve
Presbyopic	Presbyopia
Lymph gland	Lymph node
Presenium disease is stayed	Alzheimer disease
Frozen section	Freezing microtome section
Rale	Sound
Lymphoblast	Lymphoblast
Mould	Fungi

The mapping of misspelled words to standardized feature terms, as shown in Table 2.

Table 2: Mapping of misspelled words to standardized feature terms

Misspelled word	Standardized feature term
Tang's urine disease | sugared ornithosis	Diabetes
Spontaneous immunity	Autoimmunity

In actual use, Table 1 and Table 2 may also be merged into a single dictionary, i.e. the thesaurus.
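Merging the two tables into one lookup structure can be sketched with plain dictionaries. The entries below are taken from the translated Tables 1 and 2; the function names and the case-insensitive lookup are assumptions made for this example.

```python
from typing import Dict

# A few rows of Table 1 (non-standardized term -> standardized term).
TABLE_1 = {
    "Lymph gland": "Lymph node",
    "Anti-inflammatory treatment": "Anti-infective therapy",
    "Presbyopic": "Presbyopia",
}

# One row of Table 2 (misspelled word -> standardized term).
TABLE_2 = {
    "Spontaneous immunity": "Autoimmunity",
}


def build_thesaurus(*tables: Dict[str, str]) -> Dict[str, str]:
    """Merge several variant -> standard mappings into one lookup keyed by the lower-cased variant."""
    merged = {}
    for table in tables:
        for variant, standard in table.items():
            merged[variant.strip().lower()] = standard
    return merged


def normalize_term(term: str, thesaurus: Dict[str, str]) -> str:
    """Return the standardized term if the variant is known, else the term unchanged."""
    return thesaurus.get(term.strip().lower(), term)


if __name__ == "__main__":
    thesaurus = build_thesaurus(TABLE_1, TABLE_2)
    print(normalize_term("lymph gland", thesaurus))
    print(normalize_term("Chest CT", thesaurus))
```

A term that is not in the thesaurus is returned unchanged, so it can be passed on to the similarity-based matching instead.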

A kind of situation is also had to be exactly that some term has multiple standards expression, using wherein any one is all specification , but reality during structurized it should the method with reference to SNOMED CT sets up a dictionary it is simply that sole criterion art The mapping of language and multiple standard terminology is it is also possible to regard as non-standardization featured terms to standardization featured terms this situation Mapping special circumstances it is also possible to and table 1, table 2-in-1 and in a thesaurus, as shown in Figure 3.

S12: Building the medical corpus: establish and maintain a corpus for the medical domain. As shown in Figure 4, this includes the following steps:

S126: Finally, form the medical corpus.

Specifically：

In step S121, extracting the electronic medical record documents includes:

Because manually annotating a large-scale corpus is difficult, a human-computer cooperative approach is used here to build a small-scale corpus quickly. The specific steps are as follows:

1. 887 electronic medical record documents were collected manually, covering patient data from departments such as cardiology, oncology and respiratory medicine.

2. A program automatically extracts the text of each patient's chief complaint, history of present illness, past medical history, and laboratory and instrument examinations as the raw files for processing.

3. Finally, the text is annotated automatically with the appropriate tools and the annotations are then reviewed manually, so that a corpus can be built rapidly.

In step S122：

1. Part-of-speech tagging of the corpus:

The ICTCLAS word segmentation system of the Chinese Academy of Sciences is a Chinese lexical analysis system based on a hierarchical hidden Markov model. It offers many functions, chiefly part-of-speech tagging, Chinese word segmentation, named entity recognition and unknown-word recognition, supports user-supplied dictionaries, and is widely used across Chinese information processing.

The present invention builds on the relevant ICTCLAS functions through secondary development to preprocess the text before annotation. The purpose of this module is to obtain the part of speech of the text quickly, so that the next step can use conditional random fields for feature extraction. A sample of the output is shown below:

【Master/a tells/v：/ w cough/v expectoration/expiratory dyspnea/n3/n days/q of n companion/v./ w is existing/t medical history/n：/ w3/n days/q Before/f patient/n is in hospital in/p our hospital/n breathing/v section/n/v during/f appearances/v cough/v ,/w expectoration/n ,/w independently/v row/v Phlegm/n difficulty/a ,/w need/v auxiliary/v row/v phlegm/n ,/w is /p is a large amount of/m grey/n mucus/n phlegm/n ,/w not /d is shown in/v phlegm/n In/f band/v blood/n.During/w/n has/v expiratory dyspnea/n companion/vSPO2/x decline/v (/w minimum/a70%/m)/w ,/w gives/and v turns over After the body/v bat/v back of the body/v suction/v phlegm/n/f improvement/v.In/w the course of disease/n/f no/v heating/v ,/w no/v nausea/a vomiting/n ,/w No/v drop in blood pressure/n ,/w no/v spitting of blood/n ,/w no/v is black/a just/n./ w chest/nCT/x shows/v (/w2013-6-15/m)/ w：/ w is slow/and a props up/q changes/v pulmonary emphysema/n companion/v infection/v ,/w two/m pulmonary fibrosis/n ,/w both sides/f pleura/n plumpness/a companion/ V pleural effusion/n ,/w fall/v sustainer/n increasing/v width/a./w】

To meet the file-format requirements of CRF++-0.53, a computer program converts the ICTCLAS word segmentation results into the required format, as shown in Figure 5.
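The conversion step can be sketched as follows. This is only an illustration under stated assumptions: the input is taken to be space-separated `word/tag` items like the sample above, the output follows the general CRF++ convention of one token per line with whitespace-separated columns, and the function name `ictclas_to_crf_rows` and the Chinese sample sentence are hypothetical. The exact layout in Figure 5 is not reproduced here.

```python
def ictclas_to_crf_rows(tagged_text):
    """Turn 'word/pos word/pos ...' into 'word<TAB>pos' rows, one token per line."""
    rows = []
    for item in tagged_text.split():
        # Split on the last slash so that words containing '/' stay intact.
        word, sep, pos = item.rpartition("/")
        if not sep or not word:
            continue  # assumption: items without a word or a tag are skipped
        rows.append(f"{word}\t{pos}")
    return rows


# Illustrative input (cough / expectoration / with / dyspnea / 3 / days / .)
sample = "咳嗽/v 咳痰/n 伴/v 呼吸困难/n 3/m 天/q 。/w"
for row in ictclas_to_crf_rows(sample):
    print(row)
```

A further column holding the word-position tag would be appended to each row before CRF training.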

2. Word-position tagging of the corpus

To obtain the training corpus that CRF learning requires, every word in the documents must be given a word-position tag. Manual tagging is clearly inefficient, so fast computer-assisted tagging is used instead. Tagging relies on the standard dictionaries of the medical domain: the system adds the terms of ICD-10, ICD-9-CM, SNOMED, SNOMED CT, the thesaurus and so on to the dictionary in order to raise the hit rate of word segmentation. Medical terms for diagnoses, operations and examinations are generally long, so a reverse maximum matching algorithm is used to tag them automatically as prefix (B), word-internal (I) and suffix (E). Because the dictionary cannot contain every standard medical term, the automatically tagged corpus is verified manually after dictionary matching, as shown in Figure 6. The CRF training file format that results from this word-formation tagging is shown in Figure 7.
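The dictionary-based tagging described above can be sketched as follows. This is a simplified illustration, not the patented implementation: the lexicon, the `max_len` window, the character-level tagging and the extra `O` tag for text outside the dictionary are all assumptions introduced for the example.

```python
def reverse_max_match(text, lexicon, max_len=8):
    """Segment text from right to left, preferring the longest lexicon hit."""
    tokens = []
    end = len(text)
    while end > 0:
        start = max(0, end - max_len)
        # Drop the foremost character until the candidate is in the lexicon
        # or only a single character is left.
        while start < end - 1 and text[start:end] not in lexicon:
            start += 1
        tokens.append(text[start:end])
        end = start
    tokens.reverse()
    return tokens


def bie_tags(tokens, lexicon):
    """Label each character of a lexicon term as B, I or E; everything else as O."""
    tagged = []
    for token in tokens:
        if token in lexicon and len(token) > 1:
            labels = ["B"] + ["I"] * (len(token) - 2) + ["E"]
        else:
            labels = ["O"] * len(token)  # "O" is an assumed label for non-terms
        tagged.extend(zip(token, labels))
    return tagged


lexicon = {"咳嗽", "呼吸困难"}           # illustrative terms: cough, dyspnea
tokens = reverse_max_match("咳嗽伴呼吸困难", lexicon)
print(tokens)                            # ['咳嗽', '伴', '呼吸困难']
print(bie_tags(tokens, lexicon))
```

Scanning from the end of the text favours long trailing terms, which suits the long diagnosis, operation and examination terms mentioned above.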

In step S124, feature template 1 and feature template 2 provided by the present invention are shown in Figure 8.

In step S125, evaluating the effect of the CRF algorithm includes:

Suppose the test run is: % crf_test -m model test.data > output.txt

The results are written to output.txt. Evaluation compares the labels to be predicted with the predicted labels.

conlleval.pl < output.txt

The .pl suffix denotes a Perl script, so Perl (Practical Extraction and Report Language) must be installed.

Note: the columns of the output.txt produced by CRF++ are separated by TAB characters, which must all be replaced with real spaces, because conlleval.pl recognizes spaces as the separator.
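The replacement mentioned in the note can be done with any text tool; a minimal Python sketch follows. The output file name `output_space.txt` and both function names are hypothetical.

```python
def tabs_to_spaces(text):
    """Replace every TAB with a single space, leaving the rest untouched."""
    return text.replace("\t", " ")


def convert_file(src="output.txt", dst="output_space.txt"):
    """Copy src to dst line by line with TABs replaced by spaces."""
    with open(src, encoding="utf-8") as fin, open(dst, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(tabs_to_spaces(line))


print(tabs_to_spaces("cough\tv\tB\tB"))  # cough v B B
```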

The evaluation results that this command set outputs for feature template 1 and feature template 2 are compared in Table 3.

Table 3: Comparison of templates

Evaluation criteria: TP (True Positive): positive samples that the model predicts as positive;

FP (False Positive): negative samples that the model predicts as positive;

FN (False Negative): positive samples that the model predicts as negative;

TN (True Negative): negative samples that the model predicts as negative;

Precision: P = TP / (TP + FP);

Recall: R = TP / (TP + FN), i.e. the true positive rate;

F1 (overall classification measure): the harmonic mean of precision and recall, which lies closer to the smaller of P and R: F = 2 * P * R / (P + R);
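These measures can be computed directly from the test output. The sketch below is a token-level simplification and not a reimplementation of conlleval.pl, which scores whole chunks. It assumes that the last two whitespace-separated columns of each line hold the expected label and the predicted label, and that `O` marks tokens outside any term; the function name `token_prf` is hypothetical.

```python
def token_prf(lines, negative="O"):
    """Token-level precision, recall and F1 from 'token ... gold predicted' lines."""
    tp = fp = fn = 0
    for line in lines:
        cols = line.split()
        if len(cols) < 2:
            continue  # blank separator lines between sentences
        gold, pred = cols[-2], cols[-1]
        if pred != negative and pred == gold:
            tp += 1
        else:
            if pred != negative:
                fp += 1  # predicted a term label that is wrong
            if gold != negative:
                fn += 1  # missed or mislabelled an expected term label
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


sample = ["c n B B", "o n E E", "w v O B", "d n B O", "y n E E"]
print(token_prf(sample))  # (0.75, 0.75, 0.75)
```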

Conclusion: feature template 2 performs better, because it captures more effective features.

Processing of feature terms in step S13: the feature terms generated by the preceding processing are processed further to obtain feature terms that meet the requirements of PU learning. As shown in Figure 9, this includes the following steps:

S131: After processing by the CRF algorithm, a text file is obtained in which each word of the test set is labelled with its position in the text: prefix (B), word-internal (M) and suffix (E). A program then derives a feature set from these labels. Some feature words in this set are not feature terms already present in the relevant dictionaries; they are feature words that the CRF algorithm obtains after training on data through the manually annotated feature templates, i.e. so-called unregistered words.

S132: Feature extraction yields a set of feature terms that contains standard feature terms and possibly non-standard ones as well. These feature terms are therefore compared for similarity against the "non-standard feature terms" column of the thesaurus shown in Table 1. The comparison produces a similarity ranking, with the similarity values arranged in descending order.

S133: Because no ready-made standard data set is initially available for the thesaurus, the similarity threshold must be set low in order to build a thesaurus from scratch. It is tentatively set so that a similarity greater than or equal to 50% qualifies: the "non-standard feature terms" that meet the threshold and their corresponding "standard feature terms" are recommended to the operator as candidate feature terms, and the operator decides manually which standard term corresponding to a non-standard feature term is selected as the final standard term.

Table 4: Mapping from non-standard terms to standard terms

Non-standard feature terms	Standard feature terms
Tang's urine disease | sugared ornithosis	Diabetes
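The feature set of step S131 can be assembled from the position labels as sketched below. The input is assumed to be a sequence of (character, label) pairs using the B/M/E labels named in S131, with any other label treated as outside a term; the function name and the Chinese sample are illustrative only.

```python
def extract_terms(tagged_chars):
    """Join characters labelled B, M..., E into terms; ignore incomplete spans."""
    terms, current = [], []
    for char, tag in tagged_chars:
        if tag == "B":
            current = [char]            # start a new term
        elif tag == "M" and current:
            current.append(char)        # continue the open term
        elif tag == "E" and current:
            current.append(char)
            terms.append("".join(current))
            current = []
        else:
            current = []                # any other label closes an open span
    return terms


tagged = [("咳", "B"), ("嗽", "E"), ("伴", "O"),
          ("呼", "B"), ("吸", "M"), ("困", "M"), ("难", "E")]
print(extract_terms(tagged))  # ['咳嗽', '呼吸困难']
```

Terms that come out of this step but are missing from the dictionaries are the unregistered words mentioned in S131.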

As the thesaurus accumulates more and more feature terms, the threshold can be raised. The advantage is that only feature terms whose similarity exceeds the threshold appear in the candidate list for the operator's reference. If the similarity comparison yields no candidate for a feature term, the operator can manually revise that feature term to a specified standard feature term. Note: the threshold can be set freely by the operator in the system, which keeps the approach flexible.
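The candidate recommendation with an adjustable threshold can be sketched as follows. Note the substitution: the similarity here is `difflib.SequenceMatcher.ratio()` from the Python standard library, used only as a stand-in for the comprehensive similarity of the invention. The function name `candidates` and the small thesaurus are hypothetical.

```python
from difflib import SequenceMatcher


def candidates(term, thesaurus, threshold=0.5):
    """Return (score, non-standard term, standard term) rows at or above threshold."""
    scored = []
    for variant, standard in thesaurus.items():
        score = SequenceMatcher(None, term, variant).ratio()  # stand-in similarity
        if score >= threshold:
            scored.append((score, variant, standard))
    scored.sort(key=lambda row: row[0], reverse=True)  # highest similarity first
    return scored


thesaurus = {
    "Tang's urine disease": "Diabetes",
    "sugared ornithosis": "Diabetes",
    "Lymph gland": "Lymph node",
}
for score, variant, standard in candidates("Tang's urine disease", thesaurus):
    print(f"{score:.2f}  {variant} -> {standard}")
```

Raising `threshold` shortens the candidate list, which matches the behaviour described above once the thesaurus has grown.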

If "Tang's urine disease" is entered, it is a typical non-standard word caused by an input error. As Table 4 shows, the candidate non-standard feature term corresponding to "Tang's urine disease" is "Tang's urine disease", and this candidate leads to "diabetes", which is the final standard feature term. The feature terms before standardization are shown in Table 5:

Table 5: Feature terms before standardization

Pant	Expectoration
Heating	Spitting of blood
Weak	Asthma
Pulmonary infection	Full rabat
Hepatitis	Rabat
Infection	Diabetes
Hypertension	Tang's urine disease
Coronary stenting	Chilly
Coronary artery stent implantation	Auricular fibrillation
Coronary heart disease	Chest CT
Shiver with cold	Uncomfortable in chest
Cough	Pectoralgia
Leucocyte	WBC

As Table 5 shows, the feature term "Tang's urine disease" corresponds to the non-standard entry "Tang's urine disease | sugared ornithosis" in the thesaurus (Table 4). This mapping yields the corresponding standard feature term "diabetes". Because "Tang's urine disease" differs from "diabetes", the non-standard feature term "Tang's urine disease" is marked in an eye-catching color to prompt the clinician to correct it, as shown in Figure 10.

Specifically：

In step S133, the similarity processing of feature terms includes:

Each feature term in the extracted set is compared for similarity with the non-standard feature terms in the thesaurus, using a feature-text similarity measure. The comprehensive similarity is finally taken as the geometric mean of (Chinese-character distance + pinyin distance + Wubi distance). Although the similarity algorithms resemble one another, a considerable share of the feature words encountered in practice result from input errors, including homophone errors (same or near pronunciation) and errors involving similar characters (similar glyphs, such as shared radicals). The comprehensive similarity therefore raises the recall of the near-duplicate detection algorithm while also achieving higher precision, as shown in Figure 18:

For each of the three measures (Chinese-character distance, pinyin distance and Wubi distance), the similarity is computed with two distances, the Jaro-Winkler distance and the string edit distance, and the mean of the two distances is taken as the distance measure, as shown in Table 6:

Table 6: Comparison of the three similarity measures

This shows that the two feature terms are very close, so only the standard one of them need be used to represent both. The method can also uncover synonymous phrases for standard terms that commonly appear in routine use; these can be added to the thesaurus to enrich its vocabulary.
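The comprehensive similarity of step S133 can be sketched as follows, under stated assumptions. Each term is represented by three strings (characters, pinyin, Wubi code); producing the pinyin and Wubi strings needs an input-method table that is not shown, and the example strings, including the Wubi codes, are unverified placeholders. The edit distance is normalised to a similarity in [0, 1], which is an assumption about how the two measures are averaged.

```python
import math


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


def jaro(a, b):
    if a == b:
        return 1.0
    if not a or not b:
        return 0.0
    window = max(max(len(a), len(b)) // 2 - 1, 0)
    a_hit, b_hit = [False] * len(a), [False] * len(b)
    matches = 0
    for i, ca in enumerate(a):
        for j in range(max(0, i - window), min(len(b), i + window + 1)):
            if not b_hit[j] and b[j] == ca:
                a_hit[i] = b_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a_seq = [c for c, hit in zip(a, a_hit) if hit]
    b_seq = [c for c, hit in zip(b, b_hit) if hit]
    transpositions = sum(x != y for x, y in zip(a_seq, b_seq)) / 2
    return (matches / len(a) + matches / len(b)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(a, b, p=0.1):
    score = jaro(a, b)
    prefix = 0
    for x, y in zip(a[:4], b[:4]):      # common prefix, capped at 4 characters
        if x != y:
            break
        prefix += 1
    return score + prefix * p * (1 - score)


def pair_similarity(a, b):
    """Mean of the Jaro-Winkler similarity and the normalised edit similarity."""
    return (jaro_winkler(a, b) + edit_similarity(a, b)) / 2


def comprehensive_similarity(views_a, views_b):
    """Geometric mean over the (characters, pinyin, Wubi) representations."""
    sims = [pair_similarity(x, y) for x, y in zip(views_a, views_b)]
    return math.prod(sims) ** (1 / len(sims))


# Placeholder representations of a miswritten term and its standard form.
wrong = ("唐尿病", "tangniaobing", "yvhkniiugmw")
right = ("糖尿病", "tangniaobing", "oyvkniiugmw")
print(round(comprehensive_similarity(wrong, right), 3))
```

Because the pinyin strings are identical, the homophone error scores higher here than a comparison of the characters alone would give, which is the effect the text attributes to the combined measure.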

The feature terms after standardization are shown in Table 7 below:

Table 7: Feature terms after standardization

Pant	Expectoration
Heating	Spitting of blood
Weak	Asthma
Pulmonary infection	Full rabat
Hepatitis	
Infection	Diabetes
Hypertension	
Coronary stenting	Chilly
	Auricular fibrillation
Coronary heart disease	Chest CT
Shiver with cold	Uncomfortable in chest
Cough	Pectoralgia
Leucocyte	

Feature term ranking: the processing above extracts many feature terms, but not every extracted feature is meaningful. TF-IDF is therefore used to rank the features and filter out the key ones. Because not every feature word appears in every document, the ranking of key feature terms is obtained by summing the weight of each feature term over all documents d in which it appears, taking its average over all documents, and ranking the averages in descending order. Here the CRF++ tool extracted 390 feature terms in total; these were ranked by the computed average weight and, after confirmation and screening by domain experts, 68 key feature terms were finally obtained. Table 8 lists the top 20 feature terms.

Table 8: Feature term ranking
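The ranking procedure can be sketched as follows. The source does not state which TF-IDF variant is used, so the sketch assumes term frequency as count divided by document length and inverse document frequency as log(N / df); the function name and the three toy documents are hypothetical.

```python
import math
from collections import Counter


def average_tfidf(docs):
    """Rank terms by TF-IDF weight summed over documents and averaged over all docs."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    totals = Counter()
    for doc in docs:
        for term, count in Counter(doc).items():
            tf = count / len(doc)                     # assumed TF definition
            idf = math.log(n_docs / doc_freq[term])   # assumed IDF definition
            totals[term] += tf * idf
    ranked = [(term, weight / n_docs) for term, weight in totals.items()]
    ranked.sort(key=lambda row: row[1], reverse=True)
    return ranked


docs = [
    ["cough", "fever", "cough"],
    ["cough", "chest pain"],
    ["fever", "cough"],
]
for term, weight in average_tfidf(docs):
    print(f"{term}\t{weight:.4f}")
```

A term that occurs in every document, such as "cough" in this toy data, receives weight 0 and drops to the bottom of the ranking.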

Second preferred technical solution:

Auxiliary diagnosis management:

According to the generated feature terms, the files are converted into the file format for PU learning (learning from a positive data set and an unlabeled data set). Through PU learning, the system automatically recommends relevant diagnoses for clinical staff to refer to, as shown in Figure 11:

S23: Obtain the auxiliary diagnosis result.

Specifically：

Step S22: Application of partially supervised learning:

Partially supervised learning generally takes two forms. The first learning task learns from labeled and unlabeled data and is also called LU learning, where L denotes the labeled data set and U the unlabeled data set. The second learning task learns from a positive data set and an unlabeled data set, i.e. PU learning, where P denotes the positive set and U the unlabeled set; the goal of the algorithm is to learn an accurate classifier without labeling any negative data.

In practical applications, positive documents need to be distinguished within a mixed collection of documents, which contains both positive documents and documents of other classes. Documents of the class of interest are called positive documents; documents of the remaining classes are called negative documents. All positive documents form the positive set P; all negative documents form the unlabeled set U.

The problem is to find a classifier that, using sets P and U, can distinguish the positive documents from the negative documents in the test set. The method for solving this problem is PU learning.

This learning framework rests on the following fact: with the Internet now pervasive, people are in most cases interested in only one class of documents or web pages and do not care about the others. Once a small number of documents of interest have been labeled, the PU learning framework yields a classifier that can label unseen documents and so retrieve the documents required. For example, for someone interested in dating-site web pages, all other web pages can be regarded as negative web pages.

The same situation often arises in medical research: a disease is hard to diagnose from certain features, yet it is precisely the disease that clinicians are interested in. The small share of medical records with a confirmed diagnosis of this disease is labeled to form the positive document set, which is then combined with a large number of unlabeled medical records (the unlabeled data set) to form the training set for learning. The classifier obtained with the PU learning framework labels future medical record information, thereby achieving auxiliary diagnosis.

The present invention also provides an electronic medical record post-structuring and auxiliary diagnosis system, including:

I. Experimental framework and results

1. Tools used in the experiment

(1) PU learning tool LPU (http://www.cs.uic.edu/~liub/LPU/lpu.zip).

(2) Support vector machine toolkit: download the SVMlight (support vector machine) toolkit.

(3) Experiment command and parameters

lpu -s1 [option 1] -s2 [option 2] -c [option 3] -f [filestem]

-s1: the parameter option for the first stage of PU learning.

-s2: the parameter option for the second stage of PU learning.

-c: the classifier selection mode.

-s1 offers three methods: the spy technique (spy), Rocchio (roc) and naive Bayes (nb). -s2 offers two methods: support vector machine (svm) and expectation maximization (em). Classifier selection mode: 1 means selecting the best of the generated classifiers.
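The six first-stage and second-stage combinations can be enumerated as command strings, as in the sketch below. The flag syntax is copied from the command line quoted above and has not been checked against the LPU tool itself; the commands are only printed, not executed, and the function name `lpu_commands` is hypothetical.

```python
from itertools import product

STAGE1 = ["spy", "roc", "nb"]   # first-stage methods named in the text
STAGE2 = ["svm", "em"]          # second-stage methods named in the text


def lpu_commands(filestem, classifier_mode=1):
    """Build one command string per stage-1 / stage-2 combination."""
    return [
        f"lpu -s1 {s1} -s2 {s2} -c {classifier_mode} -f {filestem}"
        for s1, s2 in product(STAGE1, STAGE2)
    ]


for command in lpu_commands("demo"):
    print(command)
```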

2. File format of the experimental data sets

The three original data sets are:

demo.pos: the positive document set.

demo.unlabel: the unlabeled document set.

Neither of the above two files contains class labels, as shown in Figure 12.

demo.test: the test data set. It contains both positive and negative documents and also includes class labels: positive examples are marked +1 and negative examples -1, as shown in Figure 13.

Format of each data line in a data file: label attribute:value ... attribute:value. Label values: +1 and -1, denoting positive and negative documents respectively. The label and the attribute:value pairs are separated by spaces. Each attribute must be numbered with an integer, starting from 1. Each attribute value must be an integer giving the number of times the attribute occurs in the document. Features whose value is 0 are ignored automatically. Attribute numbers must be arranged in increasing order, e.g. 5:1 6:1 7:1 8:1 10:4 11:2 12:3 13:1 14:1 15:1 16:6 17:2 23:1 25:2 29:1.
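A data line in this format can be produced as sketched below. The vocabulary that maps feature terms to attribute numbers and both function names are hypothetical; zero counts are simply omitted, since the text says they are ignored anyway.

```python
def document_counts(terms, vocabulary):
    """Count how often each known feature term occurs, keyed by attribute number."""
    counts = {}
    for term in terms:
        index = vocabulary.get(term)
        if index is not None:
            counts[index] = counts.get(index, 0) + 1
    return counts


def format_line(counts, label=None):
    """Render 'label attr:value ...' with attribute numbers in increasing order."""
    pairs = " ".join(f"{index}:{n}" for index, n in sorted(counts.items()) if n > 0)
    return pairs if label is None else f"{label:+d} {pairs}"


vocabulary = {"Cough": 1, "Expectoration": 2, "Chest CT": 3}  # numbering starts at 1
record = ["Cough", "Chest CT", "Cough", "Headache"]
print(format_line(document_counts(record, vocabulary)))           # 1:2 3:1
print(format_line(document_counts(record, vocabulary), label=1))  # +1 1:2 3:1
```

Lines without a label suit the positive and unlabeled files, while labeled lines suit the test file.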

3. Experimental data

(1) Composition of the experimental data:

This experiment collected 750 valid electronic medical records in total, all relating to respiratory diseases. After feature extraction by the system, the valid diagnoses obtained are shown in Table 9 below:

Table 9: Valid diagnoses

The extracted feature attributes are shown in Table 10 below:

Table 10: Feature attributes

Full rabat	Coronary heart disease	Upper right Lung infection
Chest CT	Pulmonary emphysema	Atelectasis
Asthma	Precordialgia	Bronchiectasis
Cough	Become thin	Malignant pleural effusion
Bronchial asthma	Chilly	Interstitial pneumonia
Pleural effusion	Expiratory dyspnea	Ventilatory dysfunction
Runny nose	Shortness of breath	Stridulate
Pant	Heating	Obstructive pneumonia
Shiver with cold	Malaise	Upper left Lung infection
Two enhanced lung markings	Lower-left Lung infection	Enlargement of lymph nodes
Spitting of blood	Respiratory failure	Cholecystolithiasis
Weak	Pulmonary tuberculosis	Hydropericardium
Uncomfortable in chest	Pneumothorax	Sneeze
Nasal obstruction	Expectoration	Hydropneumothorax
Pectoralgia	Chronic bronchitis	Hypertension
Headache	Bottom right Lung infection	Oedema
Acute bronchitis	DOMS	Calcification of lymph node
The infection of the upper respiratory tract	Spontaneous pneumothorax	Pleural calcification
Palpitaition	AECB	Diabetes
Vomiting	COPD	Lose weight
Nausea	It is short of breath	Bronchiostenosis
Dizzy	Edema of lower extremity	Pleural effusion
Pharyngalgia	Pulmonary fibrosis	

4. Grouping of the experimental data

With COPD as the positive class, 151 COPD medical records were chosen to form the positive document set, from which the mzf.pos file was generated. From the remaining documents, 49 COPD records and 300 records of other disease types were selected to form the mzf.unlabel file. Finally, the remaining 50 COPD records and 200 records of other diseases formed the mzf.test file.

5. Classifier combinations used in the experiment:

Table 11: Classifier combinations

To ensure applicability to different application environments, the system provided by the present invention offers multiple classifier combinations, so that the best classifier can be selected for each task.

6. Experimental results and analysis:

Judging by the positive-document recall in Figure 14, the recall of Roc-Svm is not the highest, but it reaches 90%, close to the peak. In Figure 15, the positive-document precision of Roc-Svm is 82%, close to the peak of 83%. Taking the recall and precision of positive documents together, Figure 16 shows that the positive-document F-value of Roc-Svm reaches 85.9%, the best of all classifiers. Figure 17 further shows that Roc-Svm achieves an overall accuracy of 94%. It follows that, for the data set of this experiment, Roc-Svm is the optimal classifier.

This is mainly because, in PU learning, the unlabeled set U usually has the following characteristics:

1. In the unlabeled set U, the proportion of positive documents is usually small, so they do not greatly affect the centroid vector of the negative documents in the algorithm.

2. The unlabeled set U usually contains documents of many different classes, so in vector space they cover a large region, i.e. they are relatively dispersed. The documents in the positive set P generally belong to a single class and are of the same type, so in vector space they cover a small region, i.e. they are relatively concentrated. Suppose a decision boundary is used to distinguish positive documents from negative documents, where positive documents belong to set P and negative documents belong to set U, and the boundary is meant to separate the documents in P from the documents in U. Because the documents in U are more dispersed, many negative documents are misclassified as positive; this is exactly why the Rocchio algorithm can extract reliable negative documents with high precision. Once the reliable negative document set RN has been formed, RN and set P can be used as the training set to train a support vector machine (SVM), iterating until an iteration extracts no further reliable negative documents. Because the Rocchio algorithm misclassifies many negative documents as positive, the positive documents have very low precision; classifying with an SVM corrects this bias of the Rocchio algorithm and thus yields a more accurate classifier. This is precisely why the Roc-Svm classifier is the optimal classifier in this experiment.
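The first stage described above, extracting reliable negative documents with a Rocchio-style prototype comparison, can be sketched as follows. This is not the LPU implementation: documents are assumed to be sparse {attribute: count} dictionaries, the weights alpha = 16 and beta = 4 are values commonly quoted for Rocchio classifiers and are an assumption here, and all function names are hypothetical. The second stage (iterative SVM training on P and RN) is omitted.

```python
import math


def normalize(vec):
    """Scale a sparse vector to unit length (empty if the vector is all zero)."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else {}


def mean_vector(docs):
    """Average of the length-normalised document vectors."""
    total = {}
    for doc in docs:
        for k, v in normalize(doc).items():
            total[k] = total.get(k, 0.0) + v / len(docs)
    return total


def prototype(own_mean, other_mean, alpha, beta):
    keys = set(own_mean) | set(other_mean)
    return {k: alpha * own_mean.get(k, 0.0) - beta * other_mean.get(k, 0.0) for k in keys}


def cosine(a, b):
    a, b = normalize(a), normalize(b)
    return sum(v * b.get(k, 0.0) for k, v in a.items())


def reliable_negatives(P, U, alpha=16, beta=4):
    """Documents in U that are closer to the negative prototype than to the positive one."""
    mean_p, mean_u = mean_vector(P), mean_vector(U)
    proto_pos = prototype(mean_p, mean_u, alpha, beta)
    proto_neg = prototype(mean_u, mean_p, alpha, beta)
    return [d for d in U if cosine(proto_neg, d) > cosine(proto_pos, d)]


P = [{1: 3, 2: 1}, {1: 2, 2: 2}]                              # labeled positives
U = [{1: 2, 2: 1}, {3: 4, 4: 1}, {3: 1, 4: 3}, {4: 2, 5: 2}]  # first one is a hidden positive
print(reliable_negatives(P, U))  # the three documents that do not resemble P
```

In the toy data the hidden positive in U stays out of RN because it lies close to the positive prototype, which illustrates why RN can serve as negative training data for the SVM.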

The specific experimental results are shown in Figures 14 to 17:

Evaluation criteria: TP (True Positive): positive samples that the model predicts as positive; FP (False Positive): negative samples that the model predicts as positive; FN (False Negative): positive samples that the model predicts as negative; TN (True Negative): negative samples that the model predicts as negative;

Precision: P = TP / (TP + FP);

Recall: R = TP / (TP + FN), i.e. the true positive rate;

Accuracy: the classifier's ability to judge the whole sample set, i.e. to judge positives as positive and negatives as negative: A = (TP + TN) / (TP + FN + FP + TN).
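As a small worked check of these definitions, the sketch below recomputes an F-value from the rounded precision (82%) and recall (90%) reported for Roc-Svm, and evaluates the accuracy formula on hypothetical counts. The counts passed to `accuracy` are invented for illustration and are not results from the experiment.

```python
def f_value(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def accuracy(tp, fp, fn, tn):
    """Share of all samples that the classifier judges correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


# Rounded figures reported for Roc-Svm in the text above.
print(round(f_value(0.82, 0.90), 3))        # 0.858

# Hypothetical confusion counts, for illustration only.
print(accuracy(tp=8, fp=2, fn=1, tn=9))     # 0.85
```

The value of about 85.8% obtained from the rounded inputs is close to the reported 85.9%; the small gap is what one would expect if the reported figure was computed from unrounded precision and recall.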

The above embodiments merely illustrate the technical solution of the present invention and do not limit it. Although the present invention has been described in detail with reference to the above embodiments, a person of ordinary skill in the art may still modify the specific embodiments of the invention or substitute equivalents; any such modification or equivalent substitution that does not depart from the spirit and scope of the invention falls within the scope of the pending claims of the present invention.

Claims

1. A method for implementing electronic medical record post-structuring and auxiliary diagnosis, characterized in that the method comprises the following steps:

(1) structuring of the electronic medical record text;

S11: establishing the medical dictionary;

S12: establishing the medical corpus;

S13: processing the medical feature terms;

(2) auxiliary diagnosis management;

S23: obtaining the auxiliary diagnosis result.

2. The method of claim 1, characterized in that, in step S11, the medical dictionary comprises:

a standard medical dictionary, which takes as its standard the data of the globally used International Statistical Classification of Diseases and Related Health Problems, 10th Revision (ICD-10), the International Classification of Diseases, 9th Revision, Clinical Modification, operations and procedures (ICD-9-CM), and the Systematized Nomenclature of Medicine - Clinical Terms (SNOMED CT);

a clinical medicine application dictionary, comprising an internal dictionary and a thesaurus, wherein the internal dictionary comprises a clinical symptom dictionary and other related dictionaries of examination terms, and the thesaurus comprises: the mapping from non-standard feature terms to standard feature terms, the mapping from miswritten words to standard feature terms, and the mapping from a single standard term to multiple standard terms.

3. The method of claim 1, characterized in that, in step S12, establishing the medical corpus comprises the following steps:

S126: finally forming the medical corpus.

4. The method of claim 3, characterized in that, in step S122, the part-of-speech tagging preprocesses the extracted electronic medical record documents, obtains the part of speech of the text in the electronic medical record documents, combines it with the word-position tags, converts the result into the conditional random field (CRF) format, and performs feature extraction with the CRF algorithm; the automatically tagged electronic medical record documents are checked manually;

the word-position tagging uses the standard medical dictionary to raise the hit probability on the text of the electronic medical record documents and applies the reverse maximum matching algorithm, in which matching scans from the end of the document being processed, taking the last 2i characters each time as the matching field; if matching fails, the foremost character of the matching field is removed and matching continues; medical terms are tagged automatically as prefix B, word-internal I and suffix E;

CRF algorithm training is carried out in step S124; suppose the test run is: % crf_test -m model test.data > output.txt, with the results written to output.txt; evaluation compares the labels to be predicted with the predicted labels;

the columns of the output.txt produced by the CRF algorithm are separated by TAB characters, which are all replaced with real spaces, because conlleval.pl recognizes spaces as the separator;

TP, True Positive: positive samples that the model predicts as positive;

FP, False Positive: negative samples that the model predicts as positive;

FN, False Negative: positive samples that the model predicts as negative;

TN, True Negative: negative samples that the model predicts as negative;

Precision: P = TP / (TP + FP);

Recall: R = TP / (TP + FN), i.e. the true positive rate;

F1, the overall classification measure: the harmonic mean of precision and recall, which lies closer to the smaller of P and R: F = 2 * P * R / (P + R).

5. The method of claim 1, characterized in that, in step S13, the medical feature term processing comprises the following steps:

S131: the electronic medical record documents processed by the CRF algorithm yield a text in which each word of the test set is labelled with its position in the text: prefix B, word-internal I and suffix E; a program derives a feature set from these labels; some feature words in the feature set are original dictionary words, while others are feature terms not recorded in the dictionary, which the CRF obtains after training on data through the manually annotated feature templates, i.e. so-called unregistered words;

S132: feature extraction yields a set of feature terms containing both standard and non-standard feature terms; using the thesaurus that maps non-standard feature terms to standard feature terms, the non-standard feature terms are compared for similarity with the non-standard feature terms in the thesaurus, and the comparison produces a similarity ranking with the similarity values arranged in descending order;

S133: the similarity threshold is tentatively set so that a similarity greater than or equal to 50% qualifies; the non-standard feature terms that meet the threshold and their corresponding standard feature terms are recommended to the operator as candidate feature terms, and the operator determines the standard feature term corresponding to each non-standard feature term as the final standard feature term; the threshold can be set freely by the operator.

6. The method of claim 5, characterized in that, in step S132, the TF-IDF method is used to sum the weight of each feature term over all electronic medical record documents in which it appears, obtain the average of each feature term over all electronic medical record documents, and rank the averages in descending order;

in step S133, a feature-text similarity measure is used to compute the similarity of the feature terms in the feature term set; the comprehensive similarity is computed as the geometric mean of (Chinese-character distance + pinyin distance + Wubi distance);

for each of the Chinese-character distance, the pinyin distance and the Wubi distance, the similarity is computed with two distances, the string similarity distance and the string edit distance, and the mean of the two distances is taken as the distance measure.

7. The method of claim 1, characterized in that step S22 comprises: the feature word frequencies form a test data set composed of the positive document data set and the unlabeled document data set; learning is carried out from the positive document data set and the unlabeled document data set, and the learning framework over sets P and U is used to distinguish the positive documents from the negative documents in the test data set, i.e. PU learning, where P denotes the positive document data set and U denotes the unlabeled data set composed of negative documents; a classifier is learned without labeling any negative documents, and the classifier is used to label the unlabeled document data set and obtain the required documents;

the medical records with a confirmed diagnosis of the specified disease are labeled to form the positive document data set, which is combined with the unlabeled medical records (the unlabeled document data set) to form the training set for learning; the classifier obtained with the PU learning framework labels future electronic medical record documents, thereby achieving auxiliary diagnosis.

8. An electronic medical record post-structuring and auxiliary diagnosis system, characterized in that the system comprises:

a medical dictionary management module, for managing the standard dictionaries and the clinical medicine application dictionary; the clinical medicine application dictionary comprises an internal dictionary and a thesaurus, wherein the internal dictionary comprises a clinical symptom dictionary and other related dictionaries of examination terms, and the thesaurus comprises: the mapping from non-standard feature terms to standard feature terms, the mapping from miswritten words to standard feature terms, and the mapping from a single standard term to multiple standard terms;

an auxiliary diagnosis management module, for PU learning framework management, PU learning training and test management, and PU-learning-based auxiliary diagnosis management.