CN109697286A

CN109697286A - Diagnosis standardization method and device based on word vectors

Info

Publication number: CN109697286A
Application number: CN201811551703.8A
Authority: CN
Inventors: 李玉娇; 陆王天宇; 谭炎; 吴栋梁
Original assignee: Zhongan Information Technology Service Co Ltd
Current assignee: Zhongan Information Technology Service Co Ltd
Priority date: 2018-12-18
Filing date: 2018-12-18
Publication date: 2019-04-30
Also published as: WO2020124856A1

Abstract

The invention discloses a diagnosis standardization method and device based on word vectors. The method comprises: S1: obtaining a diagnosis statement to be processed and performing word segmentation on it to obtain a word segmentation result; S2: establishing, according to the word segmentation result and pre-built corresponding models, the word vectors, word information content and part-of-speech tagging result of the diagnosis statement to be processed; S3: calculating, according to the word vectors, word information content and part-of-speech tagging result, the similarity between the diagnosis statement to be processed and each relevant standard diagnosis statement in a standard library; S4: selecting the standard diagnosis statement most similar to the diagnosis statement to be processed as its standardization result. By calculating the semantic similarity between the diagnosis statement to be processed and the standard diagnosis statements closest to the current diagnosis, selecting the standard diagnosis statement with the highest similarity as the standardization result, and periodically updating the corresponding models, the invention improves the accuracy of diagnosis standardization.

Description

A diagnosis standardization method and device based on word vectors

Technical field

The present invention relates to the technical field of data processing, and in particular to a diagnosis standardization method and device based on word vectors.

Background

Diagnosis standardization of diseases is of great importance for insurance claim settlement and medical research statistics. Current disease standards include ICD-10 (International Classification of Diseases), published by an international authority. However, on the one hand, different hospitals use different ICD-10 diagnostic standards; on the other hand, a doctor's handwritten diagnosis often differs from the standard diagnosis. How to uniformly standardize the non-standard diagnoses of different hospitals is therefore a problem of real practical significance.

Current diagnosis standardization methods have the following problems:

1. They rely entirely on unsupervised natural language processing and make no use of existing annotated resources, so their accuracy is low;

2. They rely on manual revision and annotation, which must cover both the different names used for the same ICD-10 disease and the frequent revisions of ICD-10, and therefore consume a large amount of human resources.

Summary of the invention

To solve the problems in the prior art, embodiments of the present invention provide a diagnosis standardization method and device based on word vectors, so as to overcome problems of the prior art such as reliance on manual revision and annotation, which consumes a large amount of human resources, and exclusive use of unsupervised natural language processing, which makes no use of existing annotated resources and yields low accuracy.

To solve the above technical problems, the technical solution adopted by the present invention is as follows:

In one aspect, a diagnosis standardization method based on word vectors is provided, the method comprising the following steps:

S1: obtaining a diagnosis statement to be processed, and performing word segmentation on the diagnosis statement to be processed to obtain a word segmentation result;

S2: establishing, according to the word segmentation result and pre-built corresponding models, the word vectors, word information content and part-of-speech tagging result of the diagnosis statement to be processed;

S3: calculating, according to the word vectors, word information content and part-of-speech tagging result, the similarity between the diagnosis statement to be processed and each relevant standard diagnosis statement in a standard library;

S4: selecting the standard diagnosis statement most similar to the diagnosis statement to be processed as the standardization result of the diagnosis statement.

Further, before establishing the word vectors, word information content and part-of-speech tagging result of the diagnosis statement to be processed, the method further comprises:

converting, according to a pre-built abbreviation dictionary, the abbreviations in the word segmentation result into the corresponding standard words.

Further, step S3 specifically comprises:

S3.1: calculating, according to the word vectors and the part-of-speech tagging result, the word similarity between the word segmentation result and the relevant standard diagnosis statements in the standard library;

S3.2: calculating, according to the word similarity and the word information content, the similarity between the diagnosis statement to be processed and each relevant standard diagnosis statement in the standard library.

Further, step S4 specifically comprises:

sorting the similarities between the diagnosis statement to be processed and the relevant standard diagnosis statements in the standard library and, according to the sorting result, selecting the standard diagnosis statement most similar to the diagnosis statement to be processed as the standardization result of the diagnosis statement.

Further, the method also comprises:

S5: using corrected diagnosis-standard pairs as training corpus, performing supplementary training on the pre-built corresponding models, evaluating the accuracy of the trained models, and replacing the original models with the models that pass the evaluation.

In another aspect, a diagnosis standardization device based on word vectors is provided, the device comprising:

a word segmentation module, configured to obtain a diagnosis statement to be processed and perform word segmentation on it to obtain a word segmentation result;

a construction module, configured to establish, according to the word segmentation result and pre-built corresponding models, the word vectors, word information content and part-of-speech tagging result of the diagnosis statement to be processed;

a calculation module, configured to calculate, according to the word vectors, word information content and part-of-speech tagging result, the similarity between the diagnosis statement to be processed and each relevant standard diagnosis statement in a standard library;

an output module, configured to select the standard diagnosis statement most similar to the diagnosis statement to be processed as the standardization result of the diagnosis statement.

Further, the construction module also comprises:

a replacement unit, configured to convert, according to a pre-built abbreviation dictionary, the abbreviations in the word segmentation result into the corresponding standard words.

Further, the calculation module comprises:

a word similarity calculation unit, configured to calculate, according to the word vectors and the part-of-speech tagging result, the word similarity between the word segmentation result and the relevant standard diagnosis statements in the standard library;

a sentence similarity calculation unit, configured to calculate, according to the word similarity and the word information content, the similarity between the diagnosis statement to be processed and each relevant standard diagnosis statement in the standard library.

Further, the output module comprises:

a sorting unit, configured to sort the similarities between the diagnosis statement to be processed and the relevant standard diagnosis statements in the standard library;

a selection unit, configured to select, according to the sorting result, the standard diagnosis statement most similar to the diagnosis statement to be processed as the standardization result of the diagnosis statement.

Further, the device also comprises:

an update module, configured to use corrected diagnosis-standard pairs as training corpus, perform supplementary training on the pre-built corresponding models, evaluate the accuracy of the trained models, and replace the original models with the models that pass the evaluation.

The technical solutions provided by the embodiments of the present invention have the following beneficial effects:

1. The diagnosis standardization method and device based on word vectors provided by the present invention calculate the semantic similarity between the diagnosis statement to be processed and the standard diagnosis statements closest to the current diagnosis, select the standard diagnosis statement with the highest similarity as the standardization result, and periodically update the corresponding models, which improves the accuracy of diagnosis standardization;

2. The diagnosis standardization method and device based on word vectors provided by the present invention perform diagnosis standardization automatically through the corresponding models, which greatly reduces the manpower that diagnosis standardization consumes and also provides a complete calculation method and process for subsequent precise standardization.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flow chart of a diagnosis standardization method based on word vectors according to an exemplary embodiment;

Fig. 2 is a flow chart, according to an exemplary embodiment, of calculating the similarity between the diagnosis statement to be processed and each relevant standard diagnosis statement in the standard library according to the word vectors, word information content and part-of-speech tagging result;

Fig. 3 is a schematic structural diagram of a diagnosis standardization device based on word vectors according to an exemplary embodiment.

Specific embodiments

To make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the present invention.

The diagnosis standardization method and device based on word vectors provided by the embodiments of the present invention find, by calculation, the standard disease closest to the current diagnosis (i.e., the diagnosis statement to be processed), specifically by calculating the semantic distance of diagnosis-disease pairs. The diagnosis-disease semantic distance is a semantic similarity measure; it is computed from the information increment of the individual diagnosis and disease sentences and of the diagnosis-disease sentence pair.

Fig. 1 is a flow chart of a diagnosis standardization method based on word vectors according to an exemplary embodiment. Referring to Fig. 1, the method comprises the following steps:

S1: obtaining a diagnosis statement to be processed, and performing word segmentation on the diagnosis statement to be processed to obtain a word segmentation result.

Specifically, in the embodiment of the present invention, word segmentation uses jieba in search-engine mode with HMM disabled. On the one hand, search-engine mode yields a finer-grained segmentation result, which facilitates similarity calculation with the word vector model. On the other hand, disabling HMM ensures that the same sentence always yields the same segmentation result, which reduces unnecessary similarity calculations and improves speed and accuracy.

S2: establishing, according to the word segmentation result and pre-built corresponding models, the word vectors, word information content and part-of-speech tagging result of the diagnosis statement to be processed.

Specifically, the pre-built models include a word vector model, a word information content model and a part-of-speech tagging model. The word segmentation result is input into the word vector model to obtain the word vectors of the diagnosis statement to be processed, into the word information content model to obtain its word information content, and into the part-of-speech tagging model to obtain its part-of-speech tagging result.

Before the word vector model and the word information content model are built, a training corpus and a test corpus must be prepared. In the embodiment of the present invention, the data in the ICD-10 (International Classification of Diseases) standard library and a medical corpus are obtained and randomly divided into a training corpus and a test corpus. The medical corpus can be obtained from sources such as medical diagnoses, online medical forums and medical e-books; it needs to have a certain degree of relevance to medicine, which improves the accuracy of the models. If no such corpus is available, a general corpus may be used instead, at the cost of some model accuracy.

The word vector model and the word information content model are first set up. The sentences in the training corpus are then segmented, and the segmentation results are input into the word vector model or the word information content model to train it until it converges. After training, the sentences in the test corpus are segmented and the segmentation results are input into the trained word vector model or word information content model to test it and verify its accuracy.

In the embodiment of the present invention, the word vector model is a skip-gram model. Skip-gram is a word vector model based on a neural network. It maps words into a high-dimensional space according to their arrangement and order of appearance in documents, and the similarity between words can be obtained by calculating the cosine distance in that space. The data in the ICD-10 (International Classification of Diseases) standard library and the medical corpus are segmented and used as training data to train the skip-gram model, yielding a word vector model for the medical domain.

The word information content is obtained by calculating IDF values over the relevant corpus. IDF stands for inverse document frequency. In its calculation, N denotes the number of documents in the document collection and n_k denotes the number of documents in which term k appears. The fewer documents contain a term, the larger its IDF, which indicates that the term discriminates well between classes and carries more information. The embodiment of the present invention uses IDF as the weight that links word similarity to sentence similarity, and the word information content of a term is computed from N and n_k.
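The formula itself is not reproduced in this text. As a minimal sketch, assuming the common definition idf_k = log(N / n_k) (an assumption; the original may use a smoothed or differently scaled variant), the word information content table could be built as follows. The function name `idf_table` and the toy corpus are illustrative only.

```python
import math

def idf_table(documents):
    """Return {term: idf} for a corpus given as a list of token lists.

    Assumes idf_k = log(N / n_k), where N is the number of documents
    and n_k is the number of documents that contain term k.
    """
    n_docs = len(documents)
    doc_freq = {}
    for doc in documents:
        # Count each term once per document.
        for term in set(doc):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / n_k) for term, n_k in doc_freq.items()}

# Toy corpus: "fracture" occurs in every document, "avulsion" in only one.
corpus = [
    ["fracture", "avulsion"],
    ["fracture"],
    ["fracture", "infection"],
]
table = idf_table(corpus)
```

In this toy corpus the rarer term receives the larger value, which matches the statement that terms appearing in fewer documents carry more information.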

Part-of-speech tagging uses the Stanford POS tagger model; the tagged parts of speech include noun, verb, predicate and so on. When similarity is calculated, the weight of a word in the similarity calculation can be chosen according to its part of speech. For example, the weight of linking words such as "is" is set lower, while the weight of nouns is generally set higher.

S3: calculating, according to the word vectors, word information content and part-of-speech tagging result, the similarity between the diagnosis statement to be processed and each relevant standard diagnosis statement in the standard library.

Specifically, as a preferred embodiment, before the similarity between the diagnosis statement to be processed and the relevant standard diagnosis statements in the standard library is calculated, the relevant standard diagnosis statements are first segmented to obtain their word segmentation results. The corresponding word vectors, word information content and part-of-speech tagging results are then established from those segmentation results. Finally, the word vectors, word information content and part-of-speech tagging results of the standard diagnosis statements are combined with those of the diagnosis statement to be processed to calculate the similarity between the diagnosis statement to be processed and each relevant standard diagnosis statement. The similarity between words can be obtained by calculating the cosine distance in the vector space. In the embodiment of the present invention, the standard library includes the ICD-10 (International Classification of Diseases) standard library, among others. Before the similarity is calculated, the standard diagnosis statements relevant to the diagnosis statement to be processed are first preliminarily screened out of the standard library, for example by keyword retrieval.
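The preliminary keyword screening can be pictured with a short sketch. It assumes, as a simplification not stated in the source, that a standard diagnosis statement counts as relevant when it shares at least one token with the diagnosis statement to be processed. The function name `prescreen` and the sample entries are hypothetical.

```python
def prescreen(query_tokens, standard_entries):
    """Keep the standard statements that share at least one token with the query.

    standard_entries maps each standard diagnosis statement to its token list.
    The overlap rule is an assumption made for illustration.
    """
    query = set(query_tokens)
    return [name for name, tokens in standard_entries.items() if query & set(tokens)]

# Hypothetical, already-segmented standard library entries.
standard_entries = {
    "heel bone fracture": ["heel", "bone", "fracture"],
    "upper respiratory tract infection": ["upper", "respiratory", "tract", "infection"],
}
candidates = prescreen(["right", "heel", "avulsion", "fracture"], standard_entries)
```

Only the surviving candidates then need the more expensive similarity calculation.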

As a preferred embodiment, in the embodiment of the present invention, before the word vectors, word information content and part-of-speech tagging result of the diagnosis statement to be processed are established, the method further comprises:

converting, according to a pre-built abbreviation dictionary, the abbreviations in the word segmentation result into the corresponding standard words.

Specifically, common colloquial abbreviations can be mapped to their standard wording by lookup and replacement, for example "上感" -> "上呼吸道感染" (upper respiratory tract infection). The embodiment of the present invention requires an abbreviation dictionary to be built in advance; before similarity is calculated, the abbreviation dictionary is consulted and each abbreviated word in the word segmentation result is converted into the corresponding standard word found in the dictionary.
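The lookup-and-replace step amounts to a dictionary substitution over the segmented tokens. The following is a minimal sketch; the name `expand_abbreviations` and the single dictionary entry (the common Chinese short form for upper respiratory tract infection, which appears to be the example used in the text) are illustrative assumptions.

```python
# Hypothetical abbreviation dictionary: short form -> standard word.
ABBREVIATIONS = {
    "上感": "上呼吸道感染",  # colloquial short form -> upper respiratory tract infection
}

def expand_abbreviations(tokens, dictionary=ABBREVIATIONS):
    """Replace every token found in the dictionary by its standard word."""
    return [dictionary.get(token, token) for token in tokens]

# Tokens not in the dictionary pass through unchanged.
expanded = expand_abbreviations(["上感", "发热"])
```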

As a preferred embodiment, in the embodiment of the present invention, step S4 specifically comprises:

sorting the similarities between the diagnosis statement to be processed and the relevant standard diagnosis statements in the standard library and, according to the sorting result, selecting the standard diagnosis statement most similar to the diagnosis statement to be processed as the standardization result of the diagnosis statement.

Specifically, in the general case, the standard diagnosis statement with the greatest similarity to the diagnosis statement to be processed is selected as its standardization result. However, when the diagnosis statement to be processed is close to several standard diagnoses, the similarities may be close or even equal. When the similarity is below a certain threshold, a specified ranking algorithm is used to screen the standard disease. That is, based on the similarity scores, when the score gap is large, the string (i.e., standard diagnosis statement) with the highest similarity score that meets the threshold is chosen directly; when the score gap is small, the candidates are ranked by the following formula and the most similar string (i.e., standard diagnosis statement) is chosen:

Idx(T) = α log n(T) − (1 − α) len(T)

where n(T) is the frequency based on historical statistics, len(T) is the sentence length, and α is an adjustment parameter.
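The tie-breaking rule can be sketched as follows. Several details are assumptions, not taken from the source: the logarithm is natural, a larger Idx(T) is preferred, "score gap" means the difference between the two best similarity scores, and the values of `gap` and `alpha` are placeholders. All function names are hypothetical.

```python
import math

def rank_index(freq, length, alpha=0.5):
    """Idx(T) = alpha * log n(T) - (1 - alpha) * len(T); natural log assumed."""
    return alpha * math.log(freq) - (1 - alpha) * length

def choose_standard(candidates, gap=0.05, alpha=0.5):
    """Pick a standard statement from (text, similarity, historical_frequency) tuples.

    If the best score leads by more than `gap`, return it directly; otherwise
    rank the near-tied candidates by Idx(T) and return the highest-ranked one.
    Frequencies are assumed to be at least 1.
    """
    if not candidates:
        raise ValueError("no candidates to choose from")
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best = ranked[0]
    if len(ranked) == 1 or best[1] - ranked[1][1] > gap:
        return best[0]
    near_ties = [c for c in ranked if best[1] - c[1] <= gap]
    return max(near_ties, key=lambda c: rank_index(c[2], len(c[0]), alpha))[0]

# Clear winner: the score gap is large, so the top score is returned.
clear = choose_standard([("A fracture", 0.90, 10), ("B", 0.50, 5)])
# Near tie: the shorter, historically more frequent statement wins.
tied = choose_standard([("long name xx", 0.90, 2), ("short", 0.89, 100)])
```

Under these assumptions, a larger α favours statements that were chosen often in the past, while a smaller α favours shorter statements.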

As a preferred embodiment, in the embodiment of the present invention, the method further comprises:

S5: using corrected diagnosis-standard pairs as training corpus, performing supplementary training on the pre-built corresponding models, evaluating the accuracy of the trained models, and replacing the original models with the models that pass the evaluation.

Specifically, in the embodiment of the present invention, the relevant models can be updated periodically at a set interval. For example, when the number of misclassified examples reaches a certain amount, the update flow is executed to obtain new models. During an update, the corrected diagnosis-standard pairs (i.e., the diagnosis statement to be processed and its standard diagnosis statement) are used as training corpus; after they are segmented, the resulting word segmentation is input into the pre-built corresponding models (including the word vector model, the word information content model and so on) for supplementary training. In addition, before the trained models are put back into use, the accuracy of the models after supplementary training must be evaluated, and the corresponding models are updated only if they meet the requirements. Updating the models regularly improves their accuracy and thereby the accuracy of diagnosis standardization.
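The evaluate-then-replace step can be sketched as an accuracy gate. This is an illustration under stated assumptions: a model is represented as a plain callable that maps a diagnosis statement to a standard statement, accuracy is the share of labelled pairs it gets right, and the 0.9 threshold is a placeholder rather than a value from the source.

```python
def accuracy(model, labelled_pairs):
    """Share of (diagnosis, standard) pairs for which the model returns the standard."""
    correct = sum(1 for diagnosis, standard in labelled_pairs if model(diagnosis) == standard)
    return correct / len(labelled_pairs)

def maybe_replace(current, candidate, test_pairs, min_accuracy=0.9):
    """Return the retrained candidate only if it passes the accuracy check."""
    if accuracy(candidate, test_pairs) >= min_accuracy:
        return candidate
    return current

# Hypothetical stand-ins for the original and the retrained model.
test_pairs = [("d1", "s1"), ("d2", "s2")]
old_model = lambda diagnosis: "s1"
new_model = lambda diagnosis: {"d1": "s1", "d2": "s2"}[diagnosis]
active = maybe_replace(old_model, new_model, test_pairs)
```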

After the relevant models are updated, the ranking is recalculated and, based on the statistical results, the formula of the ranking algorithm is updated; in a specific implementation this is done by updating the adjustment parameter α.

In addition, as a preferred embodiment, in the embodiment of the present invention the IDF file may also be updated periodically, and the update cycle can be configured according to the specific needs of the user.

Fig. 2 is a flow chart, according to an exemplary embodiment, of calculating the similarity between the diagnosis statement to be processed and each relevant standard diagnosis statement in the standard library according to the word vectors, word information content and part-of-speech tagging result. Referring to Fig. 2, the calculation comprises the following steps:

S3.1: calculating, according to the word vectors and the part-of-speech tagging result, the word similarity between the word segmentation result and the relevant standard diagnosis statements in the standard library.

Specifically, the similarity between words is usually calculated by one of two kinds of methods: semantic similarity and statistical similarity. Semantic similarity includes traditional methods based on semantic tree structures, represented by WordNet. On the one hand, WordNet is a general-purpose semantic model with no special treatment of medical data, so many medical terms have no corresponding entry in it; on the other hand, maintaining WordNet requires dedicated manual work at a high time cost. Therefore, in the embodiment of the present invention, the data in the ICD-10 (International Classification of Diseases) standard library and the medical corpus are segmented and used as training data to train a skip-gram model (the word vector model), yielding a word vector model for the medical domain. The word vector model maps words into a high-dimensional space according to their arrangement and order of appearance in documents. The spatial cosine distance between two words is first computed from their word vectors and then weighted according to the part-of-speech tagging result, giving the similarity between the words (the word similarity). When word vectors and part-of-speech tagging results are combined to calculate similarity, the weight of a word in the similarity calculation can be adjusted according to its part of speech. For example, the weight of linking words such as "is" is set lower, while the weight of nouns is generally set higher.
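A minimal sketch of the word similarity step follows. The source states only that the cosine value is adjusted by a weight that depends on part of speech; multiplying the cosine by the smaller of the two words' weights is an assumption made here for illustration, as are the weight values, the tag names and the function names.

```python
import math

# Hypothetical part-of-speech weights: nouns weigh more, linking words less.
POS_WEIGHT = {"noun": 1.0, "verb": 0.8, "link": 0.2}
DEFAULT_WEIGHT = 0.5

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def word_similarity(vec_a, pos_a, vec_b, pos_b):
    """Cosine similarity of two word vectors, scaled by a part-of-speech weight."""
    weight = min(POS_WEIGHT.get(pos_a, DEFAULT_WEIGHT), POS_WEIGHT.get(pos_b, DEFAULT_WEIGHT))
    return weight * cosine(vec_a, vec_b)

same_nouns = word_similarity([1.0, 0.0], "noun", [1.0, 0.0], "noun")
with_link = word_similarity([1.0, 0.0], "link", [1.0, 0.0], "noun")
```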

Specifically, the following example illustrates how the embodiment of the present invention calculates the similarity. The process is as follows:

Given two sentences, a diagnosis (the diagnosis statement to be processed) T₁ and a disease (a standard diagnosis statement) T₂, a joint word set is formed:

T = T₁ ∪ T₂ = {w₁, w₂, ..., w_m}

The joint word set T contains all the distinct words from T₁ and T₂. For example:

T₁: right radical bone avulsion fracture

T₂: root bone fracture

Then T={ right radical bone avulsion fracture fracture }

Since the joint word set is derived entirely from the two sentences above, it contains no additional information. The joint word set T can be regarded as the whole semantic information of the two sentences, and the semantic information of each sentence can be expressed in terms of the joint word set. Specifically, the vector derived from the joint word set is called the lexical semantic vector, denoted S = {s₁, s₂, ..., s_m}. Each entry of the lexical semantic vector corresponds to one word in the joint word set, so the dimension of the lexical semantic vector equals the number of words in the joint word set. The value of an entry is determined by the semantic similarity between the corresponding word and the words in the sentence. Taking T₁ as an example:

1. If w_i appears in sentence T₁, then s_i is set to 1, where i ∈ [1, m];

2. If T₁ does not contain w_i, the word similarity calculation method described above is used to compute the semantic similarity score between w_i and each word in sentence T₁. Let w′ be the word in T₁ most similar to w_i, with the highest similarity score θ. If θ exceeds a preset threshold, then s_i = θ; otherwise s_i = 0, where i ∈ [1, m].
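The two rules above can be sketched in a few lines. The tokens, the toy similarity table and the threshold of 0.6 are hypothetical; in the described method the word similarity would come from the word vector model with part-of-speech weighting.

```python
def joint_word_set(t1, t2):
    """Distinct words of both sentences, kept in order of first appearance."""
    joint = []
    for word in t1 + t2:
        if word not in joint:
            joint.append(word)
    return joint

def semantic_vector(joint, sentence, sim, threshold=0.6):
    """Lexical semantic vector of `sentence` over the joint word set.

    Rule 1: a word present in the sentence scores 1.
    Rule 2: otherwise take the best similarity theta against the sentence's
    words and keep it only if it exceeds the threshold, else 0.
    """
    vector = []
    for word in joint:
        if word in sentence:
            vector.append(1.0)
        else:
            theta = max((sim(word, other) for other in sentence), default=0.0)
            vector.append(theta if theta > threshold else 0.0)
    return vector

# Toy symmetric similarity table standing in for the word vector model.
TOY_SIM = {("avulsion", "fracture"): 0.7, ("right", "heel"): 0.3}

def toy_sim(a, b):
    return TOY_SIM.get((a, b), TOY_SIM.get((b, a), 0.0))

t1 = ["right", "heel", "avulsion", "fracture"]
t2 = ["heel", "fracture"]
joint = joint_word_set(t1, t2)
s1 = semantic_vector(joint, t1, toy_sim)
s2 = semantic_vector(joint, t2, toy_sim)
```

Here the entry for "right" in the second vector is zeroed because its best score of 0.3 stays below the threshold, which is the filtering effect the threshold is meant to have.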

It should be noted that the threshold is set because the medical field is stricter about semantic similarity. For example, "scald" and "burn" can be treated as near-synonyms in everyday usage, but in medicine "scald" and "burn" are different diseases.

In addition, different words contribute to the meaning of a sentence to different degrees, so a scheme is needed to weight each word. In the embodiment of the present invention, the importance of a word is measured by its information content: the word information content I(w_i) of each word is obtained by the method in step S2 above, which gives the weighted lexical semantic vector

sw_i = s_i · I(w_i) · I(w′_i)

where w_i is a word in T₁ and w′_i is the word in T corresponding to w_i. The sentence semantic similarity of T₁ and T₂ can then be calculated as the cosine distance between the lexical semantic vectors of T₁ and T₂.
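The closing formula is not reproduced in this text. As a minimal sketch, assuming the usual cosine of the two weighted lexical semantic vectors, the last step could look as follows. The function names and the numeric inputs are illustrative, and the information content values would come from the IDF model described in step S2.

```python
import math

def weighted_vector(semantic, info_word, info_matched):
    """sw_i = s_i * I(w_i) * I(w'_i), applied entry by entry."""
    return [s * a * b for s, a, b in zip(semantic, info_word, info_matched)]

def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def sentence_similarity(sw1, sw2):
    """Sentence similarity as the cosine of the two weighted semantic vectors."""
    return cosine(sw1, sw2)

sw = weighted_vector([1.0, 0.5], [2.0, 2.0], [2.0, 1.0])
identical = sentence_similarity(sw, sw)
```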

Fig. 3 is a schematic structural diagram of a diagnosis standardization device based on word vectors according to an exemplary embodiment. Referring to Fig. 3, the device comprises:

As a preferred embodiment, in the embodiment of the present invention, the construction module further comprises:

As a preferred embodiment, in the embodiment of the present invention, the calculation module comprises:

As a preferred embodiment, in the embodiment of the present invention, the output module comprises:

As a preferred embodiment, in the embodiment of the present invention, the device further comprises:

In summary, the technical solutions provided by the embodiments of the present invention have the following beneficial effects:

It should be understood that, when the diagnosis standardization device based on word vectors provided by the above embodiment triggers a diagnosis standardization service, the division into the functional modules described above is used only as an example. In practical applications, the above functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the device embodiment and the method embodiment for diagnosis standardization based on word vectors belong to the same concept; the specific implementation process is detailed in the method embodiment and is not repeated here.

Those of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc or the like.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims

1. A diagnosis standardization method based on word vectors, characterized in that the method comprises the following steps:

S1: obtaining a diagnosis statement to be processed, and performing word segmentation on the diagnosis statement to be processed to obtain a word segmentation result;

S2: establishing, according to the word segmentation result and pre-built corresponding models, the word vectors, word information content and part-of-speech tagging result of the diagnosis statement to be processed;

S3: calculating, according to the word vectors, word information content and part-of-speech tagging result, the similarity between the diagnosis statement to be processed and each relevant standard diagnosis statement in a standard library;

S4: selecting the standard diagnosis statement most similar to the diagnosis statement to be processed as the standardization result of the diagnosis statement.

2. The diagnosis standardization method based on word vectors according to claim 1, characterized in that, before establishing the word vectors, word information content and part-of-speech tagging result of the diagnosis statement to be processed, the method further comprises:

converting, according to a pre-built abbreviation dictionary, the abbreviations in the word segmentation result into the corresponding standard words.

3. The diagnosis standardization method based on word vectors according to claim 1 or 2, characterized in that step S3 specifically comprises:

S3.1: calculating, according to the word vectors and the part-of-speech tagging result, the word similarity between the word segmentation result and the relevant standard diagnosis statements in the standard library;

S3.2: calculating, according to the word similarity and the word information content, the similarity between the diagnosis statement to be processed and each relevant standard diagnosis statement in the standard library.

4. The diagnosis standardization method based on word vectors according to claim 1 or 2, characterized in that step S4 specifically comprises:

sorting the similarities between the diagnosis statement to be processed and the relevant standard diagnosis statements in the standard library and, according to the sorting result, selecting the standard diagnosis statement most similar to the diagnosis statement to be processed as the standardization result of the diagnosis statement.

5. The diagnosis standardization method based on word vectors according to claim 1 or 2, characterized in that the method further comprises:

S5: using corrected diagnosis-standard pairs as training corpus, performing supplementary training on the pre-built corresponding models, evaluating the accuracy of the trained models, and replacing the original models with the models that pass the evaluation.

6. A diagnosis standardization device based on word vectors, characterized in that the device comprises:

a word segmentation module, configured to obtain a diagnosis statement to be processed and perform word segmentation on it to obtain a word segmentation result;

a construction module, configured to establish, according to the word segmentation result and pre-built corresponding models, the word vectors, word information content and part-of-speech tagging result of the diagnosis statement to be processed;

a calculation module, configured to calculate, according to the word vectors, word information content and part-of-speech tagging result, the similarity between the diagnosis statement to be processed and each relevant standard diagnosis statement in a standard library;

an output module, configured to select the standard diagnosis statement most similar to the diagnosis statement to be processed as the standardization result of the diagnosis statement.

7. The diagnosis standardization device based on word vectors according to claim 6, characterized in that the construction module further comprises:

a replacement unit, configured to convert, according to a pre-built abbreviation dictionary, the abbreviations in the word segmentation result into the corresponding standard words.

8. The diagnosis standardization device based on word vectors according to claim 6 or 7, characterized in that the calculation module comprises:

a word similarity calculation unit, configured to calculate, according to the word vectors and the part-of-speech tagging result, the word similarity between the word segmentation result and the relevant standard diagnosis statements in the standard library;

a sentence similarity calculation unit, configured to calculate, according to the word similarity and the word information content, the similarity between the diagnosis statement to be processed and each relevant standard diagnosis statement in the standard library.

9. The diagnosis standardization device based on word vectors according to claim 6 or 7, characterized in that the output module comprises:

a sorting unit, configured to sort the similarities between the diagnosis statement to be processed and the relevant standard diagnosis statements in the standard library;

a selection unit, configured to select, according to the sorting result, the standard diagnosis statement most similar to the diagnosis statement to be processed as the standardization result of the diagnosis statement.

10. The diagnosis standardization device based on word vectors according to claim 6 or 7, characterized in that the device further comprises:

an update module, configured to use corrected diagnosis-standard pairs as training corpus, perform supplementary training on the pre-built corresponding models, evaluate the accuracy of the trained models, and replace the original models with the models that pass the evaluation.