CN109697286A - Diagnosis standardization method and device based on word vectors - Google Patents

Info

Publication number
CN109697286A
Authority
CN
China
Prior art keywords
sentence
diagnosis
word
processed
term vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811551703.8A
Other languages
Chinese (zh)
Inventor
李玉娇
陆王天宇
谭炎
吴栋梁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongan Information Technology Service Co Ltd
Original Assignee
Zhongan Information Technology Service Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongan Information Technology Service Co Ltd
Priority to CN201811551703.8A (patent CN109697286A)
Priority to PCT/CN2019/080416 (patent WO2020124856A1)
Publication of CN109697286A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a diagnosis standardization method and device based on word vectors. The method comprises: S1: obtaining a diagnosis sentence to be processed and performing word segmentation on it to obtain a segmentation result; S2: establishing, from the segmentation result and pre-built corresponding models, the word vectors, word information content, and part-of-speech tagging result of the diagnosis sentence to be processed; S3: calculating, from the word vectors, word information content, and part-of-speech tagging result, the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in the standard library; S4: selecting the standard diagnosis sentence most similar to the diagnosis sentence to be processed as the standardization result of that diagnosis sentence. By calculating the semantic similarity between the diagnosis sentence to be processed and the standard diagnosis sentences closest to the current diagnosis, selecting the standard diagnosis sentence with the highest similarity as the standardization result, and periodically updating the corresponding models, the invention improves the accuracy of diagnosis standardization.

Description

Diagnosis standardization method and device based on word vectors
Technical field
The present invention relates to the technical field of data processing, and in particular to a diagnosis standardization method and device based on word vectors.
Background
Standardizing disease diagnoses is important for insurance claim settlement and for medical research statistics. Current disease standards include ICD-10 (International Classification of Diseases), published by an international authority. However, different hospitals use different ICD-10 diagnostic standards, and doctors' handwritten diagnoses often deviate from the standard wording. How to bring the non-standard diagnoses of different hospitals under one unified standard is therefore a problem of real practical significance.
Current diagnosis standardization methods have the following problems:
1. They rely entirely on unsupervised natural language processing and make no use of existing annotated resources, so their accuracy is low;
2. They rely on manual revision and annotation, which must cover the different names of the same ICD-10 disease and the frequent revisions of ICD-10, and so consume a large amount of human resources.
Summary of the invention
To solve the problems in the prior art, embodiments of the invention provide a diagnosis standardization method and device based on word vectors. They address prior-art problems such as the heavy human-resource cost of relying on manual revision and annotation, and the low accuracy of purely unsupervised natural language processing methods that make no use of existing annotated resources.
To solve the above technical problems, the invention adopts the following technical solution:
In one aspect, a diagnosis standardization method based on word vectors is provided. The method includes the following steps:
S1: obtaining a diagnosis sentence to be processed and performing word segmentation on it to obtain a segmentation result;
S2: establishing, from the segmentation result and pre-built corresponding models, the word vectors, word information content, and part-of-speech tagging result of the diagnosis sentence to be processed;
S3: calculating, from the word vectors, word information content, and part-of-speech tagging result, the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in the standard library;
S4: selecting the standard diagnosis sentence most similar to the diagnosis sentence to be processed as the standardization result of that diagnosis sentence.
Further, before the word vectors, word information content, and part-of-speech tagging result of the diagnosis sentence to be processed are established, the method also includes:
converting the abbreviations in the segmentation result into the corresponding standard words according to a pre-built abbreviation dictionary.
Further, step S3 specifically includes:
S3.1: calculating, from the word vectors and the part-of-speech tagging result, the word similarity between the segmentation result and the relevant standard diagnosis sentences in the standard library;
S3.2: calculating, from the word similarity and the word information content, the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in the standard library.
Further, step S4 specifically includes:
sorting the similarities between the diagnosis sentence to be processed and the relevant standard diagnosis sentences in the standard library and, according to the sorted result, selecting the standard diagnosis sentence most similar to the diagnosis sentence to be processed as the standardization result of that diagnosis sentence.
Further, the method also includes:
S5: using corrected diagnosis-standard pairs as training corpus, performing supplementary training on the pre-built corresponding models, evaluating the accuracy of the retrained models, and replacing the original models with the models that pass the evaluation.
In another aspect, a diagnosis standardization device based on word vectors is provided. The device includes:
a word segmentation module, for obtaining a diagnosis sentence to be processed and performing word segmentation on it to obtain a segmentation result;
a construction module, for establishing, from the segmentation result and pre-built corresponding models, the word vectors, word information content, and part-of-speech tagging result of the diagnosis sentence to be processed;
a computing module, for calculating, from the word vectors, word information content, and part-of-speech tagging result, the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in the standard library;
an output module, for selecting the standard diagnosis sentence most similar to the diagnosis sentence to be processed as the standardization result of that diagnosis sentence.
Further, the construction module also includes:
a replacement unit, for converting the abbreviations in the segmentation result into the corresponding standard words according to a pre-built abbreviation dictionary.
Further, the computing module includes:
a word similarity computing unit, for calculating, from the word vectors and the part-of-speech tagging result, the word similarity between the segmentation result and the relevant standard diagnosis sentences in the standard library;
a sentence similarity computing unit, for calculating, from the word similarity and the word information content, the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in the standard library.
Further, the output module includes:
a sorting unit, for sorting the similarities between the diagnosis sentence to be processed and the relevant standard diagnosis sentences in the standard library;
a selection unit, for selecting, according to the sorted result, the standard diagnosis sentence most similar to the diagnosis sentence to be processed as the standardization result of that diagnosis sentence.
Further, the device also includes:
an update module, for using corrected diagnosis-standard pairs as training corpus, performing supplementary training on the pre-built corresponding models, evaluating the accuracy of the retrained models, and replacing the original models with the models that pass the evaluation.
The technical solution provided by embodiments of the invention has the following beneficial effects:
1. The diagnosis standardization method and device based on word vectors calculate the semantic similarity between the diagnosis sentence to be processed and the standard diagnosis sentences closest to the current diagnosis, select the standard diagnosis sentence with the highest similarity as the standardization result, and periodically update the corresponding models, which improves the accuracy of diagnosis standardization;
2. The diagnosis standardization method and device based on word vectors standardize diagnoses automatically through the corresponding models, which greatly reduces the manpower that diagnosis standardization consumes, and also provide a complete calculation method and process for subsequent, more precise standardization.
Brief description of the drawings
To describe the technical solutions in the embodiments of the invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the diagnosis standardization method based on word vectors according to an exemplary embodiment;
Fig. 2 is a flow chart, according to an exemplary embodiment, of calculating the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in the standard library from the word vectors, word information content, and part-of-speech tagging result;
Fig. 3 is a structural schematic diagram of the diagnosis standardization device based on word vectors according to an exemplary embodiment.
Detailed description
To make the objectives, technical solutions, and advantages of the invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some of the embodiments of the invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the invention.
The diagnosis standardization method and device based on word vectors provided by embodiments of the invention obtain, by calculation, the standard disease closest to the current diagnosis (that is, the diagnosis sentence to be processed), specifically by calculating the semantic distance of diagnosis-disease pairs. The diagnosis-disease semantic distance is a semantic similarity measure. It is computed from the diagnosis sentence and the disease sentence individually and from the information increment of the diagnosis-disease sentence pair.
Fig. 1 is a flow chart of the diagnosis standardization method based on word vectors according to an exemplary embodiment. Referring to Fig. 1, the method includes the following steps:
S1: obtaining a diagnosis sentence to be processed and performing word segmentation on it to obtain a segmentation result.
Specifically, in embodiments of the invention, word segmentation uses jieba in search-engine mode with HMM disabled. On the one hand, search-engine mode produces a finer-grained segmentation result, which facilitates the similarity calculation with the word vector model. On the other hand, disabling HMM ensures that the same sentence always yields the same segmentation result, which avoids unnecessary similarity calculations and improves speed and accuracy.
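The segmentation step can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `segment` and its injectable `cut` parameter are hypothetical, and the default branch assumes the third-party jieba package is installed and uses its `cut_for_search` call with `HMM=False`.

```python
from typing import Callable, Iterable, List, Optional


def segment(sentence: str,
            cut: Optional[Callable[[str], Iterable[str]]] = None) -> List[str]:
    """Split a diagnosis sentence into tokens, dropping empty or blank tokens.

    `cut` is a hypothetical hook so that another tokenizer can be swapped in.
    """
    if cut is None:
        # Assumption: the third-party jieba package is available.
        import jieba
        # Search-engine mode yields fine-grained tokens; HMM=False turns off
        # HMM-based discovery of words that are not in the dictionary.
        cut = lambda s: jieba.cut_for_search(s, HMM=False)
    return [tok for tok in cut(sentence) if tok.strip()]
```

Passing a tokenizer explicitly, for example `segment(text, cut=str.split)`, keeps the rest of the pipeline independent of any particular segmentation library.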
S2: establishing, from the segmentation result and pre-built corresponding models, the word vectors, word information content, and part-of-speech tagging result of the diagnosis sentence to be processed.
Specifically, the pre-built models include a word vector model, a word information content model, and a part-of-speech tagging model. The segmentation result is fed into the word vector model to obtain the word vectors of the diagnosis sentence to be processed, into the word information content model to obtain its word information content, and into the part-of-speech tagging model to obtain the part-of-speech tagging result.
Before the word vector model and the word information content model are built, a training corpus set and a test corpus set must be prepared. In embodiments of the invention, the data in the ICD-10 (International Classification of Diseases) standard library and a medical corpus are obtained and randomly divided into a training corpus set and a test corpus set. The medical corpus can be obtained from sources such as medical diagnoses, online medical forums, and medical e-books. It needs a certain degree of relevance to medicine, which improves the accuracy of the models. If no such corpus can be obtained, a general-purpose corpus can be used instead, at some cost to model accuracy.
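The random division into training and test sets could look like the sketch below. The split ratio, the seed, and the name `split_corpus` are assumptions for illustration; the source does not state how the split is proportioned.

```python
import random
from typing import List, Sequence, Tuple


def split_corpus(sentences: Sequence[str], test_ratio: float = 0.2,
                 seed: int = 0) -> Tuple[List[str], List[str]]:
    """Randomly divide sentences into (training set, test set)."""
    shuffled = list(sentences)
    # A seeded generator makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]
```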
The word vector model and the word information content model are set up first. The sentences in the training corpus set are then segmented, and the segmentation results are fed into the word vector model or the word information content model to train it until it converges. After training, the sentences in the test corpus set are segmented and the segmentation results are fed into the trained word vector model or word information content model to test it and verify its accuracy.
In embodiments of the invention, the word vector model is a skip-gram model. Skip-gram is a neural-network-based word vector model: according to the arrangement and order of appearance of words in documents, it maps words into a high-dimensional space, where the similarity between words can be obtained by calculating the cosine distance. The data in the ICD-10 (International Classification of Diseases) standard library and the medical corpus are segmented and used as training data to train the skip-gram model, yielding a word vector model for the medical domain.
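Skip-gram learns word vectors by predicting the context words around each center word. The sketch below only shows how (center, context) training pairs could be derived from a segmented sentence; the window size and the name `skipgram_pairs` are assumptions, and the neural-network training itself (normally done with a word-embedding library) is not shown.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for every token within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs
```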
Word information content is obtained by calculating IDF values over the relevant corpus. IDF stands for inverse document frequency. Let N be the number of documents in the document collection and n_k the number of documents in which term k appears. The fewer documents contain a term, the larger its IDF, which indicates that the term discriminates well between categories and carries more information. Embodiments of the invention use IDF as the weight connecting word similarity and sentence similarity. The formula for word information content did not survive in this text; the standard IDF definition consistent with the symbols above is:
$I(w_k) = \log \dfrac{N}{n_k}$
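A minimal sketch of computing word information content as IDF over a segmented corpus follows. It assumes the standard form log(N / n_k) with the natural logarithm; the patent's exact formula and log base are not recoverable from this text, and the name `idf_table` is hypothetical.

```python
import math
from typing import Dict, Iterable, List


def idf_table(documents: Iterable[List[str]]) -> Dict[str, float]:
    """Map each term to log(N / n_k), where n_k counts documents containing it."""
    docs = [set(doc) for doc in documents]  # count each term once per document
    n_docs = len(docs)
    doc_freq: Dict[str, int] = {}
    for doc in docs:
        for term in doc:
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / n_k) for term, n_k in doc_freq.items()}
```

A term that occurs in every document gets an information content of 0, while a term that occurs in a single document gets the maximum value log(N).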
Part-of-speech tagging uses the Stanford POS tagger model; the tagged parts of speech include noun, verb, predicate, and so on. When similarity is calculated, a word's weight in the similarity calculation can be chosen according to its part of speech. For example, linking words such as "is" are given smaller weights, while nouns are generally given larger weights.
S3: calculating, from the word vectors, word information content, and part-of-speech tagging result, the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in the standard library.
Specifically, as a preferred embodiment, before the similarity between the diagnosis sentence to be processed and the relevant standard diagnosis sentences in the standard library is calculated, the relevant standard diagnosis sentences are first segmented to obtain their segmentation results. The corresponding word vectors, word information content, and part-of-speech tagging results are then established from those segmentation results. Finally, the word vectors, word information content, and part-of-speech tagging results of the standard diagnosis sentences are combined with those of the diagnosis sentence to be processed to calculate the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence. The similarity between words can be obtained by calculating the cosine distance in the vector space. In embodiments of the invention, the standard library includes the ICD-10 (International Classification of Diseases) standard library, among others. Before the similarity is calculated, the standard diagnosis sentences relevant to the diagnosis sentence to be processed are first preliminarily screened from the standard library; one way to do this screening is keyword retrieval.
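The preliminary screening by keyword retrieval might be sketched as below. The rule used here (keep any standard sentence that shares at least one token with the query) is an assumption, as are the name `prefilter` and the dictionary layout; the source only says that screening is done by keyword retrieval.

```python
from typing import Dict, List


def prefilter(query_tokens: List[str],
              standard_library: Dict[str, List[str]]) -> List[str]:
    """Return the standard sentences that share at least one token with the query.

    `standard_library` maps each standard diagnosis sentence to its tokens.
    """
    query = set(query_tokens)
    return [sentence for sentence, tokens in standard_library.items()
            if query & set(tokens)]
```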
As a preferred embodiment, before the word vectors, word information content, and part-of-speech tagging result of the diagnosis sentence to be processed are established, the method also includes:
converting the abbreviations in the segmentation result into the corresponding standard words according to a pre-built abbreviation dictionary.
Specifically, common colloquial abbreviations can be mapped to the standard wording by lookup and replacement, for example "上感" -> "上呼吸道感染" (upper respiratory tract infection). Embodiments of the invention require an abbreviation dictionary to be built in advance. Before the similarity is calculated, the abbreviation dictionary is consulted and each abbreviated word in the segmentation result is converted into the corresponding standard word found in the dictionary.
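The lookup-and-replace step can be sketched with a plain dictionary. The single entry below is the example given in the text; the names `ABBREVIATIONS` and `expand_abbreviations` are hypothetical.

```python
from typing import Dict, List

# Pre-built abbreviation dictionary: colloquial abbreviation -> standard word.
ABBREVIATIONS: Dict[str, str] = {"上感": "上呼吸道感染"}


def expand_abbreviations(tokens: List[str],
                         abbreviations: Dict[str, str] = ABBREVIATIONS) -> List[str]:
    """Replace each token found in the dictionary; leave other tokens unchanged."""
    return [abbreviations.get(tok, tok) for tok in tokens]
```

A replaced standard word may span several dictionary words, so a real pipeline would probably segment it again before the model lookups; that step is left out here.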
S4: selecting the standard diagnosis sentence most similar to the diagnosis sentence to be processed as the standardization result of that diagnosis sentence.
As a preferred embodiment, step S4 specifically includes:
sorting the similarities between the diagnosis sentence to be processed and the relevant standard diagnosis sentences in the standard library and, according to the sorted result, selecting the standard diagnosis sentence most similar to the diagnosis sentence to be processed as the standardization result of that diagnosis sentence.
Specifically, under normal circumstances the standard diagnosis sentence with the greatest similarity to the diagnosis sentence to be processed can be chosen as the standardization result. However, when the diagnosis sentence to be processed is close to several standard diagnoses, their similarities may be close or even equal. When the similarity falls below a certain threshold, a specified ranking algorithm is used to select the standard disease. That is, based on the similarity scores: when the score gap is large, the string (the standard diagnosis sentence) with the greatest similarity score that satisfies the threshold is chosen directly; when the score gap is small, the candidates are ranked by a formula and the top-ranked string (standard diagnosis sentence) is chosen. The formula is as follows:
$\mathrm{Idx}(T) = \alpha \log n(T) - (1 - \alpha)\,\mathrm{len}(T)$
where $n(T)$ is the frequency based on historical statistics, $\mathrm{len}(T)$ is the sentence length, and $\alpha$ is a tuning parameter.
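The selection rule and the ranking formula can be sketched as follows. Several details are assumptions not stated in the source: the natural logarithm, sentence length measured in characters, a historical frequency of at least 1, and the gap threshold `min_gap` that decides when scores count as close. The names `Candidate`, `idx`, and `choose_standard` are hypothetical.

```python
import math
from typing import List, NamedTuple


class Candidate(NamedTuple):
    text: str          # standard diagnosis sentence T
    similarity: float  # similarity to the diagnosis sentence being processed
    frequency: int     # n(T): historical frequency, assumed to be >= 1


def idx(candidate: Candidate, alpha: float) -> float:
    """Idx(T) = alpha * log n(T) - (1 - alpha) * len(T)."""
    return (alpha * math.log(candidate.frequency)
            - (1 - alpha) * len(candidate.text))


def choose_standard(candidates: List[Candidate], alpha: float = 0.5,
                    min_gap: float = 0.05) -> Candidate:
    """Pick the top similarity when it clearly leads; otherwise rank by Idx."""
    if not candidates:
        raise ValueError("no candidate standard diagnosis sentences")
    ranked = sorted(candidates, key=lambda c: c.similarity, reverse=True)
    best = ranked[0]
    # Candidates whose score is within min_gap of the best count as close.
    close = [c for c in ranked if best.similarity - c.similarity < min_gap]
    if len(close) == 1:
        return best
    return max(close, key=lambda c: idx(c, alpha))
```

With this form of Idx, frequent and short standard sentences are preferred among near-ties, and alpha balances the two factors.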
As a preferred embodiment, the method also includes:
S5: using corrected diagnosis-standard pairs as training corpus, performing supplementary training on the pre-built corresponding models, evaluating the accuracy of the retrained models, and replacing the original models with the models that pass the evaluation.
Specifically, in embodiments of the invention the relevant models can be updated periodically at a set interval. For example, when the number of misclassified cases reaches a certain count, the update process is executed to obtain new models. During an update, the corrected diagnosis-standard pairs (that is, diagnosis sentences to be processed paired with standard diagnosis sentences) are used as training corpus. After they are segmented, the segmentation results are fed into the pre-built corresponding models (including the word vector model, the word information content model, and so on) for supplementary training. Before the retrained models are put back into use, their accuracy must also be evaluated, and the models are updated only if they meet the requirements. Updating the models regularly improves model accuracy and thereby the accuracy of diagnosis standardization.
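The evaluate-then-replace logic of the update step could be sketched like this. The model is abstracted as a function from a diagnosis sentence to a standard sentence, and the acceptance threshold `min_accuracy` and the names `accuracy` and `maybe_replace` are assumptions; the source only says that the retrained model must pass an accuracy evaluation.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]        # (diagnosis sentence, corrected standard sentence)
Model = Callable[[str], str]  # maps a diagnosis sentence to a standard sentence


def accuracy(model: Model, pairs: List[Pair]) -> float:
    """Fraction of evaluation pairs for which the model returns the standard."""
    if not pairs:
        return 0.0
    return sum(model(diag) == std for diag, std in pairs) / len(pairs)


def maybe_replace(current: Model, retrained: Model, eval_pairs: List[Pair],
                  min_accuracy: float = 0.9) -> Model:
    """Adopt the retrained model only if it passes the accuracy evaluation."""
    if accuracy(retrained, eval_pairs) >= min_accuracy:
        return retrained
    return current
```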
After the relevant models are updated, the ranking is recalculated and, based on the statistics, the formula of the ranking algorithm is updated; in practice this is done by updating the tuning parameter α.
In addition, as a preferred embodiment, the IDF file can also be updated periodically; the update period can be configured according to the user's specific needs.
Fig. 2 is a flow chart, according to an exemplary embodiment, of calculating the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in the standard library from the word vectors, word information content, and part-of-speech tagging result. Referring to Fig. 2, the calculation includes the following steps:
S3.1: calculating, from the word vectors and the part-of-speech tagging result, the word similarity between the segmentation result and the relevant standard diagnosis sentences in the standard library.
Specifically, there are generally two approaches to calculating the similarity between words: semantic similarity and statistical similarity. Semantic similarity includes traditional methods based on a semantic tree structure, with WordNet as the representative. On the one hand, WordNet is a general-purpose semantic model that is not specially adapted to medical data, so a large number of medical terms have no corresponding entry in it. On the other hand, WordNet requires dedicated manual maintenance, at a high cost in time. Therefore, in embodiments of the invention, the data in the ICD-10 (International Classification of Diseases) standard library and the medical corpus are segmented and used as training data to train the skip-gram model (the word vector model), yielding a word vector model for the medical domain. The word vector model maps words into a high-dimensional space according to their arrangement and order of appearance in documents. The cosine distance between two words in that space is first calculated from their word vectors and then adjusted by weights derived from the part-of-speech tagging result, giving the similarity between the words (the word similarity). When word vectors and part-of-speech tags are combined in this way, a word's weight in the similarity calculation can be adjusted according to its part of speech. For example, linking words such as "is" are given smaller weights, while nouns are generally given larger weights.
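A sketch of the word similarity in S3.1, the cosine of two word vectors scaled by a part-of-speech weight, is given below. The weight values, the tag names, and the choice to take the smaller of the two words' weights are all assumptions for illustration; the source says only that linking words get small weights and nouns larger ones.

```python
import math
from typing import Dict, Sequence

# Assumed part-of-speech weights; the source gives no concrete values.
POS_WEIGHT: Dict[str, float] = {"noun": 1.0, "verb": 0.8, "conjunction": 0.2}


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def word_similarity(vec_a: Sequence[float], pos_a: str,
                    vec_b: Sequence[float], pos_b: str,
                    default_weight: float = 0.5) -> float:
    """Cosine similarity scaled by the smaller of the two part-of-speech weights."""
    weight = min(POS_WEIGHT.get(pos_a, default_weight),
                 POS_WEIGHT.get(pos_b, default_weight))
    return weight * cosine(vec_a, vec_b)
```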
S3.2: calculating, from the word similarity and the word information content, the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in the standard library.
Specifically, the following example illustrates how embodiments of the invention calculate sentence similarity. The procedure is as follows:
Given two sentences, the diagnosis (the diagnosis sentence to be processed) $T_1$ and the disease (the standard diagnosis sentence) $T_2$, a joint word set is formed:
$T = T_1 \cup T_2 = \{w_1, w_2, \ldots, w_m\}$
The joint word set $T$ contains all the distinct words from $T_1$ and $T_2$. For example:
$T_1$: right calcaneus avulsion fracture
$T_2$: calcaneus fracture
Then $T$ = {right, calcaneus, avulsion fracture, fracture}
Because the joint word set is derived entirely from the two sentences, it carries no additional information. The joint word set $T$ can be regarded as the full semantic information of the two sentences, and the semantic information of each sentence can be expressed in terms of it. Specifically, the vector derived from the joint word set is called the lexical semantic vector and is denoted $S = \{s_1, s_2, \ldots, s_m\}$. Each entry of the lexical semantic vector corresponds to one word in the joint word set, so its dimension equals the number of words in the joint word set. The value of an entry is determined by the semantic similarity between the corresponding word and the words in the sentence. Taking $T_1$ as an example:
1. If $w_i$ appears in sentence $T_1$, then $s_i$ is set to 1, where $i \in [1, m]$;
2. If $T_1$ does not contain $w_i$, the word similarity method above is used to calculate the semantic similarity score between $w_i$ and each word in $T_1$. Let $w'$ be the word in $T_1$ most similar to $w_i$, with the highest similarity score $\theta$. If $\theta$ exceeds a preset threshold, then $s_i = \theta$; otherwise $s_i = 0$, where $i \in [1, m]$.
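The two rules above can be sketched as a function that builds the lexical semantic vector of one sentence against the joint word set. The threshold value and the name `semantic_vector` are assumptions, and `word_sim` stands for whatever word similarity function is in use.

```python
from typing import Callable, List


def semantic_vector(joint_words: List[str], sentence_words: List[str],
                    word_sim: Callable[[str, str], float],
                    threshold: float = 0.6) -> List[float]:
    """Build the lexical semantic vector of a sentence over the joint word set."""
    present = set(sentence_words)
    vector = []
    for w in joint_words:
        if w in present:
            vector.append(1.0)  # rule 1: the word occurs in the sentence
            continue
        # Rule 2: take the best similarity to any word of the sentence and
        # keep it only if it exceeds the preset threshold.
        best = max((word_sim(w, other) for other in sentence_words), default=0.0)
        vector.append(best if best > threshold else 0.0)
    return vector
```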
Note that the threshold is set because the medical field is stricter about semantic similarity. For example, "scald" and "burn" can be understood as near-synonyms in everyday usage, whereas in medicine they are different diseases.
In addition, different words contribute to the meaning of a sentence to different degrees, so a scheme is needed to weight each word. In embodiments of the invention, the importance of a word is measured by its information content: the word information content $I(w_i)$ of each word is obtained by the method in step S2 above, giving the weighted lexical semantic vector
$sw_i = s_i \cdot I(w_i) \cdot I(w'_i)$
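where $w_i$ is a word in the joint word set $T$ and $w'_i$ is the word in $T_1$ corresponding to $w_i$. The sentence semantic similarity of $T_1$ and $T_2$ can then be calculated as the cosine distance between their lexical semantic vectors. The formula did not survive in this text; the standard cosine form consistent with the definitions above, writing $sw^{(1)}$ and $sw^{(2)}$ for the weighted vectors of $T_1$ and $T_2$, is:
$\mathrm{Sim}(T_1, T_2) = \dfrac{sw^{(1)} \cdot sw^{(2)}}{\lVert sw^{(1)} \rVert \, \lVert sw^{(2)} \rVert}$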
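The weighting and the final cosine step might be sketched as follows. Each entry carries a joint-set word, its matched word in the sentence, and the semantic-vector value s_i. This layout, the treatment of unknown words as having zero information content, and the function names are assumptions; the cosine form is the standard one, since the patent's own formula is missing from this text.

```python
import math
from typing import Dict, List, Tuple

Entry = Tuple[str, str, float]  # (w_i, matched word w'_i, s_i)


def weighted_vector(entries: List[Entry], info: Dict[str, float]) -> List[float]:
    """sw_i = s_i * I(w_i) * I(w'_i); words missing from `info` count as 0."""
    return [s * info.get(w, 0.0) * info.get(match, 0.0)
            for w, match, s in entries]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sentence_similarity(entries_1: List[Entry], entries_2: List[Entry],
                        info: Dict[str, float]) -> float:
    """Cosine between the weighted lexical semantic vectors of two sentences."""
    return cosine(weighted_vector(entries_1, info),
                  weighted_vector(entries_2, info))
```

Both entry lists must follow the same joint word set order so that the two vectors line up position by position.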
Fig. 3 is a structural schematic diagram of the diagnosis standardization device based on word vectors according to an exemplary embodiment. Referring to Fig. 3, the device includes:
a word segmentation module, for obtaining a diagnosis sentence to be processed and performing word segmentation on it to obtain a segmentation result;
a construction module, for establishing, from the segmentation result and pre-built corresponding models, the word vectors, word information content, and part-of-speech tagging result of the diagnosis sentence to be processed;
a computing module, for calculating, from the word vectors, word information content, and part-of-speech tagging result, the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in the standard library;
an output module, for selecting the standard diagnosis sentence most similar to the diagnosis sentence to be processed as the standardization result of that diagnosis sentence.
As a preferred embodiment, the construction module also includes:
a replacement unit, for converting the abbreviations in the segmentation result into the corresponding standard words according to a pre-built abbreviation dictionary.
As a preferred implementation, in this embodiment of the invention the computing module includes:
A word similarity computing unit, configured to calculate, according to the word vectors and the part-of-speech tagging result, the word similarity between the word segmentation result and each relevant standard diagnosis sentence in the standard library;
A sentence similarity computing unit, configured to calculate, according to the word similarity and the word information content, the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in the standard library.
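One way to picture the word similarity computing unit is shown below. This passage does not spell out how the part-of-speech tags enter the calculation, so the sketch makes an explicit assumption: each word of the input sentence is compared only with words of the standard sentence that carry the same tag, and the highest cosine similarity of their word vectors is kept. All names are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

TaggedWord = Tuple[str, str]  # (word, part-of-speech tag)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def word_similarities(
    t1: List[TaggedWord],
    t2: List[TaggedWord],
    vectors: Dict[str, Sequence[float]],
) -> List[float]:
    """For each word of t1, return its best similarity to a same-tag word of t2.

    Assumption of this sketch: words with different tags are never matched,
    and a word without a vector or without a same-tag partner scores 0.0.
    """
    sims: List[float] = []
    for word, pos in t1:
        candidates = [
            cosine(vectors[word], vectors[other])
            for other, other_pos in t2
            if other_pos == pos and word in vectors and other in vectors
        ]
        sims.append(max(candidates, default=0.0))
    return sims
```

The resulting list plays the role of the per-word similarities that the sentence similarity computing unit then combines with the word information content.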
As a preferred implementation, in this embodiment of the invention the output module includes:
A sorting unit, configured to sort the similarities between the diagnosis sentence to be processed and the relevant standard diagnosis sentences in the standard library;
A selection unit, configured to select, according to the sorting result, the standard diagnosis sentence most similar to the diagnosis sentence to be processed as the standardization result of that diagnosis sentence.
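The sorting and selection units amount to ranking the candidate standard sentences by similarity and taking the top one. A minimal sketch, assuming the similarities are already available as a mapping from standard sentence to score (the function names are hypothetical):

```python
from typing import Dict, List, Optional, Tuple


def rank_candidates(similarities: Dict[str, float]) -> List[Tuple[str, float]]:
    """Sort (standard sentence, similarity) pairs from most to least similar."""
    return sorted(similarities.items(), key=lambda item: item[1], reverse=True)


def select_standard(similarities: Dict[str, float]) -> Optional[str]:
    """Return the most similar standard sentence, or None if there are no candidates."""
    ranked = rank_candidates(similarities)
    if not ranked:
        return None
    return ranked[0][0]
```

Keeping the full ranking rather than only the maximum makes it easy to inspect runner-up candidates when a result is later corrected.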
As a preferred implementation, in this embodiment of the invention the device further includes:
An update module, configured to use corrected diagnosis-standard pairs as training corpus for supplementary training of the corresponding pre-built model, to evaluate the accuracy of the trained model, and to replace the original model with a model that passes the evaluation.
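The evaluate-then-replace step of the update module can be sketched as follows. The training itself is omitted, a model is represented simply as a function from a raw diagnosis to a standard diagnosis, and the pass criterion (an accuracy threshold of 0.9) is an assumption of this sketch, since the text does not say what counts as a qualified model.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]        # (raw diagnosis, corrected standard diagnosis)
Model = Callable[[str], str]  # hypothetical model interface


def accuracy(model: Model, pairs: List[Pair]) -> float:
    """Fraction of evaluation pairs for which the model predicts the standard."""
    if not pairs:
        return 0.0
    correct = sum(1 for raw, standard in pairs if model(raw) == standard)
    return correct / len(pairs)


def maybe_replace(
    current: Model,
    candidate: Model,
    eval_pairs: List[Pair],
    threshold: float = 0.9,  # assumed pass criterion, not from the source
) -> Model:
    """Return the retrained candidate only if it passes the accuracy check."""
    if accuracy(candidate, eval_pairs) >= threshold:
        return candidate
    return current
```

Gating the swap on a held-out evaluation keeps a poorly retrained model from replacing one that is already in service.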
In conclusion technical solution provided in an embodiment of the present invention has the benefit that
1, the diagnostic standardization method and device provided by the invention based on term vector, by calculating diagnosis language to be processed The semantic similarity of sentence and the immediate standard diagnostics sentence of Current Diagnostic is chosen and diagnosis statement similarity highest to be processed Standard diagnostics sentence corresponding model is updated as standardization result, and periodically, improve the correct of diagnostic standardization Rate;
2, the diagnostic standardization method and device provided by the invention based on term vector, is examined automatically by corresponding model Disconnected standardization greatly reduces the manpower consumption of diagnostic standardization, while can provide complete meter for subsequent accurate standardization Calculation method and process.
It should be understood that the diagnostic standardization device provided by the above embodiment based on term vector is in triggering diagnostic criteria When change business, only the example of the division of the above functional modules, in practical application, can according to need and will be above-mentioned Function distribution is completed by different functional modules, i.e., the internal structure of device is divided into different functional modules, with complete with The all or part of function of upper description.In addition, the diagnostic standardization device and base provided by the above embodiment based on term vector Belong to same design in the diagnostic standardization embodiment of the method for term vector, specific implementation process is detailed in embodiment of the method, here It repeats no more.
Those of ordinary skill in the art will appreciate that realizing that all or part of the steps of above-described embodiment can pass through hardware It completes, relevant hardware can also be instructed to complete by program, the program can store in a kind of computer-readable In storage medium, storage medium mentioned above can be read-only memory, disk or CD etc..
The foregoing is merely presently preferred embodiments of the present invention, is not intended to limit the invention, it is all in spirit of the invention and Within principle, any modification, equivalent replacement, improvement and so on be should all be included in the protection scope of the present invention.

Claims (10)

1. A diagnosis standardization method based on word vectors, characterized in that the method comprises the following steps:
S1: obtaining a diagnosis sentence to be processed, and performing word segmentation on the diagnosis sentence to be processed to obtain a word segmentation result;
S2: establishing the word vectors, word information content and part-of-speech tagging result of the diagnosis sentence to be processed according to the word segmentation result and the corresponding pre-built model;
S3: calculating, according to the word vectors, word information content and part-of-speech tagging result, the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in a standard library;
S4: selecting the standard diagnosis sentence most similar to the diagnosis sentence to be processed as the standardization result of the diagnosis sentence.
2. The diagnosis standardization method based on word vectors according to claim 1, characterized in that, before establishing the word vectors, word information content and part-of-speech tagging result of the diagnosis sentence to be processed, the method further comprises:
converting abbreviations in the word segmentation result into the corresponding standard words according to a pre-built abbreviation dictionary.
3. The diagnosis standardization method based on word vectors according to claim 1 or 2, characterized in that step S3 specifically comprises:
S3.1: calculating, according to the word vectors and the part-of-speech tagging result, the word similarity between the word segmentation result and each relevant standard diagnosis sentence in the standard library;
S3.2: calculating, according to the word similarity and the word information content, the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in the standard library.
4. The diagnosis standardization method based on word vectors according to claim 1 or 2, characterized in that step S4 specifically comprises:
sorting the similarities between the diagnosis sentence to be processed and the relevant standard diagnosis sentences in the standard library, and selecting, according to the sorting result, the standard diagnosis sentence most similar to the diagnosis sentence to be processed as the standardization result of the diagnosis sentence.
5. The diagnosis standardization method based on word vectors according to claim 1 or 2, characterized in that the method further comprises:
S5: using corrected diagnosis-standard pairs as training corpus to perform supplementary training on the corresponding pre-built model, evaluating the accuracy of the trained model, and replacing the original model with a model that passes the evaluation.
6. A diagnosis standardization device based on word vectors, characterized in that the device comprises:
a word segmentation module, configured to obtain a diagnosis sentence to be processed and perform word segmentation on the diagnosis sentence to be processed to obtain a word segmentation result;
a construction module, configured to establish the word vectors, word information content and part-of-speech tagging result of the diagnosis sentence to be processed according to the word segmentation result and the corresponding pre-built model;
a computing module, configured to calculate, according to the word vectors, word information content and part-of-speech tagging result, the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in a standard library;
an output module, configured to select the standard diagnosis sentence most similar to the diagnosis sentence to be processed as the standardization result of the diagnosis sentence.
7. The diagnosis standardization device based on word vectors according to claim 6, characterized in that the construction module further comprises:
a replacement unit, configured to convert abbreviations in the word segmentation result into the corresponding standard words according to a pre-built abbreviation dictionary.
8. The diagnosis standardization device based on word vectors according to claim 6 or 7, characterized in that the computing module comprises:
a word similarity computing unit, configured to calculate, according to the word vectors and the part-of-speech tagging result, the word similarity between the word segmentation result and each relevant standard diagnosis sentence in the standard library;
a sentence similarity computing unit, configured to calculate, according to the word similarity and the word information content, the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in the standard library.
9. The diagnosis standardization device based on word vectors according to claim 6 or 7, characterized in that the output module comprises:
a sorting unit, configured to sort the similarities between the diagnosis sentence to be processed and the relevant standard diagnosis sentences in the standard library;
a selection unit, configured to select, according to the sorting result, the standard diagnosis sentence most similar to the diagnosis sentence to be processed as the standardization result of the diagnosis sentence.
10. The diagnosis standardization device based on word vectors according to claim 6 or 7, characterized in that the device further comprises:
an update module, configured to use corrected diagnosis-standard pairs as training corpus to perform supplementary training on the corresponding pre-built model, evaluate the accuracy of the trained model, and replace the original model with a model that passes the evaluation.
CN201811551703.8A 2018-12-18 2018-12-18 A kind of diagnostic standardization method and device based on term vector Pending CN109697286A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201811551703.8A CN109697286A (en) 2018-12-18 2018-12-18 A kind of diagnostic standardization method and device based on term vector
PCT/CN2019/080416 WO2020124856A1 (en) 2018-12-18 2019-03-29 Diagnosis standardization method and device based on word vectors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811551703.8A CN109697286A (en) 2018-12-18 2018-12-18 A kind of diagnostic standardization method and device based on term vector

Publications (1)

Publication Number Publication Date
CN109697286A true CN109697286A (en) 2019-04-30

Family

ID=66232696

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811551703.8A Pending CN109697286A (en) 2018-12-18 2018-12-18 A kind of diagnostic standardization method and device based on term vector

Country Status (2)

Country Link
CN (1) CN109697286A (en)
WO (1) WO2020124856A1 (en)

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110209793A (en) * 2019-06-18 2019-09-06 佰聆数据股份有限公司 A method of for intelligent recognition text semantic
CN110457689A (en) * 2019-07-26 2019-11-15 科大讯飞(苏州)科技有限公司 Semantic processes method and relevant apparatus
CN110767296A (en) * 2019-10-09 2020-02-07 北京雅丁信息技术有限公司 Operation coding method based on semantic similarity
CN110797101A (en) * 2019-10-28 2020-02-14 腾讯医疗健康(深圳)有限公司 Medical data processing method, device, readable storage medium and computer equipment
CN111292814A (en) * 2019-12-26 2020-06-16 北京亚信数据有限公司 Medical data standardization method and device
CN111383769A (en) * 2020-01-08 2020-07-07 科大讯飞股份有限公司 Method, device, equipment and storage medium for detecting complaint and diagnosis consistency
CN111428477A (en) * 2020-03-06 2020-07-17 安徽科大讯飞医疗信息技术有限公司 Diagnostic name standardization method, device, electronic equipment and storage medium
CN111599463A (en) * 2020-05-09 2020-08-28 吾征智能技术(北京)有限公司 Intelligent auxiliary diagnosis system based on sound cognition model
CN111627512A (en) * 2020-05-29 2020-09-04 北京大恒普信医疗技术有限公司 Recommendation method and device for similar medical records, electronic equipment and storage medium
CN111710409A (en) * 2020-05-29 2020-09-25 吾征智能技术(北京)有限公司 Intelligent screening system based on abnormal change of human sweat
CN112022140A (en) * 2020-07-03 2020-12-04 上海数创医疗科技有限公司 Automatic diagnosis method and system for diagnosis conclusion of electrocardiogram
CN113420541A (en) * 2021-07-16 2021-09-21 四川医枢科技有限责任公司 Information processing method, device, equipment and storage medium
CN113593661A (en) * 2021-07-07 2021-11-02 青岛国新健康产业科技有限公司 Clinical term standardization method, device, electronic equipment and storage medium
CN114548115A (en) * 2022-02-23 2022-05-27 北京三快在线科技有限公司 Method and device for explaining compound nouns and electronic equipment
CN117034911A (en) * 2023-09-28 2023-11-10 通用技术集团健康数字科技(北京)有限公司 Correction method and device for hospital diagnosis dictionary, server and storage medium
CN117196856A (en) * 2023-08-11 2023-12-08 中国银行保险信息技术管理有限公司 Processing method and device of claim information, storage medium and computer equipment
CN117275752A (en) * 2023-11-20 2023-12-22 中国人民解放军总医院 Case clustering analysis method and system based on machine learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130103382A1 (en) * 2011-10-19 2013-04-25 Electronics And Telecommunications Research Institute Method and apparatus for searching similar sentences
CN105095188A (en) * 2015-08-14 2015-11-25 北京京东尚科信息技术有限公司 Sentence similarity computing method and device
CN105653840A (en) * 2015-12-21 2016-06-08 青岛中科慧康科技有限公司 Similar case recommendation system based on word and phrase distributed representation, and corresponding method
CN108509415A (en) * 2018-03-16 2018-09-07 南京云问网络技术有限公司 A kind of sentence similarity computational methods based on word order weighting
CN108763477A (en) * 2018-05-29 2018-11-06 厦门快商通信息技术有限公司 A kind of short text classification method and system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102169495B (en) * 2011-04-11 2014-04-02 趣拿开曼群岛有限公司 Industry dictionary generating method and device
US20160110325A1 (en) * 2014-10-20 2016-04-21 Materials Engineering And Packaging, Llc Method of Sharing Radiation Therapy Information to Non-Radiation Therapy Practitioners
CN106682411B (en) * 2016-12-22 2019-04-16 浙江大学 A method of disease label is converted by physical examination diagnostic data
CN106897568A (en) * 2017-02-28 2017-06-27 北京大数医达科技有限公司 The treating method and apparatus of case history structuring
CN106933806A (en) * 2017-03-15 2017-07-07 北京大数医达科技有限公司 The determination method and apparatus of medical synonym

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
王品: "信息检索中的句子相似度计算", 《计算机工程》 *
王文辉: "基于相似度算法的英语智能问答系统设计与实现", 《计算机应用与软件》 *

Cited By (25)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110209793A (en) * 2019-06-18 2019-09-06 佰聆数据股份有限公司 A method of for intelligent recognition text semantic
CN110457689A (en) * 2019-07-26 2019-11-15 科大讯飞(苏州)科技有限公司 Semantic processes method and relevant apparatus
CN110767296A (en) * 2019-10-09 2020-02-07 北京雅丁信息技术有限公司 Operation coding method based on semantic similarity
CN110797101A (en) * 2019-10-28 2020-02-14 腾讯医疗健康(深圳)有限公司 Medical data processing method, device, readable storage medium and computer equipment
CN110797101B (en) * 2019-10-28 2023-11-03 腾讯医疗健康(深圳)有限公司 Medical data processing method, medical data processing device, readable storage medium and computer equipment
CN111292814A (en) * 2019-12-26 2020-06-16 北京亚信数据有限公司 Medical data standardization method and device
CN111383769B (en) * 2020-01-08 2024-04-12 科大讯飞股份有限公司 Method, device, equipment and storage medium for detecting consistency of complaints and diagnoses
CN111383769A (en) * 2020-01-08 2020-07-07 科大讯飞股份有限公司 Method, device, equipment and storage medium for detecting complaint and diagnosis consistency
CN111428477A (en) * 2020-03-06 2020-07-17 安徽科大讯飞医疗信息技术有限公司 Diagnostic name standardization method, device, electronic equipment and storage medium
CN111428477B (en) * 2020-03-06 2023-10-17 讯飞医疗科技股份有限公司 Diagnostic name standardization method, device, electronic equipment and storage medium
CN111599463A (en) * 2020-05-09 2020-08-28 吾征智能技术(北京)有限公司 Intelligent auxiliary diagnosis system based on sound cognition model
CN111710409A (en) * 2020-05-29 2020-09-25 吾征智能技术(北京)有限公司 Intelligent screening system based on abnormal change of human sweat
CN111627512A (en) * 2020-05-29 2020-09-04 北京大恒普信医疗技术有限公司 Recommendation method and device for similar medical records, electronic equipment and storage medium
CN112022140A (en) * 2020-07-03 2020-12-04 上海数创医疗科技有限公司 Automatic diagnosis method and system for diagnosis conclusion of electrocardiogram
CN112022140B (en) * 2020-07-03 2023-02-17 上海数创医疗科技有限公司 Automatic diagnosis method and system for diagnosis conclusion of electrocardiogram
CN113593661A (en) * 2021-07-07 2021-11-02 青岛国新健康产业科技有限公司 Clinical term standardization method, device, electronic equipment and storage medium
CN113420541A (en) * 2021-07-16 2021-09-21 四川医枢科技有限责任公司 Information processing method, device, equipment and storage medium
CN114548115B (en) * 2022-02-23 2023-01-06 北京三快在线科技有限公司 Method and device for explaining compound nouns and electronic equipment
CN114548115A (en) * 2022-02-23 2022-05-27 北京三快在线科技有限公司 Method and device for explaining compound nouns and electronic equipment
CN117196856A (en) * 2023-08-11 2023-12-08 中国银行保险信息技术管理有限公司 Processing method and device of claim information, storage medium and computer equipment
CN117196856B (en) * 2023-08-11 2024-06-25 中国银行保险信息技术管理有限公司 Processing method and device of claim information, storage medium and computer equipment
CN117034911A (en) * 2023-09-28 2023-11-10 通用技术集团健康数字科技(北京)有限公司 Correction method and device for hospital diagnosis dictionary, server and storage medium
CN117034911B (en) * 2023-09-28 2023-12-22 通用技术集团健康数字科技(北京)有限公司 Correction method and device for hospital diagnosis dictionary, server and storage medium
CN117275752A (en) * 2023-11-20 2023-12-22 中国人民解放军总医院 Case clustering analysis method and system based on machine learning
CN117275752B (en) * 2023-11-20 2024-03-22 中国人民解放军总医院 Case clustering analysis method and system based on machine learning

Also Published As

Publication number Publication date
WO2020124856A1 (en) 2020-06-25

Similar Documents

Publication Publication Date Title
CN109697286A (en) A kind of diagnostic standardization method and device based on term vector
CN106874643B (en) Method and system for automatically constructing knowledge base to realize auxiliary diagnosis and treatment based on word vectors
CN110021439A (en) Medical data classification method, device and computer equipment based on machine learning
JP3856778B2 (en) Document classification apparatus and document classification method for multiple languages
US10540442B2 (en) Evaluating temporal relevance in question answering
US20180349560A1 (en) Monitoring the use of language of a patient for identifying potential speech and related neurological disorders
CN105975458B (en) A kind of Chinese long sentence similarity calculating method based on fine granularity dependence
CN103119584B (en) Machine translation evaluation device and method
CN104731774B (en) Towards the personalized interpretation method and device of general machine translation engine
CN113724848A (en) Medical resource recommendation method, device, server and medium based on artificial intelligence
CN110032631B (en) Information feedback method, device and storage medium
CN106897384B (en) Method and device for automatically evaluating key points
CN112541056A (en) Medical term standardization method, device, electronic equipment and storage medium
US11663518B2 (en) Cognitive system virtual corpus training and utilization
CN115858886B (en) Data processing method, device, equipment and readable storage medium
CN108735198B (en) Phoneme synthesizing method, device and electronic equipment based on medical conditions data
CN114548321A (en) Self-supervision public opinion comment viewpoint object classification method based on comparative learning
CN111339285B (en) BP neural network-based enterprise resume screening method and system
Whitney Bootstrapping via graph propagation
CN113657109A (en) Method, apparatus and computer device for standardization of model-based clinical terminology
CN111553140A (en) Data processing method, data processing apparatus, and computer storage medium
CN111128388A (en) Value domain data matching method and device and related products
CN113065355B (en) Professional encyclopedia named entity identification method, system and electronic equipment
CN107122582A (en) Towards the diagnosis and treatment class entity recognition method and device of multi-data source
CN116757195B (en) Implicit emotion recognition method based on prompt learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code (Ref country code: HK; Ref legal event code: DE; Ref document number: 40005490; Country of ref document: HK)
RJ01 Rejection of invention patent application after publication (Application publication date: 20190430)