CN109697286A - Diagnosis standardization method and device based on word vectors
- Publication number
- CN109697286A (application CN201811551703.8A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- diagnosis
- word
- processed
- term vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a diagnosis standardization method and device based on word vectors. The method comprises: S1: obtaining a diagnosis sentence to be processed and performing word segmentation on it to obtain a segmentation result; S2: establishing, according to the segmentation result and the corresponding pre-built models, the word vectors, word information content and part-of-speech tagging result of the diagnosis sentence to be processed; S3: calculating, according to the word vectors, word information content and part-of-speech tagging result, the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in the standard library; S4: selecting the standard diagnosis sentence most similar to the diagnosis sentence to be processed as its standardization result. By calculating the semantic similarity between the diagnosis sentence to be processed and the standard diagnosis sentences closest to the current diagnosis, selecting the standard diagnosis sentence with the highest similarity as the standardization result, and periodically updating the corresponding models, the invention improves the accuracy of diagnosis standardization.
Description
Technical field
The present invention relates to the technical field of data processing, and in particular to a diagnosis standardization method and device based on word vectors.
Background
In insurance claim settlement and medical research statistics, standardizing disease diagnoses is of great importance. Current disease standards include ICD-10 (International Classification of Diseases), published by an international authority. However, on the one hand, different hospitals have different ICD-10 diagnostic standards; on the other hand, doctors' handwritten diagnoses deviate from the standard diagnoses. How to uniformly standardize the non-standard diagnoses of different hospitals is a problem of great practical significance.
Current diagnosis standardization methods have the following problems:
1. They rely entirely on unsupervised natural language processing and make no use of existing annotated resources, so their accuracy is low;
2. They rely on manual revision and annotation. Because the same ICD-10 disease goes by different names and ICD-10 is revised frequently, this consumes substantial human resources.
Summary of the Invention
To solve the problems in the prior art, embodiments of the present invention provide a diagnosis standardization method and device based on word vectors. They address prior-art problems such as the heavy consumption of human resources caused by reliance on manual revision and annotation, and the low accuracy caused by relying entirely on unsupervised natural language processing without using existing annotated resources.
To solve the above technical problems, the present invention adopts the following technical solutions:
In one aspect, a diagnosis standardization method based on word vectors is provided, the method comprising the following steps:
S1: obtaining a diagnosis sentence to be processed, and performing word segmentation on the diagnosis sentence to be processed to obtain a segmentation result;
S2: establishing, according to the segmentation result and the corresponding pre-built models, the word vectors, word information content and part-of-speech tagging result of the diagnosis sentence to be processed;
S3: calculating, according to the word vectors, word information content and part-of-speech tagging result, the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in the standard library;
S4: selecting the standard diagnosis sentence most similar to the diagnosis sentence to be processed as the standardization result of the diagnosis sentence.
Further, before establishing the word vectors, word information content and part-of-speech tagging result of the diagnosis sentence to be processed, the method further comprises:
converting the abbreviations in the segmentation result into the corresponding standard words according to a pre-built abbreviation dictionary.
Further, step S3 specifically comprises:
S3.1: calculating, according to the word vectors and the part-of-speech tagging result, the word similarity between the segmentation result and the relevant standard diagnosis sentences in the standard library;
S3.2: calculating, according to the word similarity and the word information content, the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in the standard library.
Further, step S4 specifically comprises:
sorting the similarities between the diagnosis sentence to be processed and the relevant standard diagnosis sentences in the standard library, and, according to the sorting result, selecting the standard diagnosis sentence most similar to the diagnosis sentence to be processed as the standardization result of the diagnosis sentence.
Further, the method also comprises:
S5: using corrected diagnosis-standard pairs as training corpus, performing supplementary training on the corresponding pre-built models, evaluating the accuracy of the trained models, and replacing the original models with the models that pass the evaluation.
In another aspect, a diagnosis standardization device based on word vectors is provided, the device comprising:
a word segmentation module, configured to obtain a diagnosis sentence to be processed and perform word segmentation on it to obtain a segmentation result;
a construction module, configured to establish, according to the segmentation result and the corresponding pre-built models, the word vectors, word information content and part-of-speech tagging result of the diagnosis sentence to be processed;
a calculation module, configured to calculate, according to the word vectors, word information content and part-of-speech tagging result, the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in the standard library;
an output module, configured to select the standard diagnosis sentence most similar to the diagnosis sentence to be processed as the standardization result of the diagnosis sentence.
Further, the construction module also comprises:
a replacement unit, configured to convert the abbreviations in the segmentation result into the corresponding standard words according to a pre-built abbreviation dictionary.
Further, the calculation module comprises:
a word similarity calculation unit, configured to calculate, according to the word vectors and the part-of-speech tagging result, the word similarity between the segmentation result and the relevant standard diagnosis sentences in the standard library;
a sentence similarity calculation unit, configured to calculate, according to the word similarity and the word information content, the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in the standard library.
Further, the output module comprises:
a sorting unit, configured to sort the similarities between the diagnosis sentence to be processed and the relevant standard diagnosis sentences in the standard library;
a selection unit, configured to select, according to the sorting result, the standard diagnosis sentence most similar to the diagnosis sentence to be processed as the standardization result of the diagnosis sentence.
Further, the device also comprises:
an update module, configured to use corrected diagnosis-standard pairs as training corpus, perform supplementary training on the corresponding pre-built models, evaluate the accuracy of the trained models, and replace the original models with the models that pass the evaluation.
The technical solutions provided by embodiments of the present invention have the following beneficial effects:
1. The diagnosis standardization method and device based on word vectors provided by the invention calculate the semantic similarity between the diagnosis sentence to be processed and the standard diagnosis sentences closest to the current diagnosis, select the standard diagnosis sentence with the highest similarity to the diagnosis sentence to be processed as the standardization result, and periodically update the corresponding models, which improves the accuracy of diagnosis standardization;
2. The diagnosis standardization method and device based on word vectors provided by the invention standardize diagnoses automatically through the corresponding models, which greatly reduces the manpower consumed by diagnosis standardization and provides a complete calculation method and process for subsequent precise standardization.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of the diagnosis standardization method based on word vectors according to an exemplary embodiment;
Fig. 2 is a flow chart, according to an exemplary embodiment, of calculating the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in the standard library according to the word vectors, word information content and part-of-speech tagging result;
Fig. 3 is a structural diagram of the diagnosis standardization device based on word vectors according to an exemplary embodiment.
Detailed Description of the Embodiments
To make the objectives, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
The diagnosis standardization method and device based on word vectors provided by embodiments of the invention obtain, by calculation, the standard disease closest to the current diagnosis (i.e., the diagnosis sentence to be processed), specifically by calculating the semantic distance of each diagnosis-disease pair. The diagnosis-disease semantic distance is a semantic similarity measure; it is computed from the information increment of the diagnosis and disease sentences taken singly and of the diagnosis-disease sentence pair.
Fig. 1 is a flow chart of the diagnosis standardization method based on word vectors according to an exemplary embodiment. Referring to Fig. 1, the method comprises the following steps:
S1: obtaining a diagnosis sentence to be processed, and performing word segmentation on the diagnosis sentence to be processed to obtain a segmentation result.
Specifically, in this embodiment, segmentation uses the jieba segmenter in search-engine mode with HMM disabled. On the one hand, search-engine mode produces a finer-grained segmentation result, which facilitates similarity calculation with the word vector model. On the other hand, disabling HMM ensures that the same sentence always yields the same segmentation result, which avoids unnecessary similarity calculations and improves speed and accuracy.
S2: establishing, according to the segmentation result and the corresponding pre-built models, the word vectors, word information content and part-of-speech tagging result of the diagnosis sentence to be processed.
Specifically, the pre-built models include a word vector model, a word information content model and a part-of-speech tagging model. The segmentation result is input into the word vector model to obtain the word vectors of the diagnosis sentence to be processed, into the word information content model to obtain its word information content, and into the part-of-speech tagging model to obtain its part-of-speech tagging result.
Before the word vector model and the word information content model are built, a training corpus and a test corpus must be prepared. In this embodiment, the data in the ICD-10 (International Classification of Diseases) standard library and a medical corpus are obtained and randomly split into a training corpus and a test corpus. The medical corpus can be obtained from sources such as medical diagnoses, online medical forums and medical e-books. It needs to have a certain degree of relevance to medicine, which improves the accuracy of the models. If such a corpus cannot be obtained, a general corpus may be used instead, at the cost of some model accuracy.
The word vector model and the word information content model are built first. The sentences in the training corpus are then segmented, and the segmentation results are input into the word vector model or the word information content model to train it until it converges. After training, the sentences in the test corpus are segmented and the segmentation results are input into the trained word vector model or word information content model to test it and verify its accuracy.
In this embodiment, the word vector model is a skip-gram model, a neural-network-based word vector model. It maps words into a high-dimensional space according to their arrangement and order of appearance in documents, so that the similarity between words can be obtained by calculating the cosine distance in that space. The data in the ICD-10 (International Classification of Diseases) standard library and the medical corpus are segmented and used as training data for the skip-gram model, yielding a word vector model for the medical domain.
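As a rough illustration of what a skip-gram model is trained on, the sketch below generates (center word, context word) pairs from a segmented sentence. The function name `skipgram_pairs` and the window size are assumptions made for illustration; the source does not state the window size or the training library used.

```python
from typing import List, Tuple


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Return (center, context) pairs for tokens at most `window` positions apart."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own context
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical segmented sentence.
tokens = ["upper", "respiratory", "tract", "infection"]
pairs = skipgram_pairs(tokens, window=1)
```

A skip-gram model learns vectors by predicting the context word of each pair from its center word, so words that occur in similar contexts end up close together in the vector space.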
Word information content is obtained by calculating IDF values over the related corpus. IDF stands for inverse document frequency, where N denotes the number of documents in the document collection and n_k denotes the number of documents in which term k appears. The fewer documents contain a term, the larger its IDF, which indicates that the term discriminates well between classes and carries more information. In this embodiment, IDF is used as the weight that links word similarity to sentence similarity. The formula for word information content is as follows:
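The word information content can be sketched as follows, assuming the common form IDF(k) = log(N / n_k) built from the N and n_k defined above. The source's exact formula (for instance, whether it applies smoothing) is not known, and the name `idf_table` is hypothetical.

```python
import math
from typing import Dict, List


def idf_table(documents: List[List[str]]) -> Dict[str, float]:
    """Compute IDF(k) = log(N / n_k) for every term k in the corpus."""
    n_docs = len(documents)  # N: number of documents
    doc_freq: Dict[str, int] = {}  # n_k: number of documents containing term k
    for doc in documents:
        for term in set(doc):  # count each document once per term
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / n_k) for term, n_k in doc_freq.items()}


# Hypothetical segmented corpus.
corpus = [
    ["root bone", "fracture"],
    ["rib", "fracture"],
    ["upper", "respiratory", "tract", "infection"],
]
idf = idf_table(corpus)
```

In this toy corpus "rib" appears in one document and "fracture" in two, so "rib" receives the larger IDF, in line with the statement that rarer terms carry more information.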
Part-of-speech tagging uses the Stanford POS tagger model; the tagged parts of speech include noun, verb, predicate and so on. When calculating similarity, the weight of a word in the calculation can be chosen according to its part of speech. For example, linking words such as "is" and "be" are given smaller weights, while nouns are generally given larger weights.
S3: calculating, according to the word vectors, word information content and part-of-speech tagging result, the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in the standard library.
Specifically, as a preferred embodiment, before this similarity is calculated, the relevant standard diagnosis sentences in the standard library are first segmented to obtain their segmentation results. The corresponding word vectors, word information content and part-of-speech tagging results are then established from these segmentation results. Next, the word vectors, word information content and part-of-speech tagging results of the standard diagnosis sentences are combined with those of the diagnosis sentence to be processed to calculate the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in the standard library. Specifically, the similarity between words can be obtained by calculating the cosine distance in the vector space. The standard library in this embodiment includes the ICD-10 (International Classification of Diseases) standard library, among others. Before the similarity is calculated, the standard diagnosis sentences relevant to the diagnosis sentence to be processed are first preliminarily screened out of the standard library, for example by keyword retrieval.
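The preliminary keyword screening might look like the sketch below, which keeps only the standard sentences that share at least one token with the query. The function name, the library layout (sentence mapped to its tokens) and the "at least one shared token" rule are assumptions; the source only says that screening is done by keyword retrieval.

```python
from typing import Dict, List


def screen_candidates(query_tokens: List[str],
                      standard_library: Dict[str, List[str]]) -> List[str]:
    """Keep standard sentences that share at least one token with the query."""
    query = set(query_tokens)
    return [sentence for sentence, tokens in standard_library.items()
            if query & set(tokens)]


# Hypothetical standard library: sentence -> segmented tokens.
library = {
    "root bone fracture": ["root bone", "fracture"],
    "rib fracture": ["rib", "fracture"],
    "upper respiratory tract infection": ["upper", "respiratory", "tract", "infection"],
}
candidates = screen_candidates(["right", "root bone", "fracture"], library)
```

Screening first keeps the more expensive similarity calculation limited to a small set of plausible standard sentences.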
As a preferred embodiment, before the word vectors, word information content and part-of-speech tagging result of the diagnosis sentence to be processed are established, the method further comprises:
converting the abbreviations in the segmentation result into the corresponding standard words according to a pre-built abbreviation dictionary.
Specifically, common colloquial short forms can be mapped to the standard wording by lookup and replacement, for example "upper sense" (a colloquial short form) -> "upper respiratory tract infection". In this embodiment an abbreviation dictionary is built in advance. Before similarity is calculated, the abbreviation dictionary is consulted, and each abbreviated word in the segmentation result is converted into the corresponding standard word found in the dictionary.
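The lookup-and-replace step can be sketched with a plain dictionary. The dictionary below is hypothetical and contains only the example from the text; the function name is also an assumption.

```python
from typing import Dict, List

# Hypothetical abbreviation dictionary; the single entry is the example from the text.
ABBREVIATIONS: Dict[str, str] = {"upper sense": "upper respiratory tract infection"}


def expand_abbreviations(tokens: List[str],
                         abbreviations: Dict[str, str]) -> List[str]:
    """Replace each token found in the dictionary with its standard word."""
    return [abbreviations.get(token, token) for token in tokens]


expanded = expand_abbreviations(["acute", "upper sense"], ABBREVIATIONS)
```

Tokens that are not in the dictionary pass through unchanged, so the step is safe to apply to every segmentation result.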
S4: selecting the standard diagnosis sentence most similar to the diagnosis sentence to be processed as the standardization result of the diagnosis sentence.
As a preferred embodiment, step S4 specifically comprises:
sorting the similarities between the diagnosis sentence to be processed and the relevant standard diagnosis sentences in the standard library, and, according to the sorting result, selecting the standard diagnosis sentence most similar to the diagnosis sentence to be processed as the standardization result of the diagnosis sentence.
Specifically, in the normal case the standard diagnosis sentence with the greatest similarity to the diagnosis sentence to be processed is chosen as the standardization result. However, when the diagnosis sentence to be processed is close to several standard diagnoses, the similarities can be nearly or exactly equal. When the similarities differ by less than a certain threshold, a specified ranking algorithm is used to select the standard disease. That is, based on the similarity scores: when the gap between scores is large, the string (i.e., standard diagnosis sentence) with the highest similarity score that meets the threshold is chosen directly; when the gap is small, the candidates are ranked by the following formula and the most similar string (i.e., standard diagnosis sentence) is chosen:
$\mathrm{Idx}(T) = \alpha \log n(T) - (1 - \alpha)\,\mathrm{len}(T)$
where n(T) is the frequency based on historical statistics, len(T) is the sentence length, and α is an adjustment parameter.
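The selection rule can be sketched as below. Several details are assumptions, not stated in the source: the natural logarithm, sentence length measured in characters, a frequency of at least 1, and the reading that two candidates are "close" when their similarity gap is below `gap_threshold`. The function names are hypothetical.

```python
import math
from typing import List, Tuple


def idx(frequency: int, length: int, alpha: float) -> float:
    """Idx(T) = alpha * log n(T) - (1 - alpha) * len(T)."""
    return alpha * math.log(frequency) - (1 - alpha) * length


def choose_standard(candidates: List[Tuple[str, float, int]],
                    gap_threshold: float, alpha: float) -> str:
    """candidates: non-empty list of (sentence, similarity, historical frequency)."""
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    best = ranked[0]
    # Candidates whose similarity is within gap_threshold of the best are near-ties.
    close = [c for c in ranked if best[1] - c[1] < gap_threshold]
    if len(close) == 1:
        return best[0]  # large gap: take the most similar sentence directly
    # Small gap: prefer frequent, short standard sentences according to Idx(T).
    return max(close, key=lambda c: idx(c[2], len(c[0]), alpha))[0]


# Hypothetical candidates: (standard sentence, similarity, historical frequency).
candidates = [
    ("fracture of root bone, unspecified", 0.91, 15),
    ("root bone fracture", 0.90, 120),
    ("rib fracture", 0.40, 300),
]
tie_break = choose_standard(candidates, gap_threshold=0.05, alpha=0.5)
direct = choose_standard(candidates, gap_threshold=0.005, alpha=0.5)
```

With the wider gap threshold the first two candidates count as near-ties and the shorter, more frequent sentence wins; with the narrower threshold the top-scoring sentence is returned directly.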
As a preferred embodiment, the method further comprises:
S5: using corrected diagnosis-standard pairs as training corpus, performing supplementary training on the corresponding pre-built models, evaluating the accuracy of the trained models, and replacing the original models with the models that pass the evaluation.
Specifically, in this embodiment the related models may be updated periodically at fixed intervals. For example, when the number of misclassified examples reaches a certain count, the update procedure is executed to obtain new models. During an update, the corrected diagnosis-standard pairs (i.e., diagnosis sentences to be processed and their standard diagnosis sentences) are used as training corpus. After they are segmented, the segmentation results are input into the corresponding pre-built models (including the word vector model, the word information content model, etc.) for supplementary training. In addition, before the retrained models are put into use, their accuracy must be evaluated, and the corresponding models are updated only if they meet the requirements. Updating the corresponding models regularly improves their accuracy and thereby the accuracy of diagnosis standardization.
After the related models are updated, the ranking is recalculated and, according to the statistical results, the formula of the ranking algorithm is updated. In a specific implementation this is done by updating the adjustment parameter α.
In addition, as a preferred embodiment, the IDF file may also be updated periodically, with the update cycle configured according to the specific needs of the user.
Fig. 2 is a flow chart, according to an exemplary embodiment, of calculating the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in the standard library according to the word vectors, word information content and part-of-speech tagging result. Referring to Fig. 2, it comprises the following steps:
S3.1: calculating, according to the word vectors and the part-of-speech tagging result, the word similarity between the segmentation result and the relevant standard diagnosis sentences in the standard library.
Specifically, there are usually two ways to calculate the similarity between words: semantic similarity and statistical similarity. Semantic similarity includes traditional methods based on a semantic tree structure, of which WordNet is representative. On the one hand, WordNet is a general semantic model with no special handling of medical data, and many medical terms have no corresponding entries in it. On the other hand, maintaining WordNet requires dedicated manual work, so the time cost is high. Therefore, in this embodiment, the data in the ICD-10 (International Classification of Diseases) standard library and the medical corpus are segmented and used as training data for a skip-gram model (the word vector model), yielding a word vector model for the medical domain. The word vector model maps words into a high-dimensional space according to their arrangement and order of appearance in documents. The cosine distance between two words is first calculated from their word vectors and then adjusted by a weight derived from the part-of-speech tagging result, giving the similarity between the words (the word similarity). When word vectors and part-of-speech tags are combined to calculate similarity, the weight of a word in the calculation can be adjusted according to its part of speech. For example, linking words such as "is" and "be" are given smaller weights, while nouns are generally given larger weights.
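A minimal sketch of the word similarity calculation follows. The weight table and the way the part-of-speech weight is combined with the cosine value (multiplying by the smaller of the two words' weights) are assumptions; the source only says that the cosine distance is adjusted by a weight based on part of speech.

```python
import math
from typing import Dict, Sequence

# Hypothetical part-of-speech weights: nouns count fully, linking words barely.
POS_WEIGHTS: Dict[str, float] = {"noun": 1.0, "verb": 0.8, "linking": 0.1}


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def word_similarity(u: Sequence[float], pos_u: str,
                    v: Sequence[float], pos_v: str,
                    pos_weights: Dict[str, float] = POS_WEIGHTS,
                    default_weight: float = 0.5) -> float:
    """Cosine similarity scaled by the smaller of the two part-of-speech weights."""
    weight = min(pos_weights.get(pos_u, default_weight),
                 pos_weights.get(pos_v, default_weight))
    return weight * cosine(u, v)


# Hypothetical 2-dimensional word vectors.
noun_pair = word_similarity([1.0, 0.0], "noun", [1.0, 0.0], "noun")
linking_pair = word_similarity([1.0, 0.0], "linking", [1.0, 0.0], "noun")
```

Two identical vectors tagged as nouns score 1.0, while the same vectors score only 0.1 when one of the words is a linking word, which reflects the weighting described above.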
S3.2: calculating, according to the word similarity and the word information content, the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in the standard library.
Specifically, the following example illustrates how this embodiment calculates sentence similarity:
Given two sentences, a diagnosis (the diagnosis sentence to be processed) $T_1$ and a disease (a standard diagnosis sentence) $T_2$, form a joint word set
$T = T_1 \cup T_2 = \{w_1, w_2, \ldots, w_m\}$
The joint word set T contains all the distinct words from $T_1$ and $T_2$. For example:
T1: right root bone avulsion fracture
T2: root bone fracture
Then T = {right root bone avulsion fracture fracture}
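Building the joint word set amounts to an order-preserving union of the two token lists. In the sketch below the token lists are a hypothetical segmentation of the example sentences, since the source does not show how the two sentences were actually split; the function name is also an assumption.

```python
from typing import List


def joint_word_set(t1: List[str], t2: List[str]) -> List[str]:
    """Return the distinct words of t1 and t2, in order of first appearance."""
    seen = set()
    joint = []
    for word in list(t1) + list(t2):
        if word not in seen:
            seen.add(word)
            joint.append(word)
    return joint


# Hypothetical segmentation of the two example sentences.
t1 = ["right", "root bone", "avulsion fracture"]
t2 = ["root bone", "fracture"]
joint = joint_word_set(t1, t2)
```

A list is used instead of a Python `set` so that every word keeps a fixed position, which the vectors built from the joint word set rely on.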
Because the joint word set is derived entirely from the two sentences, it carries no additional information. The joint word set T can be regarded as the complete semantic information of the two sentences, and the semantic information of each sentence can be expressed in terms of it. Specifically, the vector derived from the joint word set is called the lexical semantic vector and is written $S = \{s_1, s_2, \ldots, s_m\}$. Each entry of the lexical semantic vector corresponds to one word in the joint word set, so the dimension of the vector equals the number of words in the joint word set. The value of an entry is determined by the semantic similarity between the corresponding word and the words in the sentence. Taking $T_1$ as an example:
1. If $w_i$ appears in sentence $T_1$, then $s_i$ is set to 1, where $i \in [1, m]$;
2. If $T_1$ does not contain $w_i$, the word similarity calculation described above is used to compute the semantic similarity score between $w_i$ and each word in $T_1$. Let $w'$ be the word in $T_1$ most similar to $w_i$, with the highest similarity score θ. If θ exceeds a preset threshold, then $s_i = \theta$; otherwise $s_i = 0$, where $i \in [1, m]$.
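The two rules above can be sketched as follows. The word similarity function is passed in as a parameter; the toy lookup table, the token lists and the threshold value are all hypothetical and stand in for the word-vector-based similarity and the preset threshold.

```python
from typing import Callable, Dict, List, Tuple


def lexical_semantic_vector(joint: List[str], sentence: List[str],
                            similarity: Callable[[str, str], float],
                            threshold: float) -> List[float]:
    """Build the lexical semantic vector of `sentence` over the joint word set."""
    vector = []
    for word in joint:
        if word in sentence:
            vector.append(1.0)  # rule 1: the word occurs in the sentence
            continue
        # Rule 2: score of the most similar word in the sentence.
        theta = max((similarity(word, other) for other in sentence), default=0.0)
        vector.append(theta if theta > threshold else 0.0)
    return vector


# Hypothetical word similarity scores standing in for the word vector model.
TOY_SCORES: Dict[Tuple[str, str], float] = {
    ("avulsion fracture", "fracture"): 0.6,
    ("right", "root bone"): 0.1,
}


def toy_similarity(a: str, b: str) -> float:
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), 0.0))


joint = ["right", "root bone", "avulsion fracture", "fracture"]
t2 = ["root bone", "fracture"]
s2 = lexical_semantic_vector(joint, t2, toy_similarity, threshold=0.5)
```

Here "right" has no sufficiently similar word in the second sentence, so its entry is 0, while "avulsion fracture" takes the score of its closest word, "fracture".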
Note that the threshold is set because the medical field is stricter about semantic similarity. For example, "scald" and "burn" may be understood as near-synonyms in everyday usage, but in medicine they are different diseases.
In addition, different words contribute to the meaning of a sentence to different degrees, so a scheme is needed to weight each word. In this embodiment, the importance of a word is measured by its information content. The word information content $I(w_i)$ of each word is obtained by the method of step S2 above, giving the weighted lexical semantic vector
$sw_i = s_i \cdot I(w_i) \cdot I(w'_i)$
where $w_i$ is a word in the joint word set T and $w'_i$ is the word in $T_1$ associated with it ($w_i$ itself if it appears in $T_1$, otherwise its most similar word in $T_1$). The semantic similarity of sentences $T_1$ and $T_2$ can be calculated as the cosine distance between their weighted lexical semantic vectors, with the following formula:
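The final step can be sketched as below, assuming the usual cosine form, i.e. the dot product of the two weighted vectors divided by the product of their norms. All numbers are hypothetical, and the function names are illustrative only.

```python
import math
from typing import List, Sequence


def weighted_vector(s: Sequence[float], info_joint: Sequence[float],
                    info_matched: Sequence[float]) -> List[float]:
    """sw_i = s_i * I(w_i) * I(w'_i) for every position i of the joint word set."""
    return [s_i * a * b for s_i, a, b in zip(s, info_joint, info_matched)]


def sentence_similarity(sw1: Sequence[float], sw2: Sequence[float]) -> float:
    """Cosine of the angle between two weighted lexical semantic vectors."""
    dot = sum(a * b for a, b in zip(sw1, sw2))
    norm1 = math.sqrt(sum(a * a for a in sw1))
    norm2 = math.sqrt(sum(b * b for b in sw2))
    if norm1 == 0.0 or norm2 == 0.0:
        return 0.0
    return dot / (norm1 * norm2)


# Hypothetical information content of the four joint-set words.
info_joint = [1.0, 2.0, 1.5, 0.5]
# Hypothetical lexical semantic vectors and information content of the matched words.
sw1 = weighted_vector([1.0, 1.0, 1.0, 0.6], info_joint, [1.0, 2.0, 1.5, 1.5])
sw2 = weighted_vector([0.0, 1.0, 0.6, 1.0], info_joint, [0.0, 2.0, 0.5, 0.5])
similarity = sentence_similarity(sw1, sw2)
```

Because every entry is multiplied by two information content values, words that are rare in the corpus dominate the comparison, while common words contribute little.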
Fig. 3 is the structural schematic diagram of the diagnostic standardization device shown according to an exemplary embodiment based on term vector,
Referring to shown in Fig. 3, which includes:
Word segmentation module carries out word segmentation processing to the diagnosis sentence to be processed for obtaining diagnosis sentence to be processed,
Obtain word segmentation result;
It constructs module and establishes the diagnosis to be processed for the corresponding model according to the word segmentation result and pre- structure
Term vector, word information amount and the part-of-speech tagging result of sentence;
Computing module, for according to the term vector, word information amount and part-of-speech tagging as a result, calculating separately described wait locate
The similarity of relevant criterion diagnosis sentence in the diagnosis sentence and java standard library of reason;
Output module is examined described in the standard diagnostics sentence conduct most like with the diagnosis sentence to be processed for choosing
The standardization result of conclusion sentence.
As a kind of preferably embodiment, in the embodiment of the present invention, module is constructed further include:
Replacement unit is converted to the abbreviation in the word segmentation result corresponding for the abbreviation dictionary according to pre- structure
Standard words.
As a kind of preferably embodiment, in the embodiment of the present invention, computing module includes:
Word similarity computing unit, for according to the term vector and part-of-speech tagging as a result, calculating the word segmentation result
With the Word similarity of relevant criterion diagnosis sentence in the java standard library;
Sentence similarity calculated, for calculating separately the place according to the Word similarity and the word information amount
The similarity of relevant criterion diagnosis sentence in the diagnosis sentence of reason and the java standard library.
As a kind of preferably embodiment, in the embodiment of the present invention, output module includes:
Sequencing unit, for relevant criterion in the diagnosis sentence and java standard library to be processed to be diagnosed to the similarity of sentence
It is ranked up processing;
Selection unit, for according to ranking results, choosing the standard diagnostics most like with the diagnosis sentence to be processed
Standardization result of the sentence as the diagnosis sentence.
As a preferred implementation, in this embodiment of the present invention, the device further includes:
An update module, configured to use corrected diagnosis-standard pairs as training corpus to perform supplementary training on the pre-built corresponding models, evaluate the accuracy of the retrained models, and replace the original models with the retrained models that pass the evaluation.
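The evaluate-then-replace logic of the update module can be sketched as below. The text does not state what counts as passing the evaluation, so the fixed accuracy threshold is an assumption, as are the function names and the representation of a model as a callable that maps a diagnosis to a standard term.

```python
def accuracy(model, labelled_pairs):
    """Fraction of (diagnosis, standard) pairs that the model maps correctly."""
    if not labelled_pairs:
        return 0.0
    correct = sum(
        1 for diagnosis, standard in labelled_pairs if model(diagnosis) == standard
    )
    return correct / len(labelled_pairs)


def maybe_replace(current_model, retrained_model, held_out_pairs, threshold=0.9):
    """Return the retrained model only if it reaches the assumed accuracy threshold."""
    if accuracy(retrained_model, held_out_pairs) >= threshold:
        return retrained_model
    return current_model


# Toy stand-ins for the original and the retrained model (assumed data).
old_lookup = {"HTN": "hypertension"}
new_lookup = {"HTN": "hypertension", "T2DM": "type 2 diabetes mellitus"}
old_model = old_lookup.get
new_model = new_lookup.get
held_out = [("HTN", "hypertension"), ("T2DM", "type 2 diabetes mellitus")]

active_model = maybe_replace(old_model, new_model, held_out)
print(accuracy(active_model, held_out))
```

Evaluating on pairs that were not used for the supplementary training is a design choice of this sketch; it keeps a retrained model from being accepted merely because it memorized the corrections.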
In summary, the technical solutions provided by the embodiments of the present invention have the following beneficial effects:
1. The word-vector-based diagnosis standardization method and device provided by the present invention calculate the semantic similarity between the diagnosis sentence to be processed and the standard diagnosis sentences closest to the current diagnosis, select the standard diagnosis sentence with the highest similarity to the diagnosis sentence to be processed as the standardization result, and periodically update the corresponding models, which improves the accuracy of diagnosis standardization;
2. The word-vector-based diagnosis standardization method and device provided by the present invention perform diagnosis standardization automatically through the corresponding models, which greatly reduces the manual effort required for diagnosis standardization and provides a complete calculation method and process for subsequent precise standardization.
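The overall flow summarized above (segment, build features, score against the standard library, pick the best match) can be sketched end to end with pluggable components. The stand-ins used here, whitespace segmentation, token sets as features, and Jaccard overlap as the similarity, are deliberately simple placeholders and are not the word-vector models the patent describes; every name is hypothetical.

```python
def standardize(diagnosis, standard_library, segment, featurize, similarity):
    """Run steps S1 to S4 with caller-supplied components."""
    tokens = segment(diagnosis)                      # S1: word segmentation
    features = featurize(tokens)                     # S2: build features
    scored = [                                       # S3: score each candidate
        (standard, similarity(features, featurize(segment(standard))))
        for standard in standard_library
    ]
    if not scored:
        return None
    return max(scored, key=lambda pair: pair[1])[0]  # S4: most similar sentence


def whitespace_segment(text):
    """Placeholder segmenter: lowercase and split on whitespace."""
    return text.lower().split()


def jaccard(a, b):
    """Placeholder similarity: overlap of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


library = ["acute bronchitis", "chronic gastritis"]
print(standardize("Acute bronchitis infection", library, whitespace_segment, set, jaccard))
```

Passing the components in as arguments mirrors the modular division of the device: any one stage can be swapped for a trained model without touching the others.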
It should be noted that, when the word-vector-based diagnosis standardization device provided by the above embodiment triggers the diagnosis standardization service, the division into the above functional modules is only an example. In practical applications, the above functions may be assigned to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the word-vector-based diagnosis standardization device provided by the above embodiment and the word-vector-based diagnosis standardization method embodiment belong to the same concept; for the specific implementation process, see the method embodiment, which is not repeated here.
Those of ordinary skill in the art will understand that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disc, or the like.
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (10)
1. A diagnosis standardization method based on word vectors, characterized in that the method comprises the following steps:
S1: obtaining a diagnosis sentence to be processed, and performing word segmentation on the diagnosis sentence to be processed to obtain a word segmentation result;
S2: establishing, according to the word segmentation result and pre-built corresponding models, the word vectors, word information content, and part-of-speech tagging result of the diagnosis sentence to be processed;
S3: calculating, according to the word vectors, word information content, and part-of-speech tagging result, the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in the standard library;
S4: selecting the standard diagnosis sentence most similar to the diagnosis sentence to be processed as the standardization result of the diagnosis sentence.
2. The diagnosis standardization method based on word vectors according to claim 1, characterized in that, before establishing the word vectors, word information content, and part-of-speech tagging result of the diagnosis sentence to be processed, the method further comprises:
converting the abbreviations in the word segmentation result into the corresponding standard words according to a pre-built abbreviation dictionary.
3. The diagnosis standardization method based on word vectors according to claim 1 or 2, characterized in that step S3 specifically comprises:
S3.1: calculating, according to the word vectors and the part-of-speech tagging result, the word similarity between the word segmentation result and each relevant standard diagnosis sentence in the standard library;
S3.2: calculating, according to the word similarity and the word information content, the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in the standard library.
4. The diagnosis standardization method based on word vectors according to claim 1 or 2, characterized in that step S4 specifically comprises:
sorting the similarities between the diagnosis sentence to be processed and the relevant standard diagnosis sentences in the standard library, and selecting, according to the sorting result, the standard diagnosis sentence most similar to the diagnosis sentence to be processed as the standardization result of the diagnosis sentence.
5. The diagnosis standardization method based on word vectors according to claim 1 or 2, characterized in that the method further comprises:
S5: using corrected diagnosis-standard pairs as training corpus to perform supplementary training on the pre-built corresponding models, evaluating the accuracy of the retrained models, and replacing the original models with the retrained models that pass the evaluation.
6. A diagnosis standardization device based on word vectors, characterized in that the device comprises:
a word segmentation module, configured to obtain a diagnosis sentence to be processed and perform word segmentation on the diagnosis sentence to be processed to obtain a word segmentation result;
a construction module, configured to establish, according to the word segmentation result and pre-built corresponding models, the word vectors, word information content, and part-of-speech tagging result of the diagnosis sentence to be processed;
a computing module, configured to calculate, according to the word vectors, word information content, and part-of-speech tagging result, the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in the standard library;
an output module, configured to select the standard diagnosis sentence most similar to the diagnosis sentence to be processed as the standardization result of the diagnosis sentence.
7. The diagnosis standardization device based on word vectors according to claim 6, characterized in that the construction module further comprises:
a replacement unit, configured to convert the abbreviations in the word segmentation result into the corresponding standard words according to a pre-built abbreviation dictionary.
8. The diagnosis standardization device based on word vectors according to claim 6 or 7, characterized in that the computing module comprises:
a word similarity computing unit, configured to calculate, according to the word vectors and the part-of-speech tagging result, the word similarity between the word segmentation result and each relevant standard diagnosis sentence in the standard library;
a sentence similarity computing unit, configured to calculate, according to the word similarity and the word information content, the similarity between the diagnosis sentence to be processed and each relevant standard diagnosis sentence in the standard library.
9. The diagnosis standardization device based on word vectors according to claim 6 or 7, characterized in that the output module comprises:
a sorting unit, configured to sort the similarities between the diagnosis sentence to be processed and the relevant standard diagnosis sentences in the standard library;
a selection unit, configured to select, according to the sorting result, the standard diagnosis sentence most similar to the diagnosis sentence to be processed as the standardization result of the diagnosis sentence.
10. The diagnosis standardization device based on word vectors according to claim 6 or 7, characterized in that the device further comprises:
an update module, configured to use corrected diagnosis-standard pairs as training corpus to perform supplementary training on the pre-built corresponding models, evaluate the accuracy of the retrained models, and replace the original models with the retrained models that pass the evaluation.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811551703.8A CN109697286A (en) | 2018-12-18 | 2018-12-18 | A kind of diagnostic standardization method and device based on term vector |
PCT/CN2019/080416 WO2020124856A1 (en) | 2018-12-18 | 2019-03-29 | Diagnosis standardization method and device based on word vectors |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811551703.8A CN109697286A (en) | 2018-12-18 | 2018-12-18 | A kind of diagnostic standardization method and device based on term vector |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109697286A true CN109697286A (en) | 2019-04-30 |
Family
ID=66232696
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811551703.8A Pending CN109697286A (en) | 2018-12-18 | 2018-12-18 | A kind of diagnostic standardization method and device based on term vector |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109697286A (en) |
WO (1) | WO2020124856A1 (en) |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110209793A (en) * | 2019-06-18 | 2019-09-06 | 佰聆数据股份有限公司 | A method of for intelligent recognition text semantic |
CN110457689A (en) * | 2019-07-26 | 2019-11-15 | 科大讯飞(苏州)科技有限公司 | Semantic processes method and relevant apparatus |
CN110767296A (en) * | 2019-10-09 | 2020-02-07 | 北京雅丁信息技术有限公司 | Operation coding method based on semantic similarity |
CN110797101A (en) * | 2019-10-28 | 2020-02-14 | 腾讯医疗健康(深圳)有限公司 | Medical data processing method, device, readable storage medium and computer equipment |
CN111292814A (en) * | 2019-12-26 | 2020-06-16 | 北京亚信数据有限公司 | Medical data standardization method and device |
CN111383769A (en) * | 2020-01-08 | 2020-07-07 | 科大讯飞股份有限公司 | Method, device, equipment and storage medium for detecting complaint and diagnosis consistency |
CN111428477A (en) * | 2020-03-06 | 2020-07-17 | 安徽科大讯飞医疗信息技术有限公司 | Diagnostic name standardization method, device, electronic equipment and storage medium |
CN111599463A (en) * | 2020-05-09 | 2020-08-28 | 吾征智能技术(北京)有限公司 | Intelligent auxiliary diagnosis system based on sound cognition model |
CN111627512A (en) * | 2020-05-29 | 2020-09-04 | 北京大恒普信医疗技术有限公司 | Recommendation method and device for similar medical records, electronic equipment and storage medium |
CN111710409A (en) * | 2020-05-29 | 2020-09-25 | 吾征智能技术(北京)有限公司 | Intelligent screening system based on abnormal change of human sweat |
CN112022140A (en) * | 2020-07-03 | 2020-12-04 | 上海数创医疗科技有限公司 | Automatic diagnosis method and system for diagnosis conclusion of electrocardiogram |
CN113420541A (en) * | 2021-07-16 | 2021-09-21 | 四川医枢科技有限责任公司 | Information processing method, device, equipment and storage medium |
CN113593661A (en) * | 2021-07-07 | 2021-11-02 | 青岛国新健康产业科技有限公司 | Clinical term standardization method, device, electronic equipment and storage medium |
CN114548115A (en) * | 2022-02-23 | 2022-05-27 | 北京三快在线科技有限公司 | Method and device for explaining compound nouns and electronic equipment |
CN117034911A (en) * | 2023-09-28 | 2023-11-10 | 通用技术集团健康数字科技(北京)有限公司 | Correction method and device for hospital diagnosis dictionary, server and storage medium |
CN117196856A (en) * | 2023-08-11 | 2023-12-08 | 中国银行保险信息技术管理有限公司 | Processing method and device of claim information, storage medium and computer equipment |
CN117275752A (en) * | 2023-11-20 | 2023-12-22 | 中国人民解放军总医院 | Case clustering analysis method and system based on machine learning |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130103382A1 (en) * | 2011-10-19 | 2013-04-25 | Electronics And Telecommunications Research Institute | Method and apparatus for searching similar sentences |
CN105095188A (en) * | 2015-08-14 | 2015-11-25 | 北京京东尚科信息技术有限公司 | Sentence similarity computing method and device |
CN105653840A (en) * | 2015-12-21 | 2016-06-08 | 青岛中科慧康科技有限公司 | Similar case recommendation system based on word and phrase distributed representation, and corresponding method |
CN108509415A (en) * | 2018-03-16 | 2018-09-07 | 南京云问网络技术有限公司 | A kind of sentence similarity computational methods based on word order weighting |
CN108763477A (en) * | 2018-05-29 | 2018-11-06 | 厦门快商通信息技术有限公司 | A kind of short text classification method and system |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102169495B (en) * | 2011-04-11 | 2014-04-02 | 趣拿开曼群岛有限公司 | Industry dictionary generating method and device |
US20160110325A1 (en) * | 2014-10-20 | 2016-04-21 | Materials Engineering And Packaging, Llc | Method of Sharing Radiation Therapy Information to Non-Radiation Therapy Practitioners |
CN106682411B (en) * | 2016-12-22 | 2019-04-16 | 浙江大学 | A method of disease label is converted by physical examination diagnostic data |
CN106897568A (en) * | 2017-02-28 | 2017-06-27 | 北京大数医达科技有限公司 | The treating method and apparatus of case history structuring |
CN106933806A (en) * | 2017-03-15 | 2017-07-07 | 北京大数医达科技有限公司 | The determination method and apparatus of medical synonym |
2018
- 2018-12-18 CN CN201811551703.8A patent/CN109697286A/en active Pending
2019
- 2019-03-29 WO PCT/CN2019/080416 patent/WO2020124856A1/en active Application Filing
Non-Patent Citations (2)
Title |
---|
王品: "信息检索中的句子相似度计算", 《计算机工程》 * |
王文辉: "基于相似度算法的英语智能问答系统设计与实现", 《计算机应用与软件》 * |
Cited By (25)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110209793A (en) * | 2019-06-18 | 2019-09-06 | 佰聆数据股份有限公司 | A method of for intelligent recognition text semantic |
CN110457689A (en) * | 2019-07-26 | 2019-11-15 | 科大讯飞(苏州)科技有限公司 | Semantic processes method and relevant apparatus |
CN110767296A (en) * | 2019-10-09 | 2020-02-07 | 北京雅丁信息技术有限公司 | Operation coding method based on semantic similarity |
CN110797101A (en) * | 2019-10-28 | 2020-02-14 | 腾讯医疗健康(深圳)有限公司 | Medical data processing method, device, readable storage medium and computer equipment |
CN110797101B (en) * | 2019-10-28 | 2023-11-03 | 腾讯医疗健康(深圳)有限公司 | Medical data processing method, medical data processing device, readable storage medium and computer equipment |
CN111292814A (en) * | 2019-12-26 | 2020-06-16 | 北京亚信数据有限公司 | Medical data standardization method and device |
CN111383769B (en) * | 2020-01-08 | 2024-04-12 | 科大讯飞股份有限公司 | Method, device, equipment and storage medium for detecting consistency of complaints and diagnoses |
CN111383769A (en) * | 2020-01-08 | 2020-07-07 | 科大讯飞股份有限公司 | Method, device, equipment and storage medium for detecting complaint and diagnosis consistency |
CN111428477A (en) * | 2020-03-06 | 2020-07-17 | 安徽科大讯飞医疗信息技术有限公司 | Diagnostic name standardization method, device, electronic equipment and storage medium |
CN111428477B (en) * | 2020-03-06 | 2023-10-17 | 讯飞医疗科技股份有限公司 | Diagnostic name standardization method, device, electronic equipment and storage medium |
CN111599463A (en) * | 2020-05-09 | 2020-08-28 | 吾征智能技术(北京)有限公司 | Intelligent auxiliary diagnosis system based on sound cognition model |
CN111710409A (en) * | 2020-05-29 | 2020-09-25 | 吾征智能技术(北京)有限公司 | Intelligent screening system based on abnormal change of human sweat |
CN111627512A (en) * | 2020-05-29 | 2020-09-04 | 北京大恒普信医疗技术有限公司 | Recommendation method and device for similar medical records, electronic equipment and storage medium |
CN112022140A (en) * | 2020-07-03 | 2020-12-04 | 上海数创医疗科技有限公司 | Automatic diagnosis method and system for diagnosis conclusion of electrocardiogram |
CN112022140B (en) * | 2020-07-03 | 2023-02-17 | 上海数创医疗科技有限公司 | Automatic diagnosis method and system for diagnosis conclusion of electrocardiogram |
CN113593661A (en) * | 2021-07-07 | 2021-11-02 | 青岛国新健康产业科技有限公司 | Clinical term standardization method, device, electronic equipment and storage medium |
CN113420541A (en) * | 2021-07-16 | 2021-09-21 | 四川医枢科技有限责任公司 | Information processing method, device, equipment and storage medium |
CN114548115B (en) * | 2022-02-23 | 2023-01-06 | 北京三快在线科技有限公司 | Method and device for explaining compound nouns and electronic equipment |
CN114548115A (en) * | 2022-02-23 | 2022-05-27 | 北京三快在线科技有限公司 | Method and device for explaining compound nouns and electronic equipment |
CN117196856A (en) * | 2023-08-11 | 2023-12-08 | 中国银行保险信息技术管理有限公司 | Processing method and device of claim information, storage medium and computer equipment |
CN117196856B (en) * | 2023-08-11 | 2024-06-25 | 中国银行保险信息技术管理有限公司 | Processing method and device of claim information, storage medium and computer equipment |
CN117034911A (en) * | 2023-09-28 | 2023-11-10 | 通用技术集团健康数字科技(北京)有限公司 | Correction method and device for hospital diagnosis dictionary, server and storage medium |
CN117034911B (en) * | 2023-09-28 | 2023-12-22 | 通用技术集团健康数字科技(北京)有限公司 | Correction method and device for hospital diagnosis dictionary, server and storage medium |
CN117275752A (en) * | 2023-11-20 | 2023-12-22 | 中国人民解放军总医院 | Case clustering analysis method and system based on machine learning |
CN117275752B (en) * | 2023-11-20 | 2024-03-22 | 中国人民解放军总医院 | Case clustering analysis method and system based on machine learning |
Also Published As
Publication number | Publication date |
---|---|
WO2020124856A1 (en) | 2020-06-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109697286A (en) | A kind of diagnostic standardization method and device based on term vector | |
CN106874643B (en) | Method and system for automatically constructing knowledge base to realize auxiliary diagnosis and treatment based on word vectors | |
CN110021439A (en) | Medical data classification method, device and computer equipment based on machine learning | |
JP3856778B2 (en) | Document classification apparatus and document classification method for multiple languages | |
US10540442B2 (en) | Evaluating temporal relevance in question answering | |
US20180349560A1 (en) | Monitoring the use of language of a patient for identifying potential speech and related neurological disorders | |
CN105975458B (en) | A kind of Chinese long sentence similarity calculating method based on fine granularity dependence | |
CN103119584B (en) | Machine translation evaluation device and method | |
CN104731774B (en) | Towards the personalized interpretation method and device of general machine translation engine | |
CN113724848A (en) | Medical resource recommendation method, device, server and medium based on artificial intelligence | |
CN110032631B (en) | Information feedback method, device and storage medium | |
CN106897384B (en) | Method and device for automatically evaluating key points | |
CN112541056A (en) | Medical term standardization method, device, electronic equipment and storage medium | |
US11663518B2 (en) | Cognitive system virtual corpus training and utilization | |
CN115858886B (en) | Data processing method, device, equipment and readable storage medium | |
CN108735198B (en) | Phoneme synthesizing method, device and electronic equipment based on medical conditions data | |
CN114548321A (en) | Self-supervision public opinion comment viewpoint object classification method based on comparative learning | |
CN111339285B (en) | BP neural network-based enterprise resume screening method and system | |
Whitney | Bootstrapping via graph propagation | |
CN113657109A (en) | Method, apparatus and computer device for standardization of model-based clinical terminology | |
CN111553140A (en) | Data processing method, data processing apparatus, and computer storage medium | |
CN111128388A (en) | Value domain data matching method and device and related products | |
CN113065355B (en) | Professional encyclopedia named entity identification method, system and electronic equipment | |
CN107122582A (en) | Towards the diagnosis and treatment class entity recognition method and device of multi-data source | |
CN116757195B (en) | Implicit emotion recognition method based on prompt learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
REG | Reference to a national code | Ref country code: HK; Ref legal event code: DE; Ref document number: 40005490; Country of ref document: HK |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190430 |