CN109994215A - Disease automatic coding system, method, equipment and storage medium

Info

Publication number: CN109994215A
Application number: CN201910338773.3A
Authority: CN
Inventors: 吴及; 周梦强; 刘喜恩; 吕萍
Original assignee: Tsinghua University; iFlytek Co Ltd
Current assignee: Tsinghua University; iFlytek Co Ltd
Priority date: 2019-04-25
Filing date: 2019-04-25
Publication date: 2019-07-09

Abstract

This application provides an automatic disease coding system, method, device, and storage medium. The method includes: obtaining a target object, the target object being a disease name or disease description to be coded; screening codes related to the target object out of a disease code library, the screened codes forming a candidate code set; and determining the code corresponding to the target object from the candidate code set, based on the semantic relation between the target object and the disease name corresponding to each code in the candidate code set. The automatic disease coding method provided by this application can automatically, accurately, and efficiently determine, from the disease code library, the code corresponding to the disease name or disease description to be coded.

Description

Disease automatic coding system, method, equipment and storage medium

Technical field

This application relates to the technical field of medical data coding, and in particular to an automatic disease coding system, method, device, and storage medium.

Background

The International Classification of Diseases (ICD) is the international statistical classification standard for diseases and related health problems, and is an important component of the health information standards system. At present, the ICD has been translated into 43 languages and is used in 117 countries. It is widely used by medical institutions, medical insurers, population management departments, and similar bodies to collect and statistically analyze patient information, and about 70% of global health service expenditure relies on the ICD for medical payment and health resource allocation.

To make disease data easier to store, retrieve, and analyze, the disease names or disease descriptions in clinical diagnoses need to be converted into codes according to the ICD coding rules. This conversion process is disease coding. In essence, disease coding determines, from a disease code library, the code corresponding to a disease name or disease description.

At present, disease coding is mostly done manually: a coder determines, from the disease code library, the code corresponding to a disease name or disease description. However, manual coding is highly subjective, which affects coding accuracy, and it is time-consuming and laborious, so both labor cost and time cost are high.

Summary of the invention

In view of this, the present application provides an automatic disease coding system, method, device, and storage medium, to solve the problems that manual coding in the prior art is highly subjective, which affects coding accuracy, and is time-consuming and laborious, which leads to high labor and time costs. The technical solution is as follows:

An automatic disease coding method, comprising:

obtaining a target object, the target object being a disease name or a disease description;

screening codes related to the target object out of a disease code library, the screened codes forming a candidate code set;

determining the code corresponding to the target object from the candidate code set, based on the semantic relation between the target object and the disease name corresponding to each candidate code in the candidate code set.

Optionally, screening codes related to the target object out of the disease code library comprises:

determining, based on the target object and the disease names corresponding to each category of codes in the disease code library, a target text statistical feature of the target object for each category of codes, wherein the target text statistical feature of the target object for any category of codes characterizes the degree of correlation between that category of codes and the target object;

screening codes related to the target object out of the disease code library, based on the target text statistical feature of the target object for each category of codes.

Optionally, determining the target text statistical feature of the target object for each category of codes, based on the target object and the disease names corresponding to each category of codes in the disease code library, comprises:

classifying the codes in the disease code library to obtain multiple code sets, each code set corresponding to one code category;

determining, based on the target object and the disease names corresponding to each code set, a first text statistical feature, and/or a second text statistical feature, and/or a third text statistical feature, and/or a fourth text statistical feature of the target object for each code set; wherein the disease names corresponding to any code set include the disease name corresponding to each code in that code set, and the first, second, third, and fourth text statistical features of the target object for any code set respectively characterize: the frequency with which each word in the target object appears in the disease names corresponding to the code set; the term frequency-inverse document frequency of each word in the target object in the document formed by the disease names corresponding to the code set; the text similarity between the target object and the disease names corresponding to the code set; and the degree to which the target object matches the keywords and modifiers in the disease names corresponding to the code set;

determining the target text statistical feature of the target object for each category of codes, based on the first text statistical feature, and/or the second text statistical feature, and/or the third text statistical feature, and/or the fourth text statistical feature of the target object for each code set.

Optionally, for any code set, determining the first text statistical feature of the target object for the code set, based on the target object and the disease names corresponding to the code set, comprises:

obtaining the weight of each word in a first word set, wherein the first word set is obtained by deduplicating a second word set, the second word set is the set of words obtained by performing word segmentation on the disease names corresponding to the code set, and the weight of each word in the first word set is determined by the number of times that word occurs in the second word set;

obtaining a target word set, and determining the weight of each word in the target word set based on the weights of the words in the first word set, wherein the target word set is the set of words obtained by performing word segmentation on the target object;

determining the first text statistical feature of the target object for the code set from the weights of the words in the target word set.
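By way of non-limiting illustration, the first text statistical feature could be sketched as follows. The application does not fix the exact weighting formula, so this sketch makes several assumptions: the hypothetical `tokenize` helper splits on whitespace in place of a real word segmenter, a word's weight is its relative frequency in the second word set, and the feature is the sum of the weights of the target words.

```python
from collections import Counter


def tokenize(text):
    # Hypothetical stand-in for a word segmenter; a real system
    # would segment Chinese disease names with a dedicated tool.
    return text.lower().split()


def word_weights(disease_names):
    # Second word set: every word from every disease name, with repeats.
    second_word_set = [w for name in disease_names for w in tokenize(name)]
    # The Counter keys form the first (deduplicated) word set.
    counts = Counter(second_word_set)
    total = len(second_word_set)
    # Assumed weighting: relative frequency within the second word set.
    return {word: count / total for word, count in counts.items()}


def first_feature(target, disease_names):
    weights = word_weights(disease_names)
    # Words of the target that never occur in the code set contribute 0.
    return sum(weights.get(word, 0.0) for word in tokenize(target))
```

Under these assumptions, a target whose words occur often in a code set's disease names receives a higher feature value for that code set than a target that shares no words with it.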

Optionally, for any code set, determining the second text statistical feature of the target object for the code set, based on the target object and the disease names corresponding to the code set, comprises:

obtaining the disease document corresponding to the code set, the disease document corresponding to the code set being formed from the disease names corresponding to the code set;

obtaining a target word set, and determining the term frequency-inverse document frequency of each word in the target word set with respect to the disease document corresponding to the code set, wherein the target word set is the set of words obtained by performing word segmentation on the target object;

determining the second text statistical feature of the target object for the code set from the term frequency-inverse document frequency of each word in the target word set with respect to the disease document corresponding to the code set.
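As a minimal sketch of the second text statistical feature, the following treats each code set's disease document as a list of already segmented words. The smoothed inverse document frequency and the summation over target words are assumptions made for illustration; the application only states that the feature is determined from the term frequency-inverse document frequency values. The names `tfidf_feature`, `docs`, and `key` are hypothetical.

```python
import math
from collections import Counter


def tfidf_feature(target_words, docs, key):
    """Sum of TF-IDF values of the target words for the document docs[key].

    docs maps a code-set identifier to its disease document,
    given here as a list of segmented words.
    """
    doc = docs[key]
    counts = Counter(doc)
    n_docs = len(docs)
    score = 0.0
    for word in target_words:
        # Term frequency of the word inside this code set's document.
        tf = counts[word] / len(doc) if doc else 0.0
        # Number of code-set documents that contain the word.
        df = sum(1 for d in docs.values() if word in d)
        # Assumed smoothed IDF, so unseen words do not divide by zero.
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0
        score += tf * idf
    return score
```

Words that are frequent in one code set's document but rare across the other code sets dominate the sum, which is what makes the feature useful for separating code categories.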

Optionally, for any code set, determining the third text statistical feature of the target object for the code set, based on the target object and the disease names corresponding to the code set, comprises:

calculating the edit distance between the target object and each disease name corresponding to the code set;

determining the third text statistical feature of the target object for the code set from the edit distances between the target object and the disease names corresponding to the code set.
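For illustration only, the third text statistical feature could be computed with the standard Levenshtein edit distance. How the per-name distances are combined into one feature is not specified in the application; the sketch below assumes a length-normalized similarity and takes the best value over the code set. The function names are hypothetical.

```python
def edit_distance(a, b):
    # Classic dynamic-programming Levenshtein distance, one row at a time.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free if equal
            ))
        prev = curr
    return prev[-1]


def third_feature(target, disease_names):
    # Assumed aggregation: best normalized similarity over the code set.
    best = 0.0
    for name in disease_names:
        longest = max(len(target), len(name))
        similarity = 1.0 if longest == 0 else 1.0 - edit_distance(target, name) / longest
        best = max(best, similarity)
    return best
```

Because the distance is computed per character, the same code works unchanged on Chinese disease names.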

Optionally, for any code set, determining the fourth text statistical feature of the target object for the code set, based on the target object and the disease names corresponding to the code set, comprises:

obtaining the attribute graph corresponding to the code set, wherein the attribute graph includes main words and attribute words, a main word being a keyword in the disease names corresponding to the code set, and an attribute word being a modifier of the main word;

matching the target object against the main words and attribute words in the attribute graph corresponding to the code set;

determining the fourth text statistical feature of the target object for the code set, based on how the target object matches the attribute graph corresponding to the code set.
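A rough sketch of the attribute-graph matching is given below. It assumes the attribute graph can be represented as a mapping from each main word to the set of its attribute words, that matching is done by substring containment, and that the feature is one point for a matched main word plus one point per matched attribute word. None of these choices is prescribed by the application.

```python
def fourth_feature(target, attribute_graph):
    """Score how well the target matches a code set's attribute graph.

    attribute_graph maps a main word (keyword) to the set of
    attribute words (modifiers) attached to it.
    """
    best = 0.0
    for main_word, attribute_words in attribute_graph.items():
        if main_word not in target:
            # Without the keyword, the modifiers are not considered.
            continue
        matched = sum(1 for attr in attribute_words if attr in target)
        # Assumed scoring: 1 for the main word, plus 1 per matched modifier.
        best = max(best, 1.0 + matched)
    return best
```

For example, with a graph that attaches the modifiers "acute", "chronic", and "erosive" to the main word "gastritis", the target "chronic erosive gastritis" matches the main word and two modifiers, while a target about an unrelated organ matches nothing.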

Optionally, determining the code corresponding to the target object from the candidate code set, based on the semantic relation between the target object and the disease name corresponding to each candidate code in the candidate code set, comprises:

determining the semantic vector of the disease name corresponding to each candidate code, based on semantic similarity information between the disease name corresponding to that candidate code and the target object;

determining the code corresponding to the target object from the candidate code set, based on the semantic vector of the disease name corresponding to each candidate code.

Optionally, determining the semantic vector of the disease name corresponding to each candidate code, based on semantic similarity information between the disease name corresponding to that candidate code and the target object, comprises:

for any candidate code:

determining the semantic weight of each character in the disease name corresponding to the candidate code, based on the semantic similarity between each character in that disease name and each character in the target object;

determining the semantic vector of the disease name corresponding to the candidate code, based on the semantic vector and the semantic weight of each character in that disease name;

thereby obtaining the semantic vector of the disease name corresponding to each candidate code.

Optionally, determining the semantic weight of each character in the disease name corresponding to the candidate code, based on the semantic similarity between each character in that disease name and each character in the target object, comprises:

determining the semantic vector of each character in the disease name corresponding to the candidate code and the semantic vector of each character in the target object;

for any character in the disease name corresponding to the candidate code, calculating the similarity between the semantic vector of that character and the semantic vector of each character of the target object, and taking the largest of the calculated similarities as the semantic weight of that character, thereby obtaining the semantic weight of each character in the disease name corresponding to the candidate code.

Optionally, determining the code corresponding to the target object from the candidate code set, based on the semantic vector of the disease name corresponding to each candidate code, comprises:

determining a score for each candidate code from the semantic vector of the disease name corresponding to that candidate code, wherein the score of any candidate code characterizes the degree of semantic similarity between the disease name corresponding to that candidate code and the target object;

determining the candidate code with the highest score as the code corresponding to the target object.
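The fine-screening steps above can be sketched as follows, under clearly stated assumptions. Character semantic vectors are taken as given plain Python lists (how they are produced is outside this sketch); similarity is assumed to be cosine similarity; the disease-name vector is assumed to be the weight-scaled sum of its character vectors; and the score is assumed to be the cosine similarity between that vector and the mean of the target's character vectors. Only the "largest similarity as semantic weight" rule and the "highest score wins" rule come directly from the text; all function names are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def semantic_weights(name_vecs, target_vecs):
    # Weight of a name character: its largest similarity to any target character.
    return [max(cosine(c, t) for t in target_vecs) for c in name_vecs]


def weighted_name_vector(name_vecs, weights):
    # Assumed combination: sum of character vectors scaled by their weights.
    dim = len(name_vecs[0])
    return [sum(w * v[k] for w, v in zip(weights, name_vecs)) for k in range(dim)]


def pick_code(target_vecs, candidates):
    """candidates maps each candidate code to the character vectors of its disease name."""
    # Assumed target representation: mean of the target's character vectors.
    target_mean = [sum(col) / len(target_vecs) for col in zip(*target_vecs)]
    scores = {}
    for code, name_vecs in candidates.items():
        weights = semantic_weights(name_vecs, target_vecs)
        scores[code] = cosine(weighted_name_vector(name_vecs, weights), target_mean)
    # The candidate code with the highest score is selected.
    return max(scores, key=scores.get), scores
```

The weighting suppresses characters in a candidate disease name that have no close counterpart in the target, so they contribute little to the name's semantic vector.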

An automatic disease coding system, comprising: an obtaining module, a code coarse-screening module, and a code fine-screening module;

the obtaining module is configured to obtain a target object, the target object being a disease name or a disease description;

the code coarse-screening module is configured to screen codes related to the target object out of a disease code library, the screened codes forming a candidate code set;

the code fine-screening module is configured to determine the code corresponding to the target object from the candidate code set, based on the semantic relation between the target object and the disease name corresponding to each candidate code in the candidate code set.

Optionally, the code coarse-screening module includes: a feature determination module and a related-code screening module;

the feature determination module is configured to determine, based on the target object and the disease names corresponding to each category of codes in the disease code library, a target text statistical feature of the target object for each category of codes, wherein the target text statistical feature of the target object for any category of codes characterizes the degree of correlation between that category of codes and the target object;

the related-code screening module is configured to screen codes related to the target object out of the disease code library, based on the target text statistical feature of the target object for each category of codes.

Optionally, the feature determination module includes: a code classification submodule, a first feature determination submodule, and a second feature determination submodule;

the code classification submodule is configured to classify the codes in the disease code library to obtain multiple code sets, each code set corresponding to one code category;

the first feature determination submodule is configured to determine, based on the target object and the disease names corresponding to each code set, a first text statistical feature, and/or a second text statistical feature, and/or a third text statistical feature, and/or a fourth text statistical feature of the target object for each code set; wherein the disease names corresponding to any code set include the disease name corresponding to each code in that code set, and the first, second, third, and fourth text statistical features of the target object for any code set respectively characterize: the frequency with which each word in the target object appears in the disease names corresponding to the code set; the term frequency-inverse document frequency of each word in the target object in the document formed by the disease names corresponding to the code set; the text similarity between the target object and the disease names corresponding to the code set; and the degree to which the target object matches the keywords and modifiers in the disease names corresponding to the code set;

the second feature determination submodule is configured to determine the target text statistical feature of the target object for each category of codes, based on the first text statistical feature, and/or the second text statistical feature, and/or the third text statistical feature, and/or the fourth text statistical feature of the target object for each code set.

Optionally, when determining, for any code set, the first text statistical feature of the target object for the code set based on the target object and the disease names corresponding to the code set, the first feature determination submodule is specifically configured to:

obtain the weight of each word in a first word set, wherein the first word set is obtained by deduplicating a second word set, the second word set is the set of words obtained by performing word segmentation on the disease names corresponding to the code set, and the weight of each word in the first word set is determined by the number of times that word occurs in the second word set; obtain a target word set, and determine the weight of each word in the target word set based on the weights of the words in the first word set, wherein the target word set is the set of words obtained by performing word segmentation on the target object; and determine the first text statistical feature of the target object for the code set from the weights of the words in the target word set.

Optionally, when determining, for any code set, the second text statistical feature of the target object for the code set based on the target object and the disease names corresponding to the code set, the first feature determination submodule is specifically configured to:

obtain the disease document corresponding to the code set, the disease document being formed from the disease names corresponding to the code set; obtain a target word set, and determine the term frequency-inverse document frequency of each word in the target word set with respect to the disease document corresponding to the code set, wherein the target word set is the set of words obtained by performing word segmentation on the target object; and determine the second text statistical feature of the target object for the code set from the term frequency-inverse document frequency of each word in the target word set with respect to the disease document corresponding to the code set.

Optionally, when determining, for any code set, the third text statistical feature of the target object for the code set based on the target object and the disease names corresponding to the code set, the first feature determination submodule is specifically configured to:

calculate the edit distance between the target object and each disease name corresponding to the code set; and determine the third text statistical feature of the target object for the code set from the edit distances between the target object and the disease names corresponding to the code set.

Optionally, when determining, for any code set, the fourth text statistical feature of the target object for the code set based on the target object and the disease names corresponding to the code set, the first feature determination submodule is specifically configured to:

obtain the attribute graph corresponding to the code set, wherein the attribute graph includes main words and attribute words, a main word being a keyword in the disease names corresponding to the code set, and an attribute word being a modifier of the main word; match the target object against the main words and attribute words in the attribute graph corresponding to the code set; and determine the fourth text statistical feature of the target object for the code set, based on how the target object matches the attribute graph corresponding to the code set.

Optionally, the code fine-screening module includes: a semantic vector determination module and a code screening module;

the semantic vector determination module is configured to determine the semantic vector of the disease name corresponding to each candidate code, based on semantic similarity information between the disease name corresponding to that candidate code and the target object;

the code screening module is configured to determine the code corresponding to the target object from the candidate code set, based on the semantic vector of the disease name corresponding to each candidate code.

Optionally, the semantic vector determination module includes: a weight determination submodule and a semantic vector determination submodule;

the weight determination submodule is configured to, for any candidate code, determine the semantic weight of each character in the disease name corresponding to the candidate code, based on the semantic similarity between each character in that disease name and each character in the target object;

the semantic vector determination submodule is configured to determine the semantic vector of the disease name corresponding to the candidate code, based on the semantic vector and the semantic weight of each character in that disease name.

Optionally, the weight determination submodule is specifically configured to: determine the semantic vector of each character in the disease name corresponding to the candidate code and the semantic vector of each character in the target object; and, for any character in the disease name corresponding to the candidate code, calculate the similarity between the semantic vector of that character and the semantic vector of each character of the target object, and take the largest of the calculated similarities as the semantic weight of that character, thereby obtaining the semantic weight of each character in the disease name corresponding to the candidate code.

Optionally, the code screening module is specifically configured to determine a score for each candidate code from the semantic vector of the disease name corresponding to that candidate code, and to determine the candidate code with the highest score as the code corresponding to the target object; wherein the score of any candidate code characterizes the degree of semantic similarity between the disease name corresponding to that candidate code and the target object.

An automatic disease coding device, comprising: a memory and a processor;

the memory is configured to store a program;

the processor is configured to execute the program to implement each step of the automatic disease coding method.

A readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements each step of the automatic disease coding method.

With the automatic disease coding system, method, device, and storage medium provided by the present application, after the target object (that is, the disease name or disease description to be coded) is obtained, coarse screening is performed first: codes related to the target object are screened out of the disease code library to obtain a candidate code set. Fine screening is then performed: the code corresponding to the target object is determined from the candidate code set, based on the semantic relation between the target object and the disease name corresponding to each code in the candidate code set. It follows from this process that the automatic disease coding method provided by the present application can automatically determine the code corresponding to the target object from the disease code library. Compared with manual coding, this not only saves labor and shortens coding time, but also avoids the influence of subjective factors on coding accuracy. In addition, because the determination is based on the semantic relation between the target object and the disease name corresponding to each code in the candidate code set, the code corresponding to the target object can be determined accurately from the candidate code set.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from the provided drawings without creative effort.

Fig. 1 is a flow diagram of the automatic disease coding method provided by an embodiment of the present application;

Fig. 2 is a flow diagram of the process, provided by an embodiment of the present application, of screening codes related to the target object out of the code library to form a candidate code set;

Fig. 3 is a schematic diagram of an example of the attribute graph corresponding to a code set, provided by an embodiment of the present application;

Fig. 4 is a flow diagram of the process, provided by an embodiment of the present application, of determining the code corresponding to the target object from the candidate code set, based on the semantic relation between the target object and the disease name corresponding to each candidate code in the candidate code set;

Fig. 5 is a schematic diagram of a specific example, provided by an embodiment of the present application, of determining the code corresponding to the target object from the disease code library;

Fig. 6 is a structural diagram of the automatic disease coding system provided by an embodiment of the present application;

Fig. 7 is a structural diagram of the automatic disease coding device provided by an embodiment of the present application.

Detailed description of the embodiments

The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments in the present application, without creative effort, fall within the protection scope of the present application.

In the course of realizing the present application, the inventor found that existing disease coding schemes are mostly manual. The coding quality of a manual scheme depends on the professional ability of the coder, and training excellent coders is costly. Different coders also understand coding granularity, the code library, and the judgment criteria differently, which leads to inconsistent coding results. Furthermore, coding work is tedious, manual coding is inefficient, and coding errors occur easily.

In view of the problems with manual coding, the inventor conducted the following research:

The initial idea was to use an automatic coding scheme based on retrieval technology. Specifically, a disease terminology library is first built, which mainly contains the names, abbreviations, colloquial names, local names, and so on of diseases. Natural language processing techniques are then used to perform word segmentation and keyword extraction on the disease name. Finally, keyword retrieval is used to look up, in the disease terminology library, the standard term most similar to the disease name or disease description to be coded, and the code corresponding to the disease name or disease description to be coded is then determined from the disease code library based on that standard term.

However, an automatic coding scheme based on retrieval technology retrieves many results that are similar at the character-string level. For example, for the disease name "esophageal foreign body", retrieval returns multiple similar results such as "ear canal foreign body", "digestive tract foreign body", and "foreign body in the esophagus", and the scheme has difficulty selecting the correct result from among them.

Given the problems with the above idea, the inventor conducted further research and proposed an automatic coding scheme based on classification technology. The idea of this scheme is to treat disease coding as a classification task in which the codes in the disease code library serve as the labels, and to build a complex neural network model that predicts the classification label of the input disease name or disease description.

Although the automatic coding scheme based on classification technology can determine the code corresponding to the disease name or disease description to be coded, it must perform classification prediction with a complex neural network model, so its computational complexity is high. Moreover, the scheme can perform classification prediction (that is, coding) only for a small number of diseases, on the order of tens. As the number of categories to be classified grows, the computational complexity becomes unbearable and the performance of the neural network model drops sharply, so this scheme lacks robustness and practicality.

In view of the problems with the above schemes, the inventor conducted further in-depth research and finally proposed an automatic disease coding method with good results. The method can automatically, accurately, and efficiently determine, from the disease code library, the code corresponding to a disease name or disease description to be coded. It is applicable to any scenario in which the code corresponding to a disease name or disease description needs to be determined, and it can be applied to a terminal or to a server. The automatic disease coding method provided by the present application is described next.

Referring to Fig. 1, which shows a flow diagram of the automatic disease coding method provided by an embodiment of the present application, the method may include:

Step S101: obtain a target object.

The target object is a disease name or disease description to be coded.

Step S102: screen codes related to the target object out of the disease code library, the screened codes forming a candidate code set.

It should be noted that the code library contains multiple categories of codes. The purpose of this step is to screen out, from the disease code library, the code categories related to the target object, so that all codes under those code categories form the candidate code set.

Specifically, codes related to the target object can be screened out of the disease code library based on text statistical features of the target object with respect to the disease names corresponding to each category of codes.

Step S103: determine the code corresponding to the target object from the candidate code set, based on the semantic relation between the target object and the disease name corresponding to each code in the candidate code set.

Specifically, the code corresponding to the target object can be determined from the candidate code set based on semantic similarity information between the target object and the disease name corresponding to each code in the candidate code set.
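The overall flow of steps S101 to S103 can be sketched as a two-stage function. This is only a structural illustration: `category_score` and `semantic_score` are hypothetical callables standing in for the text statistical features and the semantic scoring, and keeping the `top_k` best-scoring code categories is an assumption, since the selection rule for related categories is not fixed here.

```python
def auto_code(target, code_sets, names, category_score, semantic_score, top_k=2):
    """Two-stage coding sketch: coarse screening by category, then fine screening.

    code_sets: maps a code category to the list of codes in that category.
    names:     maps each code to its disease name.
    """
    # Coarse screening: rank code categories by their relevance to the target.
    ranked = sorted(
        code_sets,
        key=lambda cat: category_score(target, [names[c] for c in code_sets[cat]]),
        reverse=True,
    )
    # All codes under the retained categories form the candidate code set.
    candidates = [code for cat in ranked[:top_k] for code in code_sets[cat]]
    # Fine screening: the highest-scoring candidate is the result
    # (raises ValueError if the candidate code set is empty).
    return max(candidates, key=lambda code: semantic_score(target, names[code]))
```

Separating the two stages keeps the expensive semantic comparison limited to the small candidate code set, while the cheap text statistics handle the full code library.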

With the automatic disease coding method provided by the embodiments of the present application, after the target object (that is, the disease name or disease description to be coded) is obtained, coarse screening is performed first: codes related to the target object are screened out of the disease code library to obtain a candidate code set. Fine screening is then performed: the code corresponding to the target object is determined from the candidate code set, based on the semantic relation between the target object and the disease name corresponding to each code in the candidate code set. It follows from this process that the method can automatically determine the code corresponding to the target object from the disease code library. Compared with manual coding, this not only saves labor and shortens coding time, but also avoids the influence of subjective factors on coding accuracy. In addition, codes related to the target object can be screened out of the disease code library quickly, based on the text statistical features of the target object with respect to the disease names corresponding to each category of codes, and the code corresponding to the target object can be determined accurately from the candidate code set, based on the semantic relation between the target object and the disease name corresponding to each code in the candidate code set. In other words, the automatic disease coding method provided by the embodiments of the present application can determine the code corresponding to the target object automatically, efficiently, and accurately.

Another embodiment of this application describes how "screening codes related to the target object from the code library" in step S102 of the above embodiment is implemented.

Referring to Fig. 2, the process of screening codes related to the target object from the code library to form the candidate code set may include:

Step S201: Based on the target object and the disease names corresponding to each class of codes in the disease code library, determine the target text statistical feature of the target object for each class of codes.

Wherein, the target text statistical feature of the target object for any class of codes characterizes the degree of correlation between that class of codes and the target object.

In one possible implementation, the process of step S201 may include:

Step S2011: Classify the codes in the disease code library to obtain multiple code sets.

Wherein, each code set corresponds to one code category.

It should be noted that, in the ICD coding system, ICD codes are divided by number of digits into ICD 3-digit codes, ICD 4-digit codes and ICD 6-digit codes. The ICD 4-digit and 6-digit codes are the subcategory codes and extension codes of ICD and are detailed subdivisions of the ICD 3-digit codes; the first three characters of an ICD 4-digit or 6-digit code are an ICD 3-digit code.

This embodiment can classify all ICD 6-digit codes by ICD 3-digit code, that is, ICD 6-digit codes whose first three characters are identical are grouped into one class to form a code set. For example, the ICD 6-digit codes whose first three characters are E32 are merged into one class, giving the code set {E32.000, E32.001, E32.002, E32.100, ...}, and the ICD 6-digit codes whose first three characters are K74 are merged into one class, giving the code set {K74.000, K74.001, ...}. Multiple code sets are obtained in this way, and each code set contains multiple ICD 6-digit codes with the same first three characters.
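The grouping in step S2011 can be sketched as follows. This is a minimal illustration, assuming the codes are available as plain strings such as "E32.000"; the function name `group_by_category` and the sample list are hypothetical and not part of the application.

```python
from collections import defaultdict


def group_by_category(codes):
    """Group ICD 6-digit codes into code sets keyed by their first three characters."""
    code_sets = defaultdict(list)
    for code in codes:
        # The first three characters of a 6-digit code are its ICD 3-digit code.
        code_sets[code[:3]].append(code)
    return dict(code_sets)


# Hypothetical sample of the disease code library.
codes = ["E32.000", "E32.001", "E32.002", "E32.100", "K74.000", "K74.001"]
code_sets = group_by_category(codes)
```

Each key of `code_sets` plays the role of a code category, and each value is the corresponding code set.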

Step S2012: Based on the target object and the disease names corresponding to each code set, determine the first text statistical feature, and/or the second text statistical feature, and/or the third text statistical feature, and/or the fourth text statistical feature of the target object for each code set.

Wherein, the disease names corresponding to any code set include the disease name corresponding to each code in that code set.

Wherein, the first text statistical feature of the target object for any code set characterizes the frequency with which each word of the target object appears in the disease names corresponding to that code set; the second text statistical feature characterizes the term frequency-inverse document frequency with which each word of the target object appears in the document composed of the disease names corresponding to that code set; the third text statistical feature characterizes the text similarity between the target object and the disease names corresponding to that code set; the fourth text statistical feature characterizes the degree to which the keyword and qualifiers in the target object match the disease names corresponding to that code set.

How the first, and/or second, and/or third, and/or fourth text statistical feature of the target object for each code set is determined from the target object and the disease names corresponding to each code set is explained in the subsequent embodiments.

Step S2013: Based on the first text statistical feature, and/or the second text statistical feature, and/or the third text statistical feature, and/or the fourth text statistical feature of the target object for each code set, determine the target text statistical feature of the target object for each class of codes.

In one possible implementation, for any code set, any one or more (preferably more than one) of the first, second, third and fourth text statistical features of the target object for that code set can be input into a multilayer perceptron (MLP) to obtain the target text statistical feature of the target object for that code set. The target text statistical feature of the target object for each code set is obtained in this way, and these are the target text statistical features of the target object for each class of codes.
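As a rough illustration of step S2013, the four features of one code set can be passed through a small multilayer perceptron to give a single score. The application does not specify the network size, activations or parameters, so the single hidden layer, the ReLU and sigmoid activations, and every weight below are assumptions chosen only to make the sketch runnable; in practice the parameters would be learned.

```python
import math


def mlp_score(features, w1, b1, w2, b2):
    """Forward pass of a one-hidden-layer MLP (ReLU hidden units, sigmoid output)."""
    hidden = [
        max(0.0, sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(w1, b1)
    ]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


# Hypothetical parameters: 4 inputs, 2 hidden units, 1 output.
W1 = [[0.8, 0.5, -0.3, 0.6], [0.2, 0.9, -0.5, 0.4]]
B1 = [0.0, 0.1]
W2 = [1.2, 0.7]
B2 = -0.5

# Hypothetical first to fourth text statistical features of a target object for one code set.
features = [0.5, 0.31, 2.0, 0.3]
score = mlp_score(features, W1, B1, W2, B2)
```

The returned value stands in for the target text statistical feature of the target object for that code set.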

Step S202: Based on the target text statistical feature of the target object for each class of codes, screen codes related to the target object from the disease code library.

Specifically, the target text statistical feature of the target object for any class of codes is the score of the target object for that class. In one possible implementation, the N highest scores are selected from the scores of the target object for all classes of codes, and all codes under the code categories corresponding to those N scores are taken as the codes related to the target object. For example, the scores of the target object for all classes of codes can be sorted in descending order, and all codes under the code categories corresponding to the top N scores taken as the codes related to the target object, where N can be set according to the actual situation.
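The top-N selection of step S202 can be sketched as below. The scores and the code sets are made-up values for illustration, and `select_top_n` is a hypothetical helper name.

```python
def select_top_n(scores, code_sets, n):
    """Return the n best-scoring categories and all codes under them."""
    top_categories = sorted(scores, key=scores.get, reverse=True)[:n]
    candidates = []
    for category in top_categories:
        candidates.extend(code_sets[category])
    return top_categories, candidates


# Hypothetical scores of one target object for three code categories.
scores = {"E32": 0.91, "K74": 0.12, "J31": 0.35}
code_sets = {
    "E32": ["E32.000", "E32.001"],
    "K74": ["K74.000"],
    "J31": ["J31.100"],
}
top, candidates = select_top_n(scores, code_sets, n=2)
```

Here `candidates` is the candidate code set that is handed to the semantic fine screening.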

The following describes, one by one, how the first, second, third and fourth text statistical features of the target object for each code set are determined from the target object and the disease names corresponding to each code set.

Because the first, second, third and fourth text statistical features are determined in the same way for every code set, the determination of each text statistical feature is described below using one code set C as an example.

Based on the target object and the disease names corresponding to code set C, the process of determining the first text statistical feature of the target object for code set C includes:

Step a1: Obtain the weight of each word in the first word set.

Wherein, the first word set is obtained by deduplicating the second word set, the second word set is the set of words obtained by performing word segmentation on the disease names corresponding to code set C, and the weight of each word in the first word set is determined by the number of times that word occurs in the second word set.

It should be noted that, because the weight of each word in the first word set is usually fixed, the weights can be determined and stored in advance and retrieved directly when a target object needs to be coded; of course, they can also be determined at the time the target object is coded.

In this embodiment, the process of determining the weight of each word in the first word set may include: performing word segmentation (including stop-word removal, punctuation removal, synonym replacement and so on) on the disease name corresponding to each code in code set C, the words obtained forming the second word set; deduplicating the second word set to obtain the first word set; counting the number of times each word in the first word set occurs in the second word set; and determining the weight of each word in the first word set from those counts.

Illustratively, the codes in code set C, the disease name corresponding to each code, and the word segmentation result of each disease name are shown in Table 1 below:

Table 1: Information about a code set

The word segmentation results in Table 1 form the second word set. Deduplicating the words in the second word set gives the first word set {thymus, hyperplasia, thymopathy, congenital, atrophy, other, abscess, tumor, hypertrophy, persistent, disease}. Counting the number of times each word in the first word set occurs in the second word set gives 7, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1 respectively. The weight of each word in the first word set is determined from these counts: the words in the first word set occur 18 times in total in the second word set, so the weight of "thymus" is 7/18, the weight of "hyperplasia" is 2/18, and the weight of every other word is 1/18. Table 2 below shows the number of occurrences and the weight of each word in the first word set:

Table 2: Number of occurrences and weight of each word in the first word set

Step a2: Obtain the target word set.

Wherein, the target word set is the set of words obtained by performing word segmentation on the target object.

Illustratively, if the target object is "mediastinal mass: thymic hyperplasia", word segmentation gives "mediastinum/mass/thymus/hyperplasia", so the target word set is {mediastinum, mass, thymus, hyperplasia}.

Step a3: Based on the weight of each word in the first word set, determine the weight of each word in the target word set.

Illustratively, the target word set is {mediastinum, mass, thymus, hyperplasia} and the weight of each word in the first word set is as shown in Table 2. From Table 2, the weight of "thymus" is 0.39 and the weight of "hyperplasia" is 0.11; "mediastinum" and "mass" do not appear in Table 2, so their weights are 0.

Step a4: From the weights of the words in the target word set, determine the first text statistical feature of the target object for code set C.

Specifically, the weights of the words in the target word set can be summed, and the sum taken as the first text statistical feature of the target object for code set C. Illustratively, the weights of the words in the target word set {mediastinum, mass, thymus, hyperplasia} are 0, 0, 0.39 and 0.11 respectively, so 0 + 0 + 0.39 + 0.11 = 0.5 is the first text statistical feature of the target object for code set C.
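Steps a1 to a4 can be sketched as follows. Because the contents of Table 1 are not reproduced here, the second word set below is rebuilt only from the stated counts (7, 2 and nine words occurring once); the English word glosses, including the two target words absent from the code set vocabulary (written here as "mediastinum" and "mass"), and the helper names are assumptions for illustration.

```python
from collections import Counter


def word_weights(second_word_set):
    """Step a1: weight of each distinct word = its count / total number of words."""
    counts = Counter(second_word_set)
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def first_feature(target_words, weights):
    """Steps a3 and a4: sum the weights of the target words (0 for unknown words)."""
    return sum(weights.get(word, 0.0) for word in target_words)


# Second word set reconstructed from the counts given in the text (18 words in total).
second_word_set = (
    ["thymus"] * 7
    + ["hyperplasia"] * 2
    + ["thymopathy", "congenital", "atrophy", "other", "abscess",
       "tumor", "hypertrophy", "persistent", "disease"]
)
weights = word_weights(second_word_set)

# Step a2: target word set for the example target object.
target_words = ["mediastinum", "mass", "thymus", "hyperplasia"]
f1 = first_feature(target_words, weights)  # 7/18 + 2/18
```

Since the weights depend only on the code set, `word_weights` can be run once per code set and cached, as the text notes.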

Based on the target object and the disease names corresponding to code set C, the process of determining the second text statistical feature of the target object for code set C may include:

Step b1: Obtain the disease document corresponding to code set C.

Wherein, the disease document corresponding to code set C is composed of the disease names corresponding to code set C; that is, the disease names corresponding to the codes in code set C are combined into one document, which serves as the disease document corresponding to code set C.

Step b2: Obtain the target word set.

Step b3: Determine the term frequency-inverse document frequency (TF-IDF) with which each word in the target word set appears in the disease document corresponding to code set C.

Specifically, for any word in the target word set, first calculate the term frequency with which the word appears in the disease document corresponding to code set C and the inverse document frequency of the word with respect to the target corpus; then determine, from these two values, the TF-IDF with which the word appears in the disease document corresponding to code set C. The target corpus includes the disease document corresponding to each code set. It should be noted that, besides the disease names corresponding to the code set, each disease document in the target corpus also includes extended disease names; for example, disease document T includes the disease name corresponding to each code in code set X and the extended names of those disease names.

Further, the term frequency with which any word t in the target word set appears in the disease document d corresponding to code set C can be determined by the following formula:

Wherein, n_t is the number of times the word t in the target word set occurs in the disease document d corresponding to code set C, N_d is the total number of words in the disease document d corresponding to code set C, and TF(t, d) is the term frequency with which word t appears in the disease document d corresponding to code set C. It should be noted that, in practice, TF(t, d) usually needs to be normalized:

Wherein, t' is a word in the disease document d corresponding to code set C.

The inverse document frequency of any word t in the target word set with respect to the target corpus D can be determined by the following formula:

Wherein, M is the total number of documents in the target corpus D, m_t is the number of documents in the target corpus D that contain word t, and IDF(t, D) is the inverse document frequency of the word t in the target word set with respect to the target corpus D.

After TF'(t, d) and IDF(t, D) are obtained, the TF-IDF with which any word t in the target word set appears in the disease document d corresponding to code set C, TF-IDF(t, d, D), is determined by the following formula:

TF-IDF(t, d, D) = TF'(t, d) * IDF(t, D)    (4)

The above process gives the TF-IDF with which each word in the target word set appears in the disease document corresponding to code set C.

Step b4: From the TF-IDF with which each word in the target word set appears in the disease document corresponding to code set C, determine the second text statistical feature of the target object for code set C.

Specifically, the TF-IDF values with which the words in the target word set appear in the disease document corresponding to code set C can be summed, and the sum taken as the second text statistical feature of the target object for code set C.
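The computation of the second text statistical feature can be sketched as follows. Only formula (4) and the symbol definitions are available in the text, so the concrete forms used here are assumptions: TF(t, d) = n_t / N_d, normalization by the largest term frequency in d, and IDF(t, D) = log(M / m_t), with the IDF set to 0 when no document contains the word. The tiny corpus is invented for illustration.

```python
import math


def tf(word, doc_words):
    """Assumed form: n_t / N_d."""
    return doc_words.count(word) / len(doc_words)


def normalized_tf(word, doc_words):
    """Assumed normalization: divide by the largest TF of any word t' in d."""
    max_tf = max(tf(other, doc_words) for other in set(doc_words))
    return tf(word, doc_words) / max_tf


def idf(word, corpus):
    """Assumed form: log(M / m_t); 0 when the word occurs in no document."""
    m_t = sum(1 for doc_words in corpus.values() if word in doc_words)
    return math.log(len(corpus) / m_t) if m_t else 0.0


def second_feature(target_words, category, corpus):
    """Sum of TF' * IDF (formula (4)) over the words of the target word set."""
    doc_words = corpus[category]
    return sum(normalized_tf(w, doc_words) * idf(w, corpus) for w in target_words)


# Hypothetical target corpus: one segmented disease document per code set.
corpus = {
    "E32": ["thymus", "hyperplasia", "thymus", "abscess"],
    "K74": ["liver", "cirrhosis", "fibrosis"],
    "C11": ["nasopharynx", "malignant", "tumor"],
}
target_words = ["mediastinum", "mass", "thymus", "hyperplasia"]
f2_e32 = second_feature(target_words, "E32", corpus)
f2_k74 = second_feature(target_words, "K74", corpus)
```

Other common TF-IDF variants (for example smoothed IDF) would change the numbers but not the overall flow of steps b1 to b4.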

Based on the target object and the disease names corresponding to code set C, the process of determining the third text statistical feature of the target object for code set C may include:

Step c1: Calculate the edit distance between the target object and each disease name corresponding to code set C.

It should be noted that the edit distance between two texts is the minimum number of character modifications needed to convert one text into the other; it is a quantitative measure of how different or how similar two texts are.

This step determines the degree of textual difference (or similarity) between the target object and the disease name corresponding to each code in code set C by calculating the edit distance between them. Preferably, the edit distance to the target object can be calculated for both the disease name and the extended disease names corresponding to each code in code set C.

Step c2: From the calculated edit distances, determine the third text statistical feature of the target object for code set C.

Specifically, the smallest of the calculated edit distances can be taken as the third text statistical feature of the target object for code set C.

It should be noted that if the edit distance between the target object and the disease name or extended disease name corresponding to some code in code set C is 0, the two texts are identical, and that code is directly determined as the code corresponding to the target object.
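Steps c1 and c2, together with the exact-match shortcut, can be sketched with a standard Levenshtein distance. The disease names attached to the codes below are illustrative placeholders rather than actual library entries, and `third_feature` is a hypothetical helper name.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def third_feature(target, names_by_code):
    """Return (smallest edit distance, code of an exact match or None)."""
    best_code, best = None, None
    for code, names in names_by_code.items():
        for name in names:  # disease name plus any extended names
            distance = edit_distance(target, name)
            if best is None or distance < best:
                best_code, best = code, distance
    return best, (best_code if best == 0 else None)


# Placeholder names for two codes of one code set.
names_by_code = {
    "E32.000": ["thymic hyperplasia", "hyperplasia of thymus"],
    "E32.100": ["thymic abscess"],
}
exact = third_feature("thymic hyperplasia", names_by_code)
near = third_feature("thymic hyperplasias", names_by_code)
```

A smaller value means a closer match, so this feature is oriented in the opposite direction from the other three.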

Based on the target object and the disease names corresponding to code set C, the process of determining the fourth text statistical feature of the target object for code set C includes:

Step d1: Obtain the attribute graph corresponding to code set C.

Wherein, the attribute graph includes a main word and attribute words and reflects the relationship between them; the main word is the keyword in the disease names corresponding to code set C, and the attribute words are the qualifiers of the main word.

In this embodiment, the attribute graph corresponding to code set C can be constructed in advance. Specifically, the process of constructing the attribute graph for code set C may include:

1) Determine the main word from the disease names corresponding to code set C.

Wherein, the main word is the keyword in a disease name; for example, "meningitis" is the main word in "tuberculous meningitis", and "respiratory failure" is the main word in "chronic respiratory failure".

2) Determine, from the disease names corresponding to code set C, the attribute words that modify the main word, and determine the weight of each attribute word.

In one possible implementation, the attribute words that modify the main word can be determined from the disease names corresponding to the code set on the basis of predefined attributes; one attribute word set is obtained for each predefined attribute.

Wherein, the predefined attributes may include one or more of the following: cause, pathology, site, stage or severity, primary or secondary, multiple or single, acute or chronic, pathogen, age or period, sequela, complication, and so on.

For any attribute word, its weight can be determined from the number of times it occurs in the disease names corresponding to code set C; the weight of each attribute word is obtained in this way.

3) Construct the attribute graph from the determined main word, attribute words and attribute word weights.

Optionally, the attribute graph can take the following form: the main word is at the center, and the attribute word information corresponding to each predefined attribute forms the branches, one branch per attribute (the attribute word information of an attribute includes the attribute word set of that attribute and the weight of each attribute word in the set).

Referring to Fig. 3, which shows a schematic diagram of the attribute graph of the code set formed by the codes whose first three characters are K74. As Fig. 3 shows, the center of the attribute graph is the main word "cirrhosis", and each branch of the main word is the attribute word set of one predefined attribute (such as pathology, cause, pathogen, complication and so on) together with the weight of each attribute word in that set. For example, one branch is the attribute word set {liver fibrosis, biliary, necrotic, nodular, ...} of the attribute "pathology", in which the attribute words have the weights 0.2, 0.3, 0.1, 0.1, ... respectively.

Step d2: Match the target object against the main word and attribute words in the attribute graph corresponding to code set C, and determine the fourth text statistical feature of the target object for code set C from how the target object matches the attribute graph.

Step d2 can be implemented in several ways. In one possible implementation:

First, word segmentation is performed on the target object to obtain multiple target words. Then a main word is determined from the target words and matched against the main word in the attribute graph. If the match succeeds, the target object contains the main word of the attribute graph, and the target words other than the main word are matched against the attribute words in the attribute graph; the attribute words that match those target words are taken as target attribute words, their weights are summed, and the sum is taken as the fourth text statistical feature of the target object for code set C. If the main word determined from the target words fails to match the main word in the attribute graph, the target object does not contain the main word of the attribute graph, and the fourth text statistical feature of the target object for code set C is determined to be 0.

The above implementation requires a main word to be chosen from the target object. The main word is critical: an improper choice affects the fourth text statistical feature and, in turn, the accuracy of the subsequent code screening. For this reason, in another possible implementation no main word is chosen; instead, each word of the target object is tried as the main word. Specifically, word segmentation is first performed on the target object to obtain multiple target words, and the target words are then traversed. The word currently being traversed is taken as the main word and matched against the main word in the attribute graph. If the match succeeds, the target words other than this main word are matched against the attribute words in the attribute graph, the weights of the attribute words that match are summed, and the sum is taken as one candidate feature. If the match fails, 0 is taken as the candidate feature. The next target word is then traversed, until all target words have been traversed. The traversal yields multiple candidate features, and the candidate feature with the largest value is taken as the fourth text statistical feature of the target object for code set C.
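The traversal variant of step d2 can be sketched as follows. The attribute graph is represented as a plain dictionary, which is an assumed data structure; the single branch shown uses the "pathology" weights quoted from Fig. 3, and the segmented target words are hypothetical.

```python
# Assumed representation of an attribute graph: one main word plus one
# {attribute word: weight} branch per predefined attribute.
graph = {
    "main_word": "cirrhosis",
    "attributes": {
        "pathology": {"liver fibrosis": 0.2, "biliary": 0.3,
                      "necrotic": 0.1, "nodular": 0.1},
    },
}


def fourth_feature(target_words, graph):
    """Try every target word as the main word and keep the best candidate feature."""
    attr_weights = {}
    for branch in graph["attributes"].values():
        # If a word occurs in several branches, the last weight wins in this sketch.
        attr_weights.update(branch)

    candidates = []
    for i, word in enumerate(target_words):
        if word != graph["main_word"]:
            candidates.append(0.0)  # main word match failed
            continue
        others = target_words[:i] + target_words[i + 1:]
        candidates.append(sum(attr_weights.get(w, 0.0) for w in others))
    return max(candidates, default=0.0)


f4_match = fourth_feature(["biliary", "nodular", "cirrhosis"], graph)
f4_miss = fourth_feature(["biliary", "hepatitis"], graph)
```

Exact string equality is used for matching here; a real system would presumably also apply the synonym replacement mentioned for word segmentation.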

It should be noted that the code screening described above, which is based on the target text statistical features of the target object for each class of codes, is essentially screening on character-level features. Such screening can rarely single out the code corresponding to the target object from the disease code library; it can only select codes related to the target object. For this reason, the application performs further screening at the semantic level, that is, it determines the code corresponding to the target object from the candidate code set based on the semantic relationship between the target object and the disease name corresponding to each candidate code in the candidate code set.

The following describes how the code corresponding to the target object is determined from the candidate code set based on the semantic relationship between the target object and the disease name corresponding to each candidate code in the candidate code set. The process may include:

Step S401: Based on semantic similarity information between the target object and the disease name corresponding to each candidate code, determine the semantic vector of the disease name corresponding to each candidate code.

Specifically, for any candidate code c, the process of determining the semantic vector of the disease name corresponding to candidate code c from the semantic similarity information between that disease name and the target object may include:

Step S4011: Based on the semantic similarity between each character in the disease name corresponding to candidate code c and each character of the target object, determine the semantic weight of each character in the disease name corresponding to candidate code c.

Specifically, this determination includes:

1) Determine the semantic vector of each character in the disease name corresponding to candidate code c and the semantic vector of each character in the target object.

The process of determining the semantic vector of each character in the disease name corresponding to candidate code c includes: first splitting the disease name into characters to obtain multiple characters; then representing each character as a vector; and then inputting the vectors representing the characters into a network that can extract contextual semantic information (such as an LSTM network) to obtain the semantic vector of each character. It should be noted that such a network extracts character-level contextual information for each input character and outputs the semantic vector of each character. The semantic vector of each character of the target object is determined in a similar way, which is not repeated here.

2) For any character in the disease name corresponding to candidate code c, calculate the similarity between the semantic vector of that character and the semantic vector of each character of the target object, and take the largest of the calculated similarities as the semantic weight of that character. The semantic weight of each character in the disease name corresponding to candidate code c is obtained in this way.

Specifically, the semantic similarity S(i, j) between the i-th character in the disease name corresponding to candidate code c and the j-th character of the target object is calculated by the following formula:

Wherein, the two vectors in the formula are the semantic vector of the i-th character in the disease name corresponding to candidate code c and the semantic vector of the j-th character of the target object, respectively.

Step S4012: Based on the semantic vector and semantic weight of each character in the disease name corresponding to candidate code c, determine the semantic vector of the disease name corresponding to candidate code c.

Specifically, the semantic vector of each character in the disease name corresponding to candidate code c is first reconstructed from that character's semantic vector and semantic weight; the reconstructed semantic vectors are then concatenated, and the vector obtained after concatenation is taken as the semantic vector of the disease name corresponding to candidate code c.

Wherein, the semantic vector of each character in the disease name corresponding to candidate code c can be reconstructed from that character's semantic vector and semantic weight using the following formula:

Wherein, w_i is the semantic weight of the i-th character in the disease name corresponding to candidate code c, and the result of the formula is the semantic vector of the i-th character reconstructed from w_i and the character's original semantic vector.

The above process gives the semantic vector of the disease name corresponding to each code in the candidate code set.
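Steps S4011 and S4012 can be sketched as follows. The similarity and reconstruction formulas are not reproduced in the text, so cosine similarity and scaling each character vector by its weight are assumptions made for this sketch. The character vectors would normally come from a context encoder such as an LSTM; here they are hand-written two-dimensional placeholders.

```python
import math


def cosine(u, v):
    """Assumed similarity measure S(i, j) between two character vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def semantic_weights(name_vecs, target_vecs):
    """Weight of each name character = its largest similarity to any target character."""
    return [max(cosine(h, g) for g in target_vecs) for h in name_vecs]


def name_vector(name_vecs, target_vecs):
    """Scale each character vector by its weight (assumed), then concatenate."""
    weights = semantic_weights(name_vecs, target_vecs)
    vector = []
    for w, h in zip(weights, name_vecs):
        vector.extend(w * x for x in h)
    return vector


# Placeholder semantic vectors: two characters in the disease name, two in the target.
name_vecs = [[1.0, 0.0], [0.0, 2.0]]
target_vecs = [[2.0, 0.0], [1.0, 1.0]]
weights = semantic_weights(name_vecs, target_vecs)
vec = name_vector(name_vecs, target_vecs)
```

The effect is that characters of a candidate disease name with a close counterpart in the target object keep their magnitude, while unrelated characters are damped before scoring.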

Step S402: Based on the semantic vector of the disease name corresponding to each candidate code in the candidate code set, determine the code corresponding to the target object from the candidate code set.

Specifically, the score of each candidate code in the candidate code set can be determined from the semantic vector of the disease name corresponding to that candidate code, where the score of any candidate code characterizes the degree of semantic similarity between the target object and the disease name corresponding to that candidate code; the candidate code with the highest score in the candidate code set is determined as the code corresponding to the target object.

In one possible implementation, the semantic vector of the disease name corresponding to each candidate code in the candidate code set can be input into a multilayer perceptron (MLP) to obtain the score of each candidate code in the candidate code set.

Referring to Fig. 5, which shows an example of determining the code corresponding to a target object. The target object is "nasopharyngeal differentiated non-keratinizing carcinoma". Based on its target text statistical features for each class of codes in the disease code library, 10 ICD 3-digit codes are obtained, such as "J31 - chronic rhinitis, nasopharyngitis and pharyngitis" and "C11 - malignant tumor of nasopharynx", and all ICD 6-digit codes under each of these ICD 3-digit codes form the candidate code set. Then, based on the semantic similarity information between each candidate code in the candidate code set and the target object "nasopharyngeal differentiated non-keratinizing carcinoma", the semantic vector of each candidate code is reconstructed, and the score of each candidate code in the candidate code set is determined from its semantic vector; for example, J31.100 scores 0.1257 and C11.900 scores 0.7329. The candidate code with the highest score is taken as the code corresponding to the target object. As shown in Fig. 5, the highest-scoring code is C11.900, so C11.900 is determined as the code corresponding to "nasopharyngeal differentiated non-keratinizing carcinoma".
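The final selection in the Fig. 5 example amounts to taking the highest-scoring candidate. A minimal sketch, using only the two scores quoted above and a hypothetical helper name:

```python
def best_code(scores):
    """Return the candidate code with the highest score."""
    return max(scores, key=scores.get)


# Two of the candidate scores quoted for the Fig. 5 example.
scores = {"J31.100": 0.1257, "C11.900": 0.7329}
chosen = best_code(scores)
```

In a full system the dictionary would hold one score per code in the candidate code set.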

The automatic disease coding method provided by the embodiments of this application can screen codes related to the target object (that is, the disease name or disease description to be coded) from the disease code library according to character-level features to obtain a candidate code set, and can then screen the code corresponding to the target object from the candidate code set at the semantic level. The character-level coarse screening reduces the complexity of the subsequent fine screening while preserving accuracy, and the semantic-level fine screening ensures that the code corresponding to the target object is selected accurately from the candidate code set. Because the method determines the code corresponding to the target object from the disease code library automatically, it saves labor and shortens coding time compared with manual coding, and it avoids the influence of subjective human factors on coding accuracy. Moreover, through the character-level coarse screening and the semantic-level fine screening, the application can determine the code corresponding to the target object from the disease code library efficiently and accurately.

The embodiments of this application also provide an automatic disease coding system, which is described below. The automatic disease coding system described below and the automatic disease coding method described above may be referred to in correspondence with each other.

Referring to Fig. 6, which shows a schematic structural diagram of an automatic disease coding system provided by the embodiments of this application. The system may include: an obtaining module 601, a code coarse screening module 602 and a code fine screening module 603.

The obtaining module 601 is configured to obtain a target object, the target object being a disease name or a disease description.

The code coarse screening module 602 is configured to screen codes related to the target object from the disease code library, the screened codes forming a candidate code set.

The code fine screening module 603 is configured to determine the code corresponding to the target object from the candidate code set based on the semantic relationship between the target object and the disease name corresponding to each candidate code in the candidate code set.
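The three-module structure can be sketched as a small class that wires the modules together. The class name, method names and the stub screening functions are hypothetical; real coarse and fine screening modules would implement the text statistical and semantic steps of the method embodiments.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DiseaseAutoCodingSystem:
    # Stand-in for the code coarse screening module 602.
    coarse_screen: Callable[[str], List[str]]
    # Stand-in for the code fine screening module 603.
    fine_screen: Callable[[str, List[str]], str]

    def obtain(self, raw: str) -> str:
        """Stand-in for the obtaining module 601: normalize the input text."""
        return raw.strip()

    def encode(self, raw: str) -> str:
        target = self.obtain(raw)
        candidates = self.coarse_screen(target)
        return self.fine_screen(target, candidates)


# Stub modules used only to show the data flow.
system = DiseaseAutoCodingSystem(
    coarse_screen=lambda target: ["J31.100", "C11.900"],
    fine_screen=lambda target, candidates: candidates[-1],
)
result = system.encode("  some disease name  ")
```

Keeping the two screening stages as separate callables mirrors the module split and lets either stage be replaced independently.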

After obtaining the target object (i.e., the disease name or disease description to be coded), the disease automatic coding system provided by the embodiments of the present application first performs coarse screening, that is, screens out codes related to the target object from the disease code library to obtain a candidate code set. It then performs fine screening, that is, determines the code corresponding to the target object from the candidate code set based on the semantic relation between the target object and the disease name corresponding to each code in the candidate code set. As the above process shows, the system determines the code corresponding to the target object from the disease code library automatically; compared with manual coding, this saves labor, shortens coding time, and avoids the influence of subjective factors on coding accuracy. In addition, because the determination is based on the semantic relation between the target object and the disease name corresponding to each code in the candidate code set, the code corresponding to the target object can be determined accurately from the candidate code set. The disease automatic coding system provided by the embodiments of the present application can therefore determine the code corresponding to the target object automatically, efficiently, and accurately.
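As an illustration only, the two-stage flow described above (character-level screening to build a candidate code set, followed by semantic-level selection) might be sketched as follows. The function names, the use of character-set overlap as the character-level feature, the `top_k` cutoff, and the pluggable `semantic_score` callable are all assumptions made for this sketch and are not specified by the present application; the codes `X1` to `X3` are hypothetical.

```python
from typing import Callable, Dict, List


def char_overlap(a: str, b: str) -> float:
    """Jaccard overlap of the character sets of two strings (a stand-in character-level feature)."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def coarse_screen(target: str, code_library: Dict[str, str], top_k: int = 2) -> List[str]:
    """Keep the top_k codes whose disease names are closest to the target at the character level."""
    ranked = sorted(
        code_library,
        key=lambda code: char_overlap(target, code_library[code]),
        reverse=True,
    )
    return ranked[:top_k]


def fine_screen(
    target: str,
    candidates: List[str],
    code_library: Dict[str, str],
    semantic_score: Callable[[str, str], float],
) -> str:
    """Return the candidate whose disease name scores highest under the supplied scorer."""
    if not candidates:
        raise ValueError("candidate code set is empty")
    return max(candidates, key=lambda code: semantic_score(target, code_library[code]))


def auto_code(
    target: str,
    code_library: Dict[str, str],
    semantic_score: Callable[[str, str], float],
    top_k: int = 2,
) -> str:
    """Coarse screening followed by fine screening."""
    candidates = coarse_screen(target, code_library, top_k)
    return fine_screen(target, candidates, code_library, semantic_score)


# Toy library with hypothetical codes; a real library would hold standard disease codes.
library = {"X1": "acute appendicitis", "X2": "chronic appendicitis", "X3": "pneumonia"}
# char_overlap is reused here only as a placeholder for a real semantic scorer.
chosen = auto_code("acute appendicitis", library, semantic_score=char_overlap)
```

Because the fine-screening stage only scores the candidates that survive the coarse stage, an expensive semantic scorer is evaluated on `top_k` names rather than on the whole library.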

In one possible implementation, the code coarse-screening module 602 in the disease automatic coding system provided by the above embodiment may include: a feature determination module and a related-code screening module.

In one possible implementation, the above feature determination module includes: a code classification submodule, a first feature determination submodule, and a second feature determination submodule.

The code classification submodule is configured to classify the codes in the disease code library to obtain multiple code sets, each code set corresponding to one code category.

The first feature determination submodule is configured to determine, based on the target object and the disease names corresponding to each code set, a first text statistical feature, and/or a second text statistical feature, and/or a third text statistical feature, and/or a fourth text statistical feature of the target object for each code set. The disease names corresponding to any code set include the disease name corresponding to each code in that code set. For any code set, the first, second, third, and fourth text statistical features of the target object respectively characterize: the frequency with which each word in the target object appears in the disease names corresponding to the code set; the term frequency-inverse document frequency of each word in the target object in the document composed of the disease names corresponding to the code set; the text similarity between the target object and the disease names corresponding to the code set; and the degree of matching between the target object and the keywords and qualifiers in the disease names corresponding to the code set.

In one possible implementation, when determining, for any code set, the first text statistical feature of the target object for the code set based on the target object and the disease names corresponding to the code set, the above first feature determination submodule is specifically configured to: obtain the weight of each word in a first word set, where the first word set is obtained by deduplicating a second word set, the second word set is the set of words obtained by performing word segmentation on the disease names corresponding to the code set, and the weight of each word in the first word set is determined by the number of times that word appears in the second word set; obtain a target word set and determine the weight of each word in the target word set based on the weights of the words in the first word set, where the target word set is the set of words obtained by performing word segmentation on the target object; and determine the first text statistical feature of the target object for the code set from the weights of the words in the target word set.
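A minimal sketch of the first text statistical feature is given below for illustration. The whitespace tokenizer stands in for real word segmentation, and both the normalization of counts into weights and the averaging of target-word weights into a single number are assumptions of this sketch; this passage only states that the weights are determined by occurrence counts and that the feature is determined from the target-word weights.

```python
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for word segmentation: lowercase whitespace split."""
    return text.lower().split()


def word_weights(disease_names: List[str]) -> Dict[str, float]:
    """Weights of the first word set, derived from counts in the second word set."""
    # Second word set: every word from every disease name, duplicates kept.
    second_word_set = [word for name in disease_names for word in tokenize(name)]
    # The Counter keys form the deduplicated first word set.
    counts = Counter(second_word_set)
    total = len(second_word_set)
    # Assumption: weight = relative frequency of the word in the second word set.
    return {word: count / total for word, count in counts.items()} if total else {}


def first_text_feature(target: str, disease_names: List[str]) -> float:
    """Average weight of the target words; words unseen in the code set weigh 0."""
    weights = word_weights(disease_names)
    target_words = tokenize(target)
    if not target_words:
        return 0.0
    return sum(weights.get(word, 0.0) for word in target_words) / len(target_words)


names = ["acute appendicitis", "chronic appendicitis"]
feature = first_text_feature("acute appendicitis", names)
```

In this toy code set "appendicitis" accounts for two of the four words and "acute" for one, so a target made of those two words averages their weights.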

In one possible implementation, when determining, for any code set, the second text statistical feature of the target object for the code set based on the target object and the disease names corresponding to the code set, the above first feature determination submodule is specifically configured to: obtain the disease document corresponding to the code set, the disease document being composed of the disease names corresponding to the code set; obtain a target word set and determine the term frequency-inverse document frequency of each word in the target word set with respect to the disease document corresponding to the code set, where the target word set is the set of words obtained by performing word segmentation on the target object; and determine the second text statistical feature of the target object for the code set from the term frequency-inverse document frequencies of the words in the target word set with respect to the disease document corresponding to the code set.
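The term frequency-inverse document frequency computation described above might look like the following sketch. The whitespace tokenizer, the smoothed form of the inverse document frequency, the treatment of every code set's disease document as the corpus, and the summation over target words are assumptions chosen for the sketch; the category labels `X1` and `X3` are hypothetical.

```python
import math
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Hypothetical stand-in for word segmentation: lowercase whitespace split."""
    return text.lower().split()


def build_documents(code_sets: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """One disease document per code set, made of all words in its disease names."""
    return {
        category: [word for name in names for word in tokenize(name)]
        for category, names in code_sets.items()
    }


def tf_idf(word: str, doc: List[str], all_docs: List[List[str]]) -> float:
    """TF-IDF of a word with respect to one disease document."""
    if not doc:
        return 0.0
    tf = doc.count(word) / len(doc)
    df = sum(1 for d in all_docs if word in d)
    # Assumption: a smoothed IDF so that unseen words do not divide by zero.
    idf = math.log((1 + len(all_docs)) / (1 + df)) + 1.0
    return tf * idf


def second_text_feature(target: str, category: str, code_sets: Dict[str, List[str]]) -> float:
    """Sum of the TF-IDF values of the target words for the given code set."""
    docs = build_documents(code_sets)
    all_docs = list(docs.values())
    return sum(tf_idf(word, docs[category], all_docs) for word in tokenize(target))


code_sets = {"X1": ["acute appendicitis"], "X3": ["pneumonia unspecified"]}
score_x1 = second_text_feature("acute appendicitis", "X1", code_sets)
score_x3 = second_text_feature("acute appendicitis", "X3", code_sets)
```

A target whose words are concentrated in one code set's document and rare elsewhere receives a high value for that code set and a value of zero for code sets that contain none of its words.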

In one possible implementation, when determining, for any code set, the third text statistical feature of the target object for the code set based on the target object and the disease names corresponding to the code set, the above first feature determination submodule is specifically configured to: calculate the edit distance between the target object and each disease name corresponding to the code set; and determine the third text statistical feature of the target object for the code set from the edit distances between the target object and the disease names corresponding to the code set.
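For illustration, the edit distance can be computed with the standard Levenshtein dynamic program, as in the sketch below. How the per-name distances are combined into a single feature is not stated in this passage; converting each distance to a length-normalized similarity and keeping the best value over the code set is an assumption of this sketch.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,         # delete char_a
                curr[j - 1] + 1,     # insert char_b
                prev[j - 1] + cost,  # substitute or match
            ))
        prev = curr
    return prev[-1]


def third_text_feature(target: str, disease_names: List[str]) -> float:
    """Assumed aggregation: best normalized similarity over the code set's disease names."""
    best = 0.0
    for name in disease_names:
        longest = max(len(target), len(name))
        similarity = 1.0 if longest == 0 else 1.0 - edit_distance(target, name) / longest
        best = max(best, similarity)
    return best
```

Normalizing by the longer string keeps the value in the range 0 to 1, which makes it easier to combine with the other text statistical features.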

In one possible implementation, when determining, for any code set, the fourth text statistical feature of the target object for the code set based on the target object and the disease names corresponding to the code set, the above first feature determination submodule is specifically configured to: obtain the attribute graph corresponding to the code set, where the attribute graph includes main words and attribute words, a main word being a keyword in the disease names corresponding to the code set and an attribute word being a qualifier of the main word; match the target object against the main words and attribute words in the attribute graph corresponding to the code set; and determine the fourth text statistical feature of the target object for the code set based on the result of matching the target object against the attribute graph corresponding to the code set.
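The attribute-graph matching might be sketched as follows. Representing the attribute graph as a mapping from each main word to its set of attribute words, tokenizing by whitespace, requiring the main word to be present before any attribute word counts, and the particular scoring formula are all assumptions of this sketch rather than details given in the present application.

```python
from typing import Dict, Set

# Assumed representation: main word -> set of attribute words that qualify it.
AttributeGraph = Dict[str, Set[str]]


def fourth_text_feature(target: str, graph: AttributeGraph) -> float:
    """Fraction of a main word and its attribute words that are found in the target."""
    words = set(target.lower().split())  # stand-in for word segmentation
    best = 0.0
    for main_word, attribute_words in graph.items():
        if main_word not in words:
            continue  # assumption: attribute words only count once the main word matches
        matched = len(attribute_words & words)
        # The main word contributes 1; each matched attribute word contributes 1.
        score = (1 + matched) / (1 + len(attribute_words))
        best = max(best, score)
    return best


graph = {"appendicitis": {"acute", "perforated"}}
match_score = fourth_text_feature("acute appendicitis", graph)
```

With this formula a target that names the keyword and all of its qualifiers scores 1, and a target that shares only qualifiers with the code set scores 0.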

In one possible implementation, the code fine-screening module 603 in the disease automatic coding system provided by the above embodiment may include: a semantic vector determination module and a code screening module.

The semantic vector determination module is configured to determine the semantic vector of the disease name corresponding to each candidate code based on semantic similarity information between the disease name corresponding to each candidate code and the target object.

In one possible implementation, the above semantic vector determination module may include: a weight determination submodule and a semantic vector determination submodule.

The weight determination submodule is configured to, for any candidate code, determine the semantic weight of each character in the disease name corresponding to the candidate code based on the semantic similarity between each character in that disease name and each character in the target object.

In one possible implementation, the above weight determination submodule is specifically configured to: determine the semantic vector of each character in the disease name corresponding to the candidate code and the semantic vector of each character in the target object; and, for any character in the disease name corresponding to the candidate code, calculate the similarity between the semantic vector of that character and the semantic vector of each character of the target object, and take the maximum of the calculated similarities as the semantic weight of that character, thereby obtaining the semantic weight of each character in the disease name corresponding to the candidate code.
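The maximum-similarity weighting described above can be sketched as follows. Cosine similarity is assumed as the similarity measure, and the character semantic vectors are taken as given inputs (how they are produced is outside this sketch). The second function, which combines character vectors and weights into a semantic vector for the disease name by a weighted sum, is likewise only one assumed way of using the weights.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def semantic_weights(name_vectors: List[Vector], target_vectors: List[Vector]) -> List[float]:
    """For each character of the disease name, the maximum similarity to any target character."""
    return [
        max((cosine(char_vec, target_vec) for target_vec in target_vectors), default=0.0)
        for char_vec in name_vectors
    ]


def weighted_name_vector(name_vectors: List[Vector], weights: List[float]) -> List[float]:
    """Assumed combination: weighted sum of the character vectors of the disease name."""
    if not name_vectors:
        return []
    dim = len(name_vectors[0])
    return [
        sum(weight * vec[k] for weight, vec in zip(weights, name_vectors))
        for k in range(dim)
    ]


# Toy two-dimensional character vectors; real vectors would come from an embedding model.
name_vectors = [[1.0, 0.0], [0.0, 1.0]]
target_vectors = [[1.0, 0.0], [1.0, 1.0]]
weights = semantic_weights(name_vectors, target_vectors)
name_vector = weighted_name_vector(name_vectors, weights)
```

Taking the maximum means a character in the disease name receives a high weight as soon as it is close to any single character of the target object, so characters unrelated to the target contribute little to the resulting name vector.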

In one possible implementation, the above code screening module is specifically configured to determine the score of each candidate code from the semantic vector of the disease name corresponding to that candidate code, and to determine the candidate code with the highest score as the code corresponding to the target object, where the score of any candidate code characterizes the degree of semantic similarity between the disease name corresponding to that candidate code and the target object.
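A minimal sketch of the final selection step is shown below. This passage does not state how the score is computed from the semantic vector; scoring each candidate by the cosine similarity between its disease-name vector and a semantic vector of the target object is an assumption of this sketch, and the codes `X1` and `X2` are hypothetical.

```python
import math
from typing import Dict, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_code(candidate_vectors: Dict[str, Vector], target_vector: Vector) -> Tuple[str, float]:
    """Score every candidate code and return the highest-scoring code with its score."""
    if not candidate_vectors:
        raise ValueError("candidate code set is empty")
    # Assumed scoring: similarity between the name vector and the target vector.
    scores = {code: cosine(vec, target_vector) for code, vec in candidate_vectors.items()}
    best_code = max(scores, key=scores.get)
    return best_code, scores[best_code]


# Toy semantic vectors for the disease names of two hypothetical candidate codes.
candidates = {"X1": [0.9, 0.1], "X2": [0.1, 0.9]}
best_code, best_score = select_code(candidates, target_vector=[1.0, 0.0])
```

Any scoring function that increases with semantic similarity could replace `cosine` here without changing the selection logic.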

The embodiments of the present application also provide disease automatic coding equipment. Referring to Fig. 7, which shows a structural schematic diagram of the equipment, the disease automatic coding equipment may include: at least one processor 701, at least one communication interface 702, at least one memory 703, and at least one communication bus 704.

In the embodiments of the present application, there is at least one each of the processor 701, the communication interface 702, the memory 703, and the communication bus 704, and the processor 701, the communication interface 702, and the memory 703 communicate with one another through the communication bus 704.

The processor 701 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the present invention.

The memory 703 may include high-speed RAM and may also include non-volatile memory, for example at least one magnetic disk memory.

The memory stores a program, and the processor can call the program stored in the memory, the program being used for:

Optionally, for the detailed functions and extended functions of the program, refer to the description above.

The embodiments of the present application also provide a readable storage medium, which can store a program suitable for execution by a processor, the program being used for:

Finally, it should also be noted that, herein, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes the element.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for parts that are the same or similar among the embodiments, reference may be made to one another.

The above description of the disclosed embodiments enables those skilled in the art to implement or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present application. Therefore, the present application is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A disease automatic coding method, characterized by comprising:

Screening out codes related to the target object from a disease code library, the screened-out codes forming a candidate code set;

Determining, from the candidate code set, the code corresponding to the target object based on the semantic relation between the target object and the disease name corresponding to each candidate code in the candidate code set.

2. The disease automatic coding method according to claim 1, characterized in that the screening out codes related to the target object from a disease code library comprises:

Determining, based on the target object and the disease names corresponding to each category of codes in the disease code library, a target text statistical feature of the target object for each category of codes, wherein the target text statistical feature of the target object for any category of codes characterizes the degree of correlation between that category of codes and the target object;

Screening out codes related to the target object from the disease code library based on the target text statistical features of the target object for the categories of codes.

3. The disease automatic coding method according to claim 2, characterized in that the determining, based on the target object and the disease names corresponding to each category of codes in the disease code library, a target text statistical feature of the target object for each category of codes comprises:

Classifying the codes in the disease code library to obtain multiple code sets, each code set corresponding to one code category;

Determining, based on the target object and the disease names corresponding to each code set, a first text statistical feature, and/or a second text statistical feature, and/or a third text statistical feature, and/or a fourth text statistical feature of the target object for each code set; wherein the disease names corresponding to any code set include the disease name corresponding to each code in that code set, and, for any code set, the first, second, third, and fourth text statistical features of the target object respectively characterize: the frequency with which each word in the target object appears in the disease names corresponding to the code set; the term frequency-inverse document frequency of each word in the target object in the document composed of the disease names corresponding to the code set; the text similarity between the target object and the disease names corresponding to the code set; and the degree of matching between the target object and the keywords and qualifiers in the disease names corresponding to the code set;

Determining the target text statistical feature of the target object for each category of codes based on the first text statistical feature, and/or the second text statistical feature, and/or the third text statistical feature, and/or the fourth text statistical feature of the target object for each code set.

4. The disease automatic coding method according to claim 3, characterized in that, for any code set, determining the first text statistical feature of the target object for the code set based on the target object and the disease names corresponding to the code set comprises:

Obtaining the weight of each word in a first word set, wherein the first word set is obtained by deduplicating a second word set, the second word set is the set of words obtained by performing word segmentation on the disease names corresponding to the code set, and the weight of each word in the first word set is determined by the number of times that word appears in the second word set;

Obtaining a target word set, and determining the weight of each word in the target word set based on the weights of the words in the first word set, wherein the target word set is the set of words obtained by performing word segmentation on the target object;

Determining the first text statistical feature of the target object for the code set from the weights of the words in the target word set.

5. The disease automatic coding method according to claim 3, characterized in that, for any code set, determining the second text statistical feature of the target object for the code set based on the target object and the disease names corresponding to the code set comprises:

Obtaining the disease document corresponding to the code set, the disease document being composed of the disease names corresponding to the code set;

Obtaining a target word set, and determining the term frequency-inverse document frequency of each word in the target word set with respect to the disease document corresponding to the code set, wherein the target word set is the set of words obtained by performing word segmentation on the target object;

Determining the second text statistical feature of the target object for the code set from the term frequency-inverse document frequencies of the words in the target word set with respect to the disease document corresponding to the code set.

6. The disease automatic coding method according to claim 3, characterized in that, for any code set, determining the third text statistical feature of the target object for the code set based on the target object and the disease names corresponding to the code set comprises:

Determining the third text statistical feature of the target object for the code set from the edit distances between the target object and the disease names corresponding to the code set.

7. The disease automatic coding method according to claim 3, characterized in that, for any code set, determining the fourth text statistical feature of the target object for the code set based on the target object and the disease names corresponding to the code set comprises:

Obtaining the attribute graph corresponding to the code set, wherein the attribute graph includes main words and attribute words, a main word being a keyword in the disease names corresponding to the code set and an attribute word being a qualifier of the main word;

Determining the fourth text statistical feature of the target object for the code set based on the result of matching the target object against the attribute graph corresponding to the code set.

8. The disease automatic coding method according to any one of claims 1 to 7, characterized in that the determining, from the candidate code set, the code corresponding to the target object based on the semantic relation between the target object and the disease name corresponding to each candidate code in the candidate code set comprises:

Determining the semantic vector of the disease name corresponding to each candidate code based on semantic similarity information between the disease name corresponding to each candidate code and the target object;

Determining the code corresponding to the target object from the candidate code set based on the semantic vectors of the disease names corresponding to the candidate codes.

9. The disease automatic coding method according to claim 8, characterized in that the determining the semantic vector of the disease name corresponding to each candidate code based on semantic similarity information between the disease name corresponding to each candidate code and the target object comprises:

For any candidate code:

Determining the semantic weight of each character in the disease name corresponding to the candidate code based on the semantic similarity between each character in that disease name and each character in the target object;

Determining the semantic vector of the disease name corresponding to the candidate code based on the semantic vector and the semantic weight of each character in that disease name.

10. The disease automatic coding method according to claim 9, characterized in that the determining the semantic weight of each character in the disease name corresponding to the candidate code based on the semantic similarity between each character in that disease name and each character in the target object comprises:

Determining the semantic vector of each character in the disease name corresponding to the candidate code and the semantic vector of each character in the target object;

For any character in the disease name corresponding to the candidate code, calculating the similarity between the semantic vector of that character and the semantic vector of each character of the target object, and taking the maximum of the calculated similarities as the semantic weight of that character, thereby obtaining the semantic weight of each character in the disease name corresponding to the candidate code.

11. The disease automatic coding method according to claim 8, characterized in that the determining the code corresponding to the target object from the candidate code set based on the semantic vectors of the disease names corresponding to the candidate codes comprises:

Determining the score of each candidate code from the semantic vector of the disease name corresponding to that candidate code, wherein the score of any candidate code characterizes the degree of semantic similarity between the disease name corresponding to that candidate code and the target object;

12. A disease automatic coding system, characterized by comprising: an obtaining module, a code coarse-screening module, and a code fine-screening module;

The code coarse-screening module is configured to screen out, from a disease code library, codes related to the target object, the screened-out codes forming a candidate code set;

The code fine-screening module is configured to determine, from the candidate code set, the code corresponding to the target object based on the semantic relation between the target object and the disease name corresponding to each candidate code in the candidate code set.

13. The disease automatic coding system according to claim 12, characterized in that the code coarse-screening module comprises: a feature determination module and a related-code screening module;

The feature determination module is configured to determine, based on the target object and the disease names corresponding to each category of codes in the disease code library, a target text statistical feature of the target object for each category of codes, wherein the target text statistical feature of the target object for any category of codes characterizes the degree of correlation between that category of codes and the target object;

The related-code screening module is configured to screen out codes related to the target object from the disease code library based on the target text statistical features of the target object for the categories of codes.

14. The disease automatic coding system according to claim 13, characterized in that the feature determination module comprises: a code classification submodule, a first feature determination submodule, and a second feature determination submodule;

The code classification submodule is configured to classify the codes in the disease code library to obtain multiple code sets, each code set corresponding to one code category;

The first feature determination submodule is configured to determine, based on the target object and the disease names corresponding to each code set, a first text statistical feature, and/or a second text statistical feature, and/or a third text statistical feature, and/or a fourth text statistical feature of the target object for each code set; wherein the disease names corresponding to any code set include the disease name corresponding to each code in that code set, and, for any code set, the first, second, third, and fourth text statistical features of the target object respectively characterize: the frequency with which each word in the target object appears in the disease names corresponding to the code set; the term frequency-inverse document frequency of each word in the target object in the document composed of the disease names corresponding to the code set; the text similarity between the target object and the disease names corresponding to the code set; and the degree of matching between the target object and the keywords and qualifiers in the disease names corresponding to the code set;

The second feature determination submodule is configured to determine the target text statistical feature of the target object for each category of codes based on the first text statistical feature, and/or the second text statistical feature, and/or the third text statistical feature, and/or the fourth text statistical feature of the target object for each code set.

15. The disease automatic coding system according to claim 14, characterized in that, when determining, for any code set, the first text statistical feature of the target object for the code set based on the target object and the disease names corresponding to the code set, the first feature determination submodule is specifically configured to:

Obtain the weight of each word in a first word set, wherein the first word set is obtained by deduplicating a second word set, the second word set is the set of words obtained by performing word segmentation on the disease names corresponding to the code set, and the weight of each word in the first word set is determined by the number of times that word appears in the second word set; obtain a target word set, and determine the weight of each word in the target word set based on the weights of the words in the first word set, wherein the target word set is the set of words obtained by performing word segmentation on the target object; and determine the first text statistical feature of the target object for the code set from the weights of the words in the target word set.

16. The disease automatic coding system according to claim 14, characterized in that, when determining, for any code set, the second text statistical feature of the target object for the code set based on the target object and the disease names corresponding to the code set, the first feature determination submodule is specifically configured to:

Obtain the disease document corresponding to the code set, the disease document being composed of the disease names corresponding to the code set; obtain a target word set, and determine the term frequency-inverse document frequency of each word in the target word set with respect to the disease document corresponding to the code set, wherein the target word set is the set of words obtained by performing word segmentation on the target object; and determine the second text statistical feature of the target object for the code set from the term frequency-inverse document frequencies of the words in the target word set with respect to the disease document corresponding to the code set.

17. The disease automatic coding system according to claim 14, characterized in that, when determining, for any code set, the third text statistical feature of the target object for the code set based on the target object and the disease names corresponding to the code set, the first feature determination submodule is specifically configured to:

Calculate the edit distance between the target object and each disease name corresponding to the code set; and determine the third text statistical feature of the target object for the code set from the edit distances between the target object and the disease names corresponding to the code set.

18. The disease automatic coding system according to claim 14, characterized in that, when determining, for any code set, the fourth text statistical feature of the target object for the code set based on the target object and the disease names corresponding to the code set, the first feature determination submodule is specifically configured to:

Obtain the attribute graph corresponding to the code set, wherein the attribute graph includes main words and attribute words, a main word being a keyword in the disease names corresponding to the code set and an attribute word being a qualifier of the main word; match the target object against the main words and attribute words in the attribute graph corresponding to the code set; and determine the fourth text statistical feature of the target object for the code set based on the result of matching the target object against the attribute graph corresponding to the code set.

19. The disease automatic coding system according to any one of claims 12 to 18, characterized in that the code fine-screening module comprises: a semantic vector determination module and a code screening module;

The semantic vector determination module is configured to determine the semantic vector of the disease name corresponding to each candidate code based on semantic similarity information between the disease name corresponding to each candidate code and the target object;

The code screening module is configured to determine the code corresponding to the target object from the candidate code set based on the semantic vectors of the disease names corresponding to the candidate codes.

20. The disease automatic coding system according to claim 19, characterized in that the semantic vector determination module comprises: a weight determination submodule and a semantic vector determination submodule;

The weight determination submodule is configured to, for any candidate code, determine the semantic weight of each character in the disease name corresponding to the candidate code based on the semantic similarity between each character in that disease name and each character in the target object;

The semantic vector determination submodule is configured to determine the semantic vector of the disease name corresponding to the candidate code based on the semantic vector and the semantic weight of each character in that disease name.

21. The disease automatic coding system according to claim 20, characterized in that the weight determination submodule is specifically configured to: determine the semantic vector of each character in the disease name corresponding to the candidate code and the semantic vector of each character in the target object; and, for any character in the disease name corresponding to the candidate code, calculate the similarity between the semantic vector of that character and the semantic vector of each character of the target object, and take the maximum of the calculated similarities as the semantic weight of that character, thereby obtaining the semantic weight of each character in the disease name corresponding to the candidate code.

22. The disease automatic coding system according to claim 19, characterized in that the code screening module is specifically configured to determine the score of each candidate code from the semantic vector of the disease name corresponding to that candidate code, and to determine the candidate code with the highest score as the code corresponding to the target object; wherein the score of any candidate code characterizes the degree of semantic similarity between the disease name corresponding to that candidate code and the target object.

23. Disease automatic coding equipment, characterized by comprising: a memory and a processor;

The memory is configured to store a program;

The processor is configured to execute the program to implement each step of the disease automatic coding method according to any one of claims 1 to 11.

24. A readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, each step of the disease automatic coding method according to any one of claims 1 to 11 is implemented.