CN114974602A

CN114974602A - Diagnostic coding method and system based on contrast learning

Info

Publication number: CN114974602A
Application number: CN202210581884.9A
Authority: CN
Inventors: 薛付忠; 张琪; 胡锡锋; 季晓康; 陈耀祖; 张健; 李平福; 王永超
Original assignee: Shandong University
Current assignee: Shandong University
Priority date: 2022-05-26
Filing date: 2022-05-26
Publication date: 2022-08-30

Abstract

The invention belongs to the technical field of medical data processing, and provides a diagnostic coding method and a diagnostic coding system based on contrast learning, wherein the method comprises the following steps: acquiring a plurality of clinical diagnosis codes and positive examples and negative examples thereof; training a contrast learning based diagnostic coding model; respectively obtaining vector representation for the pre-acquired clinical diagnosis name and the standard diagnosis name based on the model to form a diagnosis name vector representation library; obtaining the clinical diagnosis name of the code to be matched, and obtaining vector representation according to the model; and acquiring the most similar clinical diagnosis name/standard diagnosis name through the similarity between vector representations, wherein the corresponding standard code is the required standard code. The matching between the diagnostic name to be checked and the standard diagnostic name is converted into distance measurement in a public expression space, so that the matching efficiency is improved on the basis of ensuring the precision.

Description

Diagnostic coding method and system based on contrast learning

Technical Field

The invention belongs to the technical field of medical data processing, and particularly relates to a diagnostic coding method and system based on contrast learning.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

The diagnosis coding task maps the diagnosis part of a medical record to a standard diagnosis. Diagnosis coding plays an extremely important role in medical big data mining and analysis, case filing, DRGs medical insurance payment and the like. Previously, diagnosis coding was usually performed manually by hospital coders. This approach has limited efficiency, and coding quality suffers because coders differ in their understanding of the coding standard or in their skill level.

With the development of artificial intelligence, many researchers have approached ICD coding from the perspective of deep learning. The coding task nevertheless remains difficult for deep learning. For example, the ICD10 standard version has more than 20,000 standard codes, and the clinical version has even more; most codes correspond to rare diseases, so the data are extremely unbalanced. If each standard code is treated as one class and modeled as a multi-class classification problem, the predicted codes are biased toward the majority classes, classification accuracy is low, and millions of training samples are required. The data imbalance can be addressed with a short text matching scheme that splices a clinical diagnosis and a standard diagnosis together to form positive and negative training samples, but at prediction time the diagnosis to be matched must be combined with every possible standard diagnosis and each resulting sentence fed to the model, so prediction is slow and practical application is difficult.

Disclosure of Invention

To overcome the defects of the prior art, the invention provides a diagnostic coding scheme based on contrast learning, in which diagnoses are mapped into a representation space where identical diagnoses lie closer together and different diagnoses lie farther apart. Projecting into this representation space addresses the bias toward majority classes caused by unbalanced samples. Diagnosis similarity is measured by the vector distance in the projection space, so similar diagnoses can be found quickly. Cosine similarity is calculated between the clinical diagnosis to be matched and all similar diagnoses, including clinical diagnoses, which reduces errors caused by large differences between the aliases of a disease. To achieve the above purpose, one or more embodiments of the invention provide the following technical scheme:

a diagnostic coding method based on contrast learning comprises the following steps:

acquiring a clinical diagnosis name and a standard diagnosis name;

acquiring a positive example and a negative example of each clinical diagnosis name;

training a model based on a contrast learning framework according to the clinical diagnosis names and positive examples and negative examples thereof;

obtaining corresponding vector representation for all clinical diagnosis names and standard diagnosis names based on the model respectively to form a diagnosis name vector representation library;

acquiring a clinical diagnosis name of a code to be checked, and obtaining corresponding vector representation according to the model;

and acquiring the most similar clinical diagnosis name/standard diagnosis name based on a diagnosis name vector representation library according to the vector representation of the clinical diagnosis name of the code to be checked, wherein the corresponding standard code is the standard code of the clinical diagnosis name of the code to be checked.

Further, for each clinical diagnosis name, obtaining positive and negative examples thereof comprises:

after the clinical diagnosis names are obtained, for each clinical diagnosis name, constructing a clinical diagnosis name-standard diagnosis code-standard diagnosis name matching pair based on the standard codes;

in the plurality of matching pairs, for each clinical diagnosis name, performing similarity calculation with each standard diagnosis name and with the other clinical diagnosis names; for each clinical diagnosis name or standard diagnosis name whose similarity is higher than a set threshold, obtaining the standard codes of the matching pairs in which the two names are located; and marking the pair as a positive example if the standard codes are the same, and as a negative example if the standard codes are different.

Further, deriving a vector representation from the model comprises:

and inputting the clinical diagnosis names into the model, and performing mean pooling on the output to obtain vector representation.

Further, the model adopts a twin network-based contrast learning architecture, and the training method comprises the following steps:

for each clinical diagnosis name in the Batch, inputting the clinical diagnosis name, the positive example and the negative example into a model, and performing mean pooling on output to obtain corresponding vector representation of the three;

constructing a loss function based on the clinical diagnosis name and the vector representation similarity of the positive case and the vector representation similarity of the negative case:

wherein simp_{i,i} denotes the similarity between the vector representations of the i-th clinical diagnosis name and its positive example, simn_{i,j} denotes the similarity between the vector representations of the i-th clinical diagnosis name and the negative example indexed by j, τ denotes a temperature parameter, and N denotes the number of diagnoses in the batch;

and obtaining a trained model by taking the lowest Loss value as a model termination condition.

Further, obtaining the most similar clinical diagnosis name/standard diagnosis name includes:

according to the similarity between the diagnosis names, acquiring a plurality of clinical diagnosis names or standard diagnosis names which are most similar to the clinical diagnosis name of the code to be checked from a vector representation library;

and acquiring the most similar clinical diagnosis name or standard diagnosis name from the plurality of clinical diagnosis names or standard diagnosis names according to the similarity among the vector representations.

Further, the similarity measure between vector representations employs cosine similarity.
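For reference, the cosine similarity named above has the standard definition below, written for two E-dimensional vector representations u and v. The symbols u, v and E are introduced here only for illustration.

```latex
\operatorname{cos\_sim}(u, v)
  = \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}
  = \frac{\sum_{e=1}^{E} u_e v_e}
         {\sqrt{\sum_{e=1}^{E} u_e^{2}} \; \sqrt{\sum_{e=1}^{E} v_e^{2}}}
```

The value lies in [-1, 1], and a larger value indicates that two diagnosis names are closer in the representation space.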

One or more embodiments provide a contrast learning based diagnostic coding system, comprising:

the data acquisition module is used for acquiring a clinical diagnosis name and a standard diagnosis name;

the training data construction module is used for acquiring a positive example and a negative example of each clinical diagnosis name;

the model training module is used for training a code matching model based on a contrast learning framework;

the vector representation library generating module is used for obtaining corresponding vector representations of all clinical diagnosis names and standard diagnosis names in the plurality of matching pairs based on the model respectively to form a diagnosis name vector representation library;

the code to be checked name coding module is used for acquiring the clinical diagnosis name of the code to be checked and obtaining corresponding vector representation according to the model; and acquiring the most similar clinical diagnosis name/standard diagnosis name based on a diagnosis name vector representation library according to the vector representation of the clinical diagnosis name of the code to be checked, wherein the corresponding standard code is the standard code of the clinical diagnosis name of the code to be checked.

One or more embodiments provide an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the contrast learning based diagnostic encoding method when executing the program.

One or more embodiments provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the contrast learning-based diagnostic encoding method.

The above one or more technical solutions have the following beneficial effects:

By constructing a model, any clinical diagnosis name or standard diagnosis name can be given a vector representation, that is, mapped into a common representation space. Matching between the diagnosis name to be coded and the standard diagnosis names is thereby converted into a distance measurement in the common representation space, where identical diagnoses lie close together and different diagnoses lie far apart. Diagnosis similarity is measured by the vector distance in the common space, so similar diagnoses can be found quickly. Different diagnosis names are better distinguished through their vector representations, which safeguards matching accuracy. Meanwhile, projecting the diagnosis names into the common representation space addresses the bias toward majority classes caused by unbalanced samples.

By constructing a diagnosis name vector representation library, both clinical diagnosis names and standard diagnosis names are brought into the diagnosis name vector representation library to be used as comparison objects of clinical diagnosis names to be subjected to code comparison, so that errors caused by large difference among aliases of diseases can be effectively reduced; and the matching efficiency is improved by comparing the names first and then comparing the vector representation, and the standard codes of the clinical diagnosis names of the codes to be matched can be quickly found.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention. They illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention and not to limit it.

FIG. 1 is a flow diagram of a contrast learning based diagnostic coding method in one or more embodiments of the invention.

Detailed Description

It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components, and/or combinations thereof.

The embodiments and features of the embodiments of the present invention may be combined with each other without conflict.

Example one

The embodiment discloses a diagnostic coding method based on contrast learning, which comprises the following steps:

Step 1: Acquire a plurality of clinical diagnosis names together with their positive examples and negative examples, wherein the standard diagnosis name in a positive example has the same standard code as the clinical diagnosis name, and the standard diagnosis name in a negative example has a different standard code from the clinical diagnosis name.

The step 1 specifically comprises:

Step 1-1: For each clinical diagnosis name, a clinical diagnosis name-standard diagnosis code-standard diagnosis name matching pair is constructed based on the standard codes.

The clinical diagnosis names are obtained from electronic medical records. Specifically, a plurality of electronic medical records are acquired, the diagnosis data in them are extracted, and preprocessing is performed, wherein the preprocessing comprises splitting multiple diagnoses and removing erroneous diagnoses. It should be noted that the more electronic medical records are acquired, the more clinical diagnosis names can be extracted and the more of the different expressions of the same diagnosis can be captured. This helps provide enough training data and thereby improves the accuracy of the subsequent model.

After the clinical diagnosis names are obtained, for each clinical diagnosis name, a clinical diagnosis name-standard diagnosis code-standard diagnosis name matching pair is constructed on the basis of the standard codes. In this embodiment, the standard codes are those of the International Classification of Diseases (ICD-10) clinical version 2.0.

And after the matching pairs are obtained, storing the matching pairs into a data table. Specifically, the data table comprises a clinical diagnosis name field, a standard diagnosis code field and a standard diagnosis name field, and each matching pair is stored in the data table as a record. In this embodiment, an Elasticsearch database is used for data storage.

Step 1-2: in a plurality of the matching pairs, for each clinical diagnosis name, similarity calculation is performed with each standard diagnosis name and other clinical diagnosis names:

Based on the clinical diagnosis name-standard diagnosis code-standard diagnosis name matching pairs, for each clinical diagnosis name or standard diagnosis name whose similarity is higher than a set threshold, the standard code of the matching pair in which it is located is obtained. If the standard codes are the same, the pair is recorded as a positive example; if the standard codes are different, the pair is recorded as a negative example.
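The labeling rule of step 1-2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: `difflib.SequenceMatcher` stands in for the text similarity search, and the names `MatchPair`, `label_pairs` and the threshold value 0.5 are hypothetical.

```python
from difflib import SequenceMatcher
from typing import NamedTuple


class MatchPair(NamedTuple):
    """One record: clinical name, standard code, standard name (hypothetical type)."""
    clinical_name: str
    standard_code: str
    standard_name: str


def label_pairs(pairs, threshold=0.5):
    """Return (anchor, candidate, label) triples; label 1 = positive, 0 = negative."""
    # Every clinical or standard name is a candidate, tagged with its standard code.
    name_to_code = {}
    for p in pairs:
        name_to_code[p.clinical_name] = p.standard_code
        name_to_code[p.standard_name] = p.standard_code

    labeled = []
    for p in pairs:
        for candidate, code in name_to_code.items():
            if candidate == p.clinical_name:
                continue  # do not pair a name with itself
            # Assumption: character-level ratio replaces the similarity search.
            sim = SequenceMatcher(None, p.clinical_name, candidate).ratio()
            if sim > threshold:
                # Same standard code -> positive example, otherwise negative.
                labeled.append((p.clinical_name, candidate, int(code == p.standard_code)))
    return labeled
```

Candidates that are textually similar but carry a different standard code become hard negatives, which is what makes the threshold filter useful for training.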

Specifically, in this embodiment, the data table storing the diagnosis matching pairs in the Elasticsearch database is searched in match mode over the clinical diagnosis name column and the standard diagnosis name column. Each clinical diagnosis name (or standard diagnosis name) d_i is traversed, and the clinical diagnosis names (or standard diagnosis names) d_j with high similarity to d_i are obtained. If the standard diagnosis code dc_j corresponding to d_j is the same as the standard diagnosis code dc_i corresponding to d_i, the pair d_i-d_j is taken as a positive example; otherwise, d_i-d_j is taken as a negative example. For each clinical diagnosis name, at most 50 positive examples and 100 negative examples are taken. The resulting data set is shuffled and then split into a training set, a validation set and a test set in proportions of 0.7, 0.15 and 0.15.

Step 2: Train a model on the data set based on contrast learning.
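The per-name caps and the data set split described above could look like the sketch below. The function name `cap_and_split`, the triple layout `(anchor, candidate, label)` and the fixed random seed are assumptions made for illustration; only the caps (50 and 100) and the proportions (0.7, 0.15, 0.15) come from the text.

```python
import random


def cap_and_split(labeled, max_pos=50, max_neg=100,
                  ratios=(0.7, 0.15, 0.15), seed=0):
    """Cap examples per anchor, shuffle, and split into train/validation/test."""
    counts = {}  # anchor -> [number of positives kept, number of negatives kept]
    kept = []
    for anchor, candidate, label in labeled:
        per_anchor = counts.setdefault(anchor, [0, 0])
        slot = 0 if label == 1 else 1
        limit = max_pos if label == 1 else max_neg
        if per_anchor[slot] < limit:
            per_anchor[slot] += 1
            kept.append((anchor, candidate, label))

    # Shuffle with a fixed seed (an assumption) so the split is reproducible.
    random.Random(seed).shuffle(kept)
    n_train = round(len(kept) * ratios[0])
    n_val = round(len(kept) * ratios[1])
    return (kept[:n_train],
            kept[n_train:n_train + n_val],
            kept[n_train + n_val:])
```

The remainder after the training and validation slices goes to the test set, so the three parts always cover the capped data set exactly once.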

A training model is constructed using a contrast learning framework based on a twin network. A roberta-2-128 model is adopted as the base model of the twin network, and the pre-training result is taken as the initial parameters of the model. The original diagnosis d_i in a training sample is input into the model, and the hidden state output by the last layer of the model is X ∈ R^{B×S×E}, where B is the batch size, S is the maximum length of the input characters, and E is the dimensionality of the hidden layer. The compressed matrix X_m ∈ R^{B×E} is obtained by mean pooling. The positive and negative examples d_j corresponding to the original diagnosis are likewise input into the model and pooled to obtain Xp_m and Xn_m.

The specific formula of the mean pooling is as follows:
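A plain mean over the S token positions, consistent with the shapes X ∈ R^{B×S×E} and X_m ∈ R^{B×E} given above, can be written as below. This is a sketch under the assumption that every position is weighted equally; whether padding positions are masked out is not stated in the text.

```latex
X_m[b, e] = \frac{1}{S} \sum_{s=1}^{S} X[b, s, e],
\qquad b = 1, \dots, B, \quad e = 1, \dots, E
```

Each diagnosis name in the batch is thus reduced to a single E-dimensional vector, regardless of its length in characters.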

The cosine similarity between the mean pooling results of the original diagnoses and of the positive examples is calculated as simp = cos_sim(X_m, Xp_m), simp ∈ R^{B×B}. The cosine similarity is likewise calculated for the negative examples, simn = cos_sim(X_m, Xn_m). For any diagnosis, its corresponding positive example is the positive sample and its corresponding negative examples are the negative samples, and InfoNCE is used as the loss function for optimization. The specific formula of the loss function is as follows:

wherein simp_{i,i} denotes the similarity between the vector representations of the i-th clinical diagnosis name and its positive example, simn_{i,j} denotes the similarity between the vector representations of the i-th clinical diagnosis name and the negative example indexed by j, τ denotes the temperature parameter, and N denotes the number of diagnoses in the batch.
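The symbols defined above are consistent with an InfoNCE-style objective. The sketch below shows one common form of such a loss, in which the positive similarity simp[i][i] is contrasted against all negative similarities simn[i][j] of the same row. The exact denominator used in the embodiment is not given in this text, so this form, the function name `info_nce_loss` and the default temperature 0.05 are assumptions.

```python
import math


def info_nce_loss(simp, simn, tau=0.05):
    """InfoNCE-style loss over a batch (assumed form, not the patent's exact formula).

    simp: N x N similarities between diagnoses and positive examples;
          only the diagonal simp[i][i] is used as the positive term.
    simn: N x N similarities between diagnoses and negative examples.
    tau:  temperature parameter (default value is hypothetical).
    """
    n = len(simp)
    total = 0.0
    for i in range(n):
        pos = math.exp(simp[i][i] / tau)
        neg = sum(math.exp(s / tau) for s in simn[i])
        # Negative log-probability of picking the positive among all candidates.
        total += -math.log(pos / (pos + neg))
    return total / n
```

A lower temperature sharpens the softmax, so hard negatives with similarity close to the positive contribute more strongly to the loss.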

And obtaining a trained model by taking the lowest Loss value of the verification set as a model termination condition.

To improve the accuracy of the model, the data set used in model training includes, in addition to the clinical diagnosis names obtained in step 1 and their positive and negative examples, other samples. These may, for example, be samples that have been screened previously; their source is not limited herein.

Step 3: Input all clinical diagnosis names and standard diagnosis names into the model, and perform mean pooling on the output matrix to obtain a diagnosis name vector representation library.

The vector representation and its corresponding diagnostic name and standard diagnostic code are stored in the ES database.

Step 4: Acquire the clinical diagnosis name of the code to be checked, input it into the model, and perform mean pooling on the output matrix to obtain the vector representation of the clinical diagnosis name of the code to be checked.

Step 5: Acquire the most similar clinical diagnosis name/standard diagnosis name from the diagnosis name vector representation library according to the clinical diagnosis name of the code to be checked and its vector representation.

Specifically, the k diagnosis names most similar to the clinical diagnosis name of the code to be checked are retrieved from the diagnosis name field in the ES database, together with their corresponding vectors, and the cosine similarity between the vector of the clinical diagnosis name of the code to be checked and each of these k vectors is calculated. The standard code corresponding to the vector with the highest similarity is taken as the standard code of the diagnosis.
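The two-stage lookup of step 5 (recall by name, then rank by vector) can be sketched as follows. This is an illustration under assumptions: `difflib.SequenceMatcher` replaces the ES name search, the library is an in-memory list of dictionaries with the hypothetical keys `name`, `code` and `vector`, and `match_code` and `cosine` are hypothetical helper names.

```python
import math
from difflib import SequenceMatcher


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_code(query_name, query_vec, library, k=10):
    """Return (standard code, matched name) for the diagnosis name to be coded."""
    # Stage 1: recall the k entries whose names are textually closest to the query.
    candidates = sorted(
        library,
        key=lambda rec: SequenceMatcher(None, query_name, rec["name"]).ratio(),
        reverse=True,
    )[:k]
    # Stage 2: rank only those k candidates by cosine similarity of their vectors.
    best = max(candidates, key=lambda rec: cosine(query_vec, rec["vector"]))
    return best["code"], best["name"]
```

Restricting the vector comparison to k recalled candidates keeps the cost per query independent of the size of the library, which is the efficiency argument made in the text.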

Example two

Based on the method provided by the first embodiment, the present embodiment provides a diagnostic coding system based on contrast learning, including:

the training data construction module is used for acquiring the positive examples and the negative examples of each clinical diagnosis name;

the model training module is used for training a model based on a contrast learning framework;

the vector representation library generating module is used for respectively obtaining corresponding vector representations of all clinical diagnosis names and standard diagnosis names in the matching pairs based on the model to form a diagnosis name vector representation library;

Example three

The embodiment aims at providing an electronic device.

An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor executes the program to implement the contrast learning-based diagnostic encoding method according to the first embodiment.

Example four

An object of the present embodiment is to provide a computer-readable storage medium.

A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements a contrast learning-based diagnostic coding method according to a first embodiment.

The second to fourth embodiments correspond to the first embodiment of the method, and the detailed description thereof can be found in the related description of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media containing one or more sets of instructions; it should also be understood to include any medium that is capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any of the methods of the present invention.

One or more of the above embodiments can give any clinical diagnosis name or standard diagnosis name a vector representation by constructing the model, and at the same time provide a common representation space, so that direct comparison between diagnosis names is converted into distance measurement in the common representation space, thereby improving matching accuracy.

The clinical diagnosis names and the standard diagnosis names are all brought into a diagnosis name vector representation library to be used as comparison objects of the clinical diagnosis names of the codes to be compared, so that errors caused by large difference among the aliases of the diseases can be effectively reduced.

Those skilled in the art will appreciate that the modules or steps of the present invention described above can be implemented using general purpose computer means, or alternatively, they can be implemented using program code that is executable by computing means, such that they are stored in memory means for execution by the computing means, or they are separately fabricated into individual integrated circuit modules, or multiple modules or steps of them are fabricated into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.

Although the embodiments of the present invention have been described with reference to the accompanying drawings, this description is not intended to limit the scope of the present invention, and it should be understood by those skilled in the art that various modifications and variations can be made without inventive effort on the basis of the technical solution of the present invention.

Claims

1. A diagnostic coding method based on contrast learning is characterized by comprising the following steps:

acquiring a clinical diagnosis name and a standard diagnosis name;

training a model based on a contrast learning framework according to the clinical diagnosis name and a positive example and a negative example thereof;

acquiring a clinical diagnosis name of a code to be paired, and obtaining corresponding vector representation according to the model;

2. The contrast learning-based diagnostic coding method of claim 1, wherein for each clinical diagnostic code therein, obtaining positive and negative examples thereof comprises:

3. The contrast learning based diagnostic encoding method of claim 1, wherein deriving a vector representation from the model comprises:

and inputting the clinical diagnosis name, the standard diagnosis name or the clinical diagnosis name of the code to be checked into the model, and performing mean value pooling on the output to obtain vector representation.

4. The contrast learning based diagnostic coding method of claim 3, wherein the model employs a twin network based contrast learning architecture, the training method comprising:

for each clinical diagnosis name in the Batch, inputting the clinical diagnosis name into a model, and performing mean pooling on output to obtain vector representation of the clinical diagnosis name;

wherein simp_{i,i} denotes the similarity between the vector representations of the i-th clinical diagnosis name and its positive example, simn_{i,j} denotes the similarity between the vector representations of the i-th clinical diagnosis name and the negative example indexed by j, τ denotes a temperature parameter, and N denotes the number of diagnoses in the batch;

5. The contrast learning-based diagnostic coding method of claim 1, wherein obtaining the most similar clinical diagnosis name/standard diagnosis name comprises:

6. The contrast-learning based diagnostic coding method of claim 5, wherein the similarity measure between vector representations employs cosine similarity.

7. A contrast learning based diagnostic coding system, comprising:

the model training module is used for training a model based on a contrast learning framework according to the clinical diagnosis name and the positive example and the negative example thereof;

8. The contrast learning based diagnostic coding system of claim 7, wherein the model employs a twin network based contrast learning architecture, the training method comprising:

9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the contrast learning based diagnostic coding method of any of claims 1-6.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out a contrast learning based diagnostic coding method according to any one of claims 1 to 6.