CN117708339B - ICD automatic coding method based on pre-training language model - Google Patents
- Publication number
- CN117708339B (application CN202410165651.XA)
- Authority
- CN
- China
- Legal status: Active (an assumption by Google, not a legal conclusion)
Classifications
- Y02A90/10 — Information and communication technologies [ICT] supporting adaptation to climate change
Abstract
The embodiment of the invention provides an ICD automatic coding method based on a pre-training language model, which belongs to the technical field of data processing and specifically comprises the following steps: constructing an ICD automatic coding data set; forming a mapping set between ICD codes and code descriptions; constructing a prefix tree, and combining the prefix tree to form an LEDT model; dividing the ICD automatic coding data set into a training set and a verification set; segmenting the clinical texts in the training set and the verification set and their corresponding ICD codes; training the LEDT model using the seq2seq training data set; inputting an input text of a data set to be encoded into the target model, using the prefix tree to constrain the tokens generated during the decoding process of the target model, simultaneously retaining the k prediction descriptions with the highest output scores using the beam search algorithm, and finally converting the k output prediction descriptions into the corresponding ICD codes through the mapping set as the prediction output. The scheme of the invention improves coding efficiency, accuracy and adaptability.
Description
Technical Field
The embodiment of the invention relates to the technical field of data processing, in particular to an ICD automatic coding method based on a pre-training language model.
Background
Currently, the coding task of the International Classification of Diseases (ICD) refers to assigning the ICD codes corresponding to the medical entities in medical texts, and is of great significance in promoting the automation of medical research practice, improving coding quality, reducing the influence of human errors and subjective factors, assisting diagnosis and treatment, supporting Diagnosis Related Groups (DRGs), realizing intelligent medical-insurance fee-control systems, and the like.
Early coding tasks relied primarily on manual completion. According to statistics, an average coding expert needs more than 30 minutes to complete the coding task of one electronic medical record, which cannot meet the actual demand of rapidly growing medical data. Furthermore, the manual coding task requires an expert with sufficient background knowledge to read the medical record content carefully while reviewing relevant information, a process that is costly, inefficient and prone to error.
Therefore, an ICD automatic coding method based on a pre-training language model that is efficient, accurate and highly adaptable is needed.
Disclosure of Invention
In view of this, the embodiment of the invention provides an ICD automatic coding method based on a pre-training language model, which at least partially solves the problems of poor coding efficiency, accuracy and adaptability in the prior art.
The embodiment of the invention provides an ICD automatic coding method based on a pre-training language model, which comprises the following steps:
step 1, an ICD automatic coding data set is constructed according to an electronic medical record, wherein the ICD automatic coding data set comprises clinical texts and corresponding ICD codes;
Step 2, obtaining the code description corresponding to the ICD code from the ICD code description library, and forming a mapping set;
Step 3, word segmentation is carried out on the code descriptions to obtain ids sequences, a prefix tree is constructed according to the ids sequences, the input range and the attention field of view of the encoder are adjusted on the basis of a pre-training model, and the prefix tree is combined to form the LEDT model;
Step 4, dividing the ICD automatic coding data set into a training set and a verification set;
Step 5, respectively segmenting clinical texts in the training set and the verification set and corresponding ICD codes thereof to obtain text sequences and corresponding ICD code sequences thereof, and obtaining corresponding code descriptions through the mapping set by the ICD code sequences to form a seq2seq training data set and a seq2seq verification data set according to the text sequences and the corresponding ICD code sequences;
Step 6, training the LEDT model on the seq2seq training data set in teacher-forcing fashion, updating model parameters, inputting the seq2seq verification data set into the LEDT model after each training round is finished, and recording the model parameters when the loss is minimal;
And step 7, selecting the model parameters with minimal loss on the seq2seq verification data set to obtain a target model, inputting an input text of the data set to be encoded into the target model, and, in the decoding generation process of the target model, using the prefix tree to constrain the generated tokens so that every character string generated by the LEDT model belongs to the set of code descriptions, simultaneously using the beam search algorithm to retain the k prediction descriptions with the highest output scores, and finally using the mapping set to convert the k output prediction descriptions into the corresponding ICD codes as the prediction output.
According to a specific implementation manner of the embodiment of the present invention, the step 3 specifically includes:
Step 3.1, word segmentation is carried out on the code description, the code description is converted into an ids sequence in a pre-training language model, ids of a start symbol used in the generation process of the pre-training language model are added in front of the ids sequence, ids of an end symbol used in the generation process of the model are added at the tail of the ids sequence, and an object code ids sequence generated by the model is constructed;
Step 3.2, performing the operation on the ids sequences described by all codes, and constructing a prefix tree;
And 3.3, expanding the range of input data which can be processed by the encoder in the pre-training model, setting the field of view of the attention of the encoder, and forming the LEDT model by combining the prefix tree form.
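The prefix-tree construction of steps 3.1–3.2 can be sketched as follows. This is a minimal illustration, not the patented implementation; the token ids (101, 102, …) are invented placeholders, and 2 stands for the start/end symbol id used later in the description.

```python
def build_trie(id_sequences):
    """Nested-dict trie: each key is a token id, each value a subtree."""
    trie = {}
    for seq in id_sequences:
        node = trie
        for tok in seq:
            node = node.setdefault(tok, {})
    return trie

def allowed_next_tokens(trie, prefix):
    """Token ids that may legally follow `prefix` according to the trie."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return []          # prefix is not part of any code description
        node = node[tok]
    return sorted(node.keys())

# Two hypothetical target-code ids sequences, each wrapped with the
# start/end id 2 as described in step 3.1.
trie = build_trie([[2, 101, 102, 2], [2, 101, 103, 2]])
```

During decoding, `allowed_next_tokens` is exactly the constraint queried in step 7.5: any token it does not return is given zero probability.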
According to a specific implementation manner of the embodiment of the present invention, the step 6 specifically includes:
Step 6.1, at each time step, using the gold labels of the seq2seq training data set as the input of the LEDT model at the current moment; using a word segmentation device (tokenizer) to segment the clinical texts and the code descriptions corresponding to the ICD codes, converting them into the corresponding ids sequences, and using these as the inputs of the encoder and the decoder in the LEDT model;
step 6.2, the encoder encodes the ids sequence of the clinical text to obtain a context encoding vector corresponding to each time step of the clinical text;
step 6.3, inputting the corresponding context coding vector and ids of the code description of each time step into a decoder to obtain a prediction description, calculating a loss value between the prediction description and the code description, and updating model parameters of the LEDT model through back propagation;
At step 6.4, after each training round has ended, the seq2seq verification dataset is entered into the LEDT model, recording the model parameters at which the loss is minimal.
According to a specific implementation manner of the embodiment of the present invention, the step 6.3 specifically includes:
step 6.3.1, initializing the hidden state of the decoder in the previous time step to be the value of the context coding vector corresponding to the current time step;
Step 6.3.2, using the decoder hidden state of the previous time step and the segmented token of the gold code description as decoder input to update the decoder hidden state;
Step 6.3.3, sending the updated decoder hidden state into a linear layer network, and calculating the output of the current time step as a prediction description;
step 6.3.4, calculating a loss value between the prediction description of the current time step and the code description of the next time step, updating the model parameters of the LEDT model through back propagation, and returning to step 6.3.1 to execute the next time step until the prediction of all the time steps is completed, and judging that the current training round is finished;
And 6.3.5, after the current training round is completed, inputting the seq2seq verification data set into the LEDT model, and recording model parameters when the loss is minimum.
According to a specific implementation manner of the embodiment of the present invention, the step 7 specifically includes:
Step 7.1, inputting an input text in a data set to be encoded into a target model, segmenting words of the input text by a word segmentation device of the target model, converting the words into an ids sequence, and encoding the ids sequence by an encoder of the target model to obtain a context vector of the input text;
step 7.2, setting the first character generated by the decoder as a start character;
Step 7.3, the decoder hidden state and the prediction description of the previous time step are used as decoder input to update the decoder hidden state;
Step 7.4, sending the updated decoder hidden state into a linear layer network, and calculating the output of the current time step as a prediction description;
step 7.5, querying the prefix tree, and setting to zero the score probability of any token that does not belong to the prefix tree at the current position;
Step 7.6, retaining the k prediction descriptions with the highest scores by using the beam search algorithm;
And 7.7, repeating the steps 7.3 to 7.6 until all time step predictions are completed, obtaining k prediction descriptions, and converting the output k prediction descriptions into corresponding ICD codes by using a mapping set to serve as prediction output.
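Steps 7.2–7.7 amount to a beam search whose candidate tokens are restricted by the prefix tree. The following is a minimal pure-Python sketch under stated assumptions: the trie, token ids and the toy scoring function stand in for the real LEDT decoder and are invented for illustration.

```python
def subtree(trie, prefix):
    """Walk the prefix tree; return the subtree reachable via `prefix`."""
    node = trie
    for tok in prefix:
        node = node.get(tok)
        if node is None:
            return {}
    return node

def constrained_beam_search(log_prob, trie, k, start_id=2, end_id=2, max_len=8):
    """Step 7.5: only token ids present in the prefix tree under the current
    prefix are candidates (all others get probability zero); step 7.6: keep
    the k highest-scoring beams at every step."""
    beams = [([start_id], 0.0)]          # step 7.2: generation begins with the start id
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok in subtree(trie, seq):
                cand = (seq + [tok], score + log_prob(seq, tok))
                (finished if tok == end_id else candidates).append(cand)
        beams = sorted(candidates, key=lambda c: -c[1])[:k]
        if not beams:
            break
    return sorted(finished, key=lambda c: -c[1])[:k]

# Toy trie over three hypothetical code descriptions, and a toy scorer
# that prefers smaller token ids.
trie = {2: {101: {102: {2: {}}, 103: {2: {}}}, 104: {2: {}}}}
log_prob = lambda seq, tok: -tok / 1000.0
top2 = constrained_beam_search(log_prob, trie, k=2)
```

Because every finished hypothesis is a full path through the trie, each of the k outputs is exactly one stored code description and can be mapped back to its ICD code (step 7.7).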
The embodiment of the invention has the beneficial effects that:
1. the pre-training language model LEDT is used to carry out the ICD automatic coding task, so that the morphological variation of words in electronic medical records can be better handled.
2. The generative approach converts the ICD coding task from a multi-label classification task into a generation task; the ICD code descriptions are effectively utilized on the basis of the efficient denoising of the generation model, the interaction between the input text and the ICD code descriptions is enhanced, and finally the generation process of the LEDT model is constrained by the prefix tree, avoiding the out-of-vocabulary (OOV) problem during generation.
3. By using Longformer as the architecture of the pre-training language model, the global attention of the basic Transformer is replaced by window-based local attention, so that the model achieves a balance among computational complexity, memory overhead and model performance in the ICD coding task, and coding efficiency, accuracy and adaptability are improved.
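The local-plus-global attention pattern described in point 3 (and in FIG. 3) can be illustrated with a small boolean mask. The window radius and the choice of global position below are arbitrary illustrative values, not parameters from the patent:

```python
def longformer_attention_mask(seq_len, window, global_idx=(0,)):
    """mask[i][j] is True iff token i may attend to token j: a sliding
    local window of radius `window`, plus symmetric global attention for
    the positions in `global_idx` (e.g. the [CLS] token)."""
    mask = [[abs(i - j) <= window for j in range(seq_len)]
            for i in range(seq_len)]
    for g in global_idx:
        for j in range(seq_len):
            mask[g][j] = True   # the global token attends to everyone
            mask[j][g] = True   # everyone attends to the global token
    return mask

# 10 tokens, window radius 2, global attention at position 0 ([CLS]).
m = longformer_attention_mask(10, 2)
```

Each row has O(window) entries outside the global positions, which is the source of the linear (rather than quadratic) attention cost that point 3 refers to.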
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of an ICD automatic coding method based on a pre-training language model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a specific implementation flow of an ICD automatic coding method based on a pre-training language model according to an embodiment of the present invention;
fig. 3 is a schematic diagram of local and global attention in the LEDT model according to an embodiment of the present invention;
fig. 4 is a schematic diagram of the generation process of the prefix tree combined with the LEDT model according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating statistics of data set lengths used in an embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes the embodiments of the present invention with reference to specific examples. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. The invention may also be practiced or carried out in other embodiments, and the details of the present description may be modified or varied without departing from the spirit and scope of the present invention. It should be noted that the following embodiments and the features in the embodiments may be combined with each other without conflict. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It is noted that various aspects of the embodiments are described below within the scope of the following claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the present disclosure, one skilled in the art will appreciate that one aspect described herein may be implemented independently of any other aspect, and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. In addition, such apparatus may be implemented and/or such methods practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
It should also be noted that the illustrations provided in the following embodiments merely illustrate the basic concept of the present invention by way of illustration, and only the components related to the present invention are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.
In addition, in the following description, specific details are provided in order to provide a thorough understanding of the examples. However, it will be understood by those skilled in the art that the aspects may be practiced without these specific details.
Currently, the following challenges and opportunities exist in the ICD auto-coding field:
(1) Multi-label classification task: mainstream researchers treat the ICD automatic coding task as a multi-label classification task, but the huge label space of ICD codes makes it difficult for a model to accurately capture the relationship between ICD codes and fragments of the input text.
(2) Characteristics of medical natural language processing: the ICD automatic coding task belongs to the field of medical natural language processing and involves problems such as inconsistent word morphology and differences in writing style. Medical text may contain a large number of abbreviations, colloquialisms, synonyms, polysemous words and even misspellings; a model can improve its ability to understand textual context by pre-training on a large-scale general corpus.
(3) Challenges of long-text processing: the clinical text input in ICD automatic coding tasks often exceeds the processing range of basic pre-trained language models, whose computational complexity and memory overhead rise steeply when processing long text (quadratically in the input length, for fully connected self-attention) because of the fully connected attention mechanism and deep network characteristics. Thus, current ICD automatic coding work still tends to use convolutional neural networks (Convolutional Neural Network, CNN) and recurrent neural networks (Recurrent Neural Network, RNN) or their variants to process the input text, and makes less use of pre-trained language models. However, CNN-based models have difficulty accurately capturing the complex relationships between input text and labels, while RNN-based models can suffer from forgetting problems when processing long text, which leaves room for improvement in ICD automatic coding tasks.
(4) Challenge of noise handling: the input text in an ICD automatic coding task often contains a lot of noise, which presents challenges to the coding task. Generative pre-training language models with a Transformer encoder-decoder structure (such as BART) inject noise into the original text during pre-training and then restore the original text through the decoder, thereby learning the knowledge in the original text. The BART model shows excellent performance in tasks such as question-answering systems, abstract generation and machine translation, and its autoregressive way of generating text can better capture the interaction relationship between the input text and the target sentence.
The embodiment of the invention provides an ICD automatic coding method based on a pre-training language model, which can be applied to the coding process of electronic medical records in medical scenes.
Referring to fig. 1, a flowchart of an ICD automatic coding method based on a pre-training language model according to an embodiment of the present invention is shown. As shown in fig. 1 and 2, the method mainly comprises the following steps:
step 1, an ICD automatic coding data set is constructed according to an electronic medical record, wherein the ICD automatic coding data set comprises clinical texts and corresponding ICD codes;
In specific implementation, electronic medical record information can be collected. For one treatment record $i$ of a patient, the relevant diagnostic information comprises: admission diagnosis (Admission Diagnosis) $AD_i$, discharge diagnosis (Discharge Diagnosis) $DD_i$, surgical name (Procedure Name) $PN_i$, preoperative diagnosis (Preoperative Diagnosis) $PreD_i$, postoperative diagnosis (Postoperative Diagnosis) $PostD_i$, medical order (Medical Order) $MO_i$, etc. The relevant diagnostic information of the $i$-th visit is combined into one visit document (Document) $DOC_i$, and at the same time the ICD codes involved in $DOC_i$ are collected as a label vector $y_i \in \{0,1\}^{|L|}$: when $DOC_i$ involves code $l_j$, $y_{ij} = 1$; otherwise $y_{ij} = 0$. Here $L$ denotes the ICD code space of this task (hereinafter, unless otherwise specified, $|A|$ refers to the number of elements in a set $A$). Therefore, the ICD automatic coding task in this embodiment can be described as: for an input text $DOC_i$, predict the set of all ICD codes it involves, $Y_i = \{\, l_j \in L : y_{ij} = 1 \,\}$.
Step 2, obtaining the code description corresponding to the ICD code from the ICD code description library, and forming a mapping set;
In particular implementations, a textual description of each ICD code referred to in the dataset may be obtained from an ICD code description library, and a one-to-one (ICD code, ICD code description) mapping set is constructed; considering the one-to-one correspondence between ICD codes and code descriptions, this mapping is invertible.
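A minimal sketch of the invertible mapping set of step 2; the two entries are illustrative examples, not taken from an actual ICD code description library:

```python
# Hypothetical one-to-one (ICD code, ICD code description) mapping set.
code2desc = {
    "C16.9": "malignant neoplasm of stomach, unspecified",
    "C77.2": "secondary malignant neoplasm of intra-abdominal lymph nodes",
}
# Because the mapping is one-to-one, it can be reversed to recover the
# code from a generated description (as done in step 7.7).
desc2code = {desc: code for code, desc in code2desc.items()}
```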
Step 3, word segmentation is carried out on the code descriptions to obtain ids sequences, a prefix tree is constructed according to the ids sequences, the input range and the attention field of view of the encoder are adjusted on the basis of a pre-training model, and the prefix tree is combined to form the LEDT model;
on the basis of the above embodiment, the step 3 specifically includes:
Step 3.1, word segmentation is carried out on the code description, the code description is converted into an ids sequence in a pre-training language model, ids of a start symbol used in the generation process of the pre-training language model are added in front of the ids sequence, ids of an end symbol used in the generation process of the model are added at the tail of the ids sequence, and an object code ids sequence generated by the model is constructed;
Step 3.2, performing the operation on the ids sequences described by all codes, and constructing a prefix tree;
And 3.3, expanding the range of input data which can be processed by the encoder in the pre-training model, setting the field of view of the attention of the encoder, and forming the LEDT model by combining the prefix tree form.
In specific implementation, all the code descriptions obtained in step 2 are segmented into words and converted into ids sequences of the pre-training language model; the id of the start symbol used in the model's generation process is added before each ids sequence, and the id of the end symbol used in the generation process is added at its end, constructing the target-code ids sequences to be generated by the model. This operation is performed on the ids sequences of all code descriptions, and the prefix tree (Trie) is constructed.
On this basis, given that no checkpoint for the LEDT model in the Chinese medical domain has been found at present, modifications can be made to the Chinese BART pre-training model: the range of input data that the encoder can process is expanded and the attention field of view is restricted to reduce computational complexity, forming the LED; finally, combining the prefix tree yields the LEDT model. As shown in FIG. 3, Longformer serves as the architecture of the pre-training model, giving global attention to the token [CLS] so that it can attend to all other tokens and all other tokens can attend to it, while the other tokens can only attend to nearby tokens. Since Longformer blocks are stacked, when the number of stacked Longformer layers is large enough, the receptive field of the top-layer tokens that are only given local attention is also large enough; this attention computation lets the pre-training language model reach a balance among computational complexity, memory overhead and model performance in the ICD coding task.
Taking the ICD code description "secondary malignant tumor of gastric lymph nodes" as an example: the description is segmented into words and converted into its corresponding ids sequence (the concrete tokens and ids are not reproduced here). In the present pre-training language model, the start symbol used in the generation process and the end symbol of the generation process are the same special token, whose id is 2. The target-code ids sequence of "secondary malignant tumor of gastric lymph nodes" is therefore its ids sequence with the id 2 added at both the beginning and the end.
Step 4, dividing the ICD automatic coding data set into a training set and a verification set;
In particular, the data set collected in step 1 may be randomly divided into a training set (Train set) and a verification set (Dev set) at a ratio of 9:1, and the corresponding index sets (Train set index and Dev set index) are established for the subsequent operation flow.
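The 9:1 random split of step 4 can be sketched as follows; the fixed seed is an arbitrary choice for reproducibility, not a value from the patent:

```python
import random

def split_indices(n, dev_ratio=0.1, seed=0):
    """Shuffle indices 0..n-1 and split them into Train/Dev index sets."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_dev = int(n * dev_ratio)
    return idx[n_dev:], idx[:n_dev]   # (Train set index, Dev set index)
```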
Step 5, respectively segmenting clinical texts in the training set and the verification set and corresponding ICD codes thereof to obtain text sequences and corresponding ICD code sequences thereof, and obtaining corresponding code descriptions through the mapping set by the ICD code sequences to form a seq2seq training data set and a seq2seq verification data set according to the text sequences and the corresponding ICD code sequences;
In particular, the data in the training set and the verification set are converted into a seq2seq training data set and a seq2seq verification data set. Taking the training set as an example, for each piece of data $(DOC_i, Y_i)$ of the training set, the following operation is performed: $(DOC_i, Y_i)$ is divided according to its ICD codes, forming one training sample $(DOC_i, l_j)$ for every code $l_j \in Y_i$. When $y_{ij} = 1$, $l_j$ is converted into its ICD code description $desc_j$ through the (ICD code, ICD code description) mapping set, forming the final seq2seq training data set seq2seq Train set, whose data take the form $(DOC_i, desc_j)$.
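Step 5's conversion of a multi-label record into per-code seq2seq pairs can be sketched as follows; the documents and the mapping entries are invented placeholders:

```python
def to_seq2seq(records, code2desc):
    """Split each (document, codes) record into one (document, description)
    training pair per ICD code, via the (code, description) mapping set."""
    return [(doc, code2desc[code])
            for doc, codes in records
            for code in codes]

# Hypothetical records and mapping set.
records = [("doc one", ["A00", "B01"]), ("doc two", ["A00"])]
code2desc = {"A00": "cholera", "B01": "varicella"}
pairs = to_seq2seq(records, code2desc)
```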
Step 6, training the LEDT model on the seq2seq training data set in teacher-forcing fashion, updating model parameters, inputting the seq2seq verification data set into the LEDT model after each training round is finished, and recording the model parameters when the loss is minimal;
on the basis of the above embodiment, the step 6 specifically includes:
Step 6.1, at each time step, using the gold labels of the seq2seq training data set as the input of the LEDT model at the current moment; using a word segmentation device (tokenizer) to segment the clinical texts and the code descriptions corresponding to the ICD codes, converting them into the corresponding ids sequences, and using these as the inputs of the encoder and the decoder in the LEDT model;
step 6.2, the encoder encodes the ids sequence of the clinical text to obtain a context encoding vector corresponding to each time step of the clinical text;
step 6.3, inputting the corresponding context coding vector and ids of the code description of each time step into a decoder to obtain a prediction description, calculating a loss value between the prediction description and the code description, and updating model parameters of the LEDT model through back propagation;
At step 6.4, after each training round has ended, the seq2seq verification dataset is entered into the LEDT model, recording the model parameters at which the loss is minimal.
Further, the step 6.3 specifically includes:
step 6.3.1, initializing the hidden state of the decoder in the previous time step to be the value of the context coding vector corresponding to the current time step;
Step 6.3.2, using the decoder hidden state of the previous time step and the segmented token of the gold code description as decoder input to update the decoder hidden state;
Step 6.3.3, sending the updated decoder hidden state into a linear layer network, and calculating the output of the current time step as a prediction description;
step 6.3.4, calculating a loss value between the prediction description of the current time step and the code description of the next time step, updating the model parameters of the LEDT model through back propagation, and returning to step 6.3.1 to execute the next time step until the prediction of all the time steps is completed, and judging that the current training round is finished;
And 6.3.5, after the current training round is completed, inputting the seq2seq verification data set into the LEDT model, and recording model parameters when the loss is minimum.
In a specific implementation, in the model training stage the LEDT model is trained in the seq2seq mode. In this embodiment, training uses teacher forcing: in the process of training the seq2seq model, each time step does not use the output of the previous moment as the input of the current moment, but directly uses the standard-answer label of the training data as the input of the current moment, so that the model learns the conversion from an input clinical document DOC to the ICD code description (desc) corresponding to that document. For each document DOC_i and its ICD description desc_i, the tokenizer segments them and converts them into ids, which serve as the inputs of the LED encoder and the LED decoder respectively. The LED encoder encodes the tokenized ids of the document to obtain a context encoding vector for DOC: C = Encoder(ids(DOC_i)),
where ids(·) denotes tokenizing the input and converting it into a sequence of token ids. In the decoding process, the generated tokens are aligned with the ids of the tokenized code description desc_i, and training uses the teacher-forcing form. The hidden state of the decoder is first initialized to the value of the context: h_0 = C.
Assume that the result of tokenizing desc_i and converting it into ids is y = (y_1, y_2, …, y_T). For each time step t, the following operations are performed:
Step 6.1: update the hidden state of the LED decoder, taking the decoder hidden state at the previous moment and the token of the standard-answer description as the decoder inputs: h_t = Decoder(h_{t-1}, y_{t-1}).
Step 6.2: send the decoder hidden state into a linear layer network and calculate the output of the current time step: ŷ_t = Linear(h_t).
Step 6.3: compute the loss between the predicted output ŷ_t and the gold token y_t using the cross-entropy function.
When the seq2seq training dataset is fed into the model, after step 6.3 computes the loss between ŷ_t and y_t, the parameters of the model are updated by error back propagation, optimizing model training. After each training round ends, the seq2seq verification dataset is fed into the model; after step 6.3 computes the loss between ŷ_t and y_t, the loss of the model over the entire seq2seq verification dataset is calculated, and the model parameters at which this loss is minimum are saved.
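A minimal pure-Python sketch of the teacher-forcing loop of steps 6.1–6.3. The "decoder" and "linear layer" below are toy numeric stand-ins for the LED components (the real model is a neural network); only the control flow — feed the gold token y_{t-1}, score the gold token y_t — reflects the method described above:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(probs, gold_id):
    return -math.log(probs[gold_id])

VOCAB = 5  # toy vocabulary size

def decoder_step(hidden, prev_token):
    """Toy stand-in for h_t = Decoder(h_{t-1}, y_{t-1})."""
    return [(h + prev_token) * 0.5 for h in hidden]

def linear(hidden):
    """Toy stand-in for the linear layer producing per-token logits."""
    return [sum(hidden) * (i + 1) * 0.1 for i in range(VOCAB)]

def teacher_forcing_loss(context, gold_ids):
    """One training pass over a single example, averaged over time steps."""
    hidden = context[:]                # h_0 initialized to the context vector
    total = 0.0
    for t in range(1, len(gold_ids)):
        hidden = decoder_step(hidden, gold_ids[t - 1])  # teacher forcing
        probs = softmax(linear(hidden))
        total += cross_entropy(probs, gold_ids[t])
    return total / (len(gold_ids) - 1)

# gold_ids: BOS, description tokens, EOS (ids are illustrative)
loss = teacher_forcing_loss(context=[0.1, 0.2], gold_ids=[0, 3, 4, 1])
```

In an actual implementation the loss would be backpropagated through the LEDT parameters; here the loop simply returns the scalar loss value.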
Step 7: select the model parameters with the smallest loss on the seq2seq verification dataset to obtain the target model, and input an input text from the dataset to be coded (denote this dataset the Test set) into the target model. In the decoding generation process of the target model, a prefix tree constrains the generated tokens, so that the character strings generated by the LEDT model are subsets of the code descriptions; at the same time, a beam search algorithm retains the k prediction descriptions with the highest output scores, and finally the mapping set converts the k output prediction descriptions into the corresponding ICD codes as the prediction output.
On the basis of the above embodiment, the step 7 specifically includes:
Step 7.1, inputting an input text in a data set to be encoded into a target model, segmenting words of the input text by a word segmentation device of the target model, converting the words into an ids sequence, and encoding the ids sequence by an encoder of the target model to obtain a context vector of the input text;
step 7.2, setting the first character generated by the decoder as a start character;
Step 7.3, the decoder hidden state and the prediction description of the previous time step are used as decoder input to update the decoder hidden state;
Step 7.4, sending the updated decoder hidden state into a linear layer network, and calculating the output of the current time step as a prediction description;
step 7.5, querying the prefix tree, and setting to zero the score probability of any token of the prediction description that does not belong to the prefix tree;
Step 7.6, retaining the k highest-scoring prediction descriptions using a beam search algorithm;
And 7.7, repeating the steps 7.3 to 7.6 until all time step predictions are completed, obtaining k prediction descriptions, and converting the output k prediction descriptions into corresponding ICD codes by using a mapping set to serve as prediction output.
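Steps 7.5–7.6 rely on the prefix tree built over the id sequences of all code descriptions. A minimal sketch of such a trie and of the token-masking query used to keep generation inside the set of valid descriptions; the helper names and token ids are illustrative, not the patent's implementation:

```python
# Minimal prefix tree over code-description id sequences, plus the query used
# in step 7.5 to zero out tokens that cannot extend any valid description.

def build_trie(id_sequences):
    root = {}
    for seq in id_sequences:
        node = root
        for tok in seq:
            node = node.setdefault(tok, {})
    return root

def allowed_next(trie, prefix):
    """Return the set of token ids that may legally follow `prefix`."""
    node = trie
    for tok in prefix:
        if tok not in node:
            return set()   # prefix is not part of any code description
        node = node[tok]
    return set(node.keys())

def mask_scores(scores, trie, prefix):
    """Step 7.5: zero the score probability of tokens not in the prefix tree."""
    allowed = allowed_next(trie, prefix)
    return [s if tok in allowed else 0.0 for tok, s in enumerate(scores)]

# Three toy code descriptions as id sequences (ids are illustrative).
trie = build_trie([[5, 6, 7], [5, 6, 8], [9, 7]])
masked = mask_scores([0.1] * 10, trie, [5])   # only token 6 can follow [5]
```

Because every generated prefix is forced to be a path in the trie, the decoded string is always a prefix of some real code description, which is exactly why the OOV problem cannot occur.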
In implementation, as shown in fig. 4, the model loading of step 6 uses the parameters at which the verification-set loss is minimum. In the model test and generation stage, the LEDT model combined with a prefix tree (Trie) generates ICD disease code descriptions, and beam search is combined to realize the multi-label classification task. For each document DOC_i in the dataset to be encoded, the tokenizer converts it into ids, which serve as the input of the encoder in the LEDT model. The encoder encodes the tokenized ids of the document to obtain the context encoding vector for DOC: C = Encoder(ids(DOC_i)).
The hidden state of the decoder is initialized to the value of the context: h_0 = C.
Assume that the first token generated by the LEDT model is the start token ŷ_0.
For each time step t, the following operations are performed:
Step 7.1: update the decoder hidden state, taking the decoder hidden state and the prediction result at the previous moment as the decoder inputs: h_t = Decoder(h_{t-1}, ŷ_{t-1}).
Step 7.2: send the decoder hidden state into a linear layer network Linear and calculate the output of the current time step: ŷ_t = Linear(h_t).
Step 7.3: query the prefix tree Trie and set to zero the score probability of any token that does not belong to the prefix tree.
Step 7.4: retain the most likely topk output sequences using a beam search algorithm.
After the processing of steps 7.1–7.4, the topk ICD code descriptions for DOC_i have been obtained. The (ICD code, ICD code description) mapping is then used to translate the generated topk ICD code descriptions into the topk code predictions for DOC_i.
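The beam search of step 7.4, which keeps the topk partial sequences at each step, can be sketched as follows. The per-step candidate probabilities are a toy stand-in; in the method above they would come from the LEDT decoder after prefix-tree masking:

```python
import math

def beam_search(candidates_per_step, k):
    """Toy beam search: at each step every beam may extend with any candidate
    token; keep only the k sequences with the highest total log-probability."""
    beams = [((), 0.0)]                      # (token sequence, log-prob)
    for candidates in candidates_per_step:   # e.g. {token_id: probability}
        expanded = []
        for seq, logp in beams:
            for tok, p in candidates.items():
                expanded.append((seq + (tok,), logp + math.log(p)))
        expanded.sort(key=lambda b: b[1], reverse=True)
        beams = expanded[:k]                 # retain the topk output sequences
    return beams

steps = [{1: 0.6, 2: 0.4}, {3: 0.9, 4: 0.1}]
top2 = beam_search(steps, k=2)   # two highest-scoring length-2 sequences
```

In the full method, each surviving sequence is a complete ICD code description, and the topk survivors are mapped back to topk ICD code predictions.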
According to the ICD automatic coding method based on a pre-training language model, using the pre-training language model LEDT for the ICD automatic coding task better handles the morphological variation of words in electronic medical records. Converting the ICD coding task from a multi-label classification task into a generation task, on the basis of the efficient denoising of the generation model, effectively utilizes the ICD code descriptions and strengthens the interaction between the input text and the ICD code descriptions; finally, combining a prefix tree to constrain the generation process of the LEDT model avoids the problem of unregistered (out-of-vocabulary) words produced during generation. Using Longformer as the architecture of the pre-training language model replaces the global attention of the basic Transformer with window-based local attention, so that the model strikes a balance between computational complexity, memory overhead and model performance in the ICD coding task, improving coding efficiency, accuracy and adaptability.
The invention will be further described with reference to a specific example.
1, Data description: the dataset used to verify this experiment comes from the clinical diagnosis coding task (Task 5) of the CHIP2022 evaluation, and its length statistics are shown in fig. 5. Given the relevant diagnostic information of one visit (including admission diagnosis, preoperative diagnosis, postoperative diagnosis and discharge diagnosis), as well as the operation names, drug names and order names, the task requires giving the corresponding ICD codes from the vocabulary of "Classification and Codes of Diseases, National Clinical Version 2.0". All the medical data come from real medical data and are annotated with the "Classification and Codes of Diseases, National Clinical Version 2.0" vocabulary as the standard.
1.1, The numbers of documents in the training set, test set and validation set, together with the maximum, minimum and average numbers of labels per document, are shown in Table 1:
TABLE 1
1.2, The ICD code descriptions in this example come from the vocabulary of "Classification and Codes of Diseases, National Clinical Version 2.0".
2, Sources of pre-training language model parameters and LED construction: since we did not find publicly released LED pre-training language model parameters for Chinese, this embodiment uses a script to transform BART, which is also a seq2seq model, expanding the longest text length it can handle and adding a local attention mechanism to it, thereby forming an LED. The pre-trained language model in this embodiment comes from a preset model database.
3, Evaluation indices: the ICD automatic coding task is a multi-label classification task, and its common evaluation indices are used: micro_F1, macro_F1, micro_AUC, macro_AUC and P@K; in this experiment k = 1, 2, 5. In particular, in the ablation experiments, since taking a generation model as the framework basis may lead to OOV problems, we add len(OOV) as an evaluation index, where len(OOV) denotes the total number of unregistered (out-of-vocabulary) words produced during model generation.
3.1, micro_F1 and macro_F1: the ICD coding task is a multi-label classification task, and micro_F1 and macro_F1 are performance indices for evaluating a model on multi-label classification. micro_F1 pools the precision and recall of all labels and unifies them into one overall index; macro_F1 averages the per-label precision and recall. Together, these two indices measure the model's overall classification performance across all ICD labels and its classification performance on individual labels.
3.2, micro_AUC and macro_AUC: AUC (Area Under the ROC Curve) is a commonly used index for measuring the classification quality of a model. In the ICD coding task, micro_AUC combines all predicted labels into one whole and calculates its AUC value; macro_AUC calculates, for each label, the AUC from the true labels and the predicted probabilities, and averages the AUCs of all labels. These two indices evaluate the model's overall prediction quality and the variability between individual labels.
3.3, P@K: for ICD coding tasks, P@K evaluates the average accuracy of the model's top-k predictions. It assesses the ranking quality of the model's prediction set and helps select the top-ranked prediction results. This is useful for ICD coding, as it can provide practitioners with a reference to the most relevant ICD codes.
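The multi-label metrics of sections 3.1 and 3.3 can be computed with set semantics as sketched below; variable names and the toy data are illustrative:

```python
def micro_f1(gold_sets, pred_sets):
    """micro_F1: pool true positives, false positives and false negatives
    over all labels and all documents, then compute a single F1."""
    tp = sum(len(g & p) for g, p in zip(gold_sets, pred_sets))
    fp = sum(len(p - g) for g, p in zip(gold_sets, pred_sets))
    fn = sum(len(g - p) for g, p in zip(gold_sets, pred_sets))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def p_at_k(gold_sets, ranked_preds, k):
    """P@K: fraction of the top-k ranked predictions that are correct,
    averaged over all documents."""
    scores = [len(set(r[:k]) & g) / k for g, r in zip(gold_sets, ranked_preds)]
    return sum(scores) / len(scores)

gold = [{"A", "B"}, {"C"}]        # gold ICD codes per document
preds = [{"A", "B", "C"}, {"C"}]  # predicted code sets
ranked = [["A", "C"], ["C", "D"]] # predictions ranked by score
```

macro_F1 would instead compute a per-label F1 and average over labels; the same set operations apply label by label.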
Comparison with other methods:
The experiment selects advanced models in a plurality of ICD automatic coding tasks as a baseline model:
4.1, CAML: uses a CNN to extract information from the input text and performs the ICD automatic coding task in combination with a label attention mechanism.
4.2, LAAT: to alleviate the inconsistent correspondence between input text segments and different ICD codes, LAAT uses a bidirectional LSTM to extract features of the input text and uses a new label attention mechanism to associate text segments with the corresponding ICD codes.
4.3, MVC-LDA: the model uses multi-view convolution to extract features of text from multiple angles and introduces descriptive information constraints to improve the prediction accuracy of the model.
4.4, KAICD: KAICD extracts input text features with a multi-scale CNN, on this basis processes the ICD code descriptions with a bidirectional GRU to build a code-description knowledge base, and introduces the knowledge of this knowledge base into the model's prediction process through an attention mechanism; in the ICD automatic coding task, the model achieved excellent results on both the English MIMIC-III dataset and the Chinese Xiangya dataset.
4.5, LD-PLAM: the model proposes a new label attention mechanism, pseudo label attention, for the ICD automatic coding task; it applies the same attention pattern to similar ICD codes to reduce computational cost and improve the model's prediction performance, and achieved excellent results on both the English MIMIC-III dataset and the Chinese Xiangya dataset.
TABLE 2
As can be seen from Table 2, the performance of the proposed LEDT_ICD model is superior to the current baseline models overall, verifying the effectiveness of the generative pre-training language model used in the present invention for ICD automatic coding.
5 Ablation experiment
TABLE 3 Table 3
MSL_1024 (Max Source Length 1024) means that the input text is truncated and only the first 1024 characters are fed into the model. As can be seen from Table 3, with the same Trie combination, MSL_2048 outperforms MSL_1024 on every evaluation index, since truncating the model's input text causes information loss; this also verifies the effectiveness of the local attention mechanism in Longformer for modeling long text. Compared with the LEDT_ICD (MSL_2048) model in line 2, since the average number of labels per record in this dataset is small, when the model takes the first topk generated labels (with small k) as the prediction result to calculate the F1 score, the macro_F1 and micro_F1 results of the two differ little; LEDT_ICD (MSL_2048) combined with the prefix tree nevertheless performs better than the model generating without the prefix tree. In addition, compared with generation without the prefix tree, the model combined with the prefix tree has a prominent advantage in the multi-label classification task: no OOV problems occur.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.
Claims (3)
1. An ICD automatic coding method based on a pre-training language model, comprising:
step 1, an ICD automatic coding data set is constructed according to an electronic medical record, wherein the ICD automatic coding data set comprises clinical texts and corresponding ICD codes;
Step 2, obtaining the code description corresponding to the ICD code from the ICD code description library, and forming a mapping set;
Step 3, word segmentation is carried out on the code description to obtain an ids sequence, a prefix tree is constructed according to the ids sequence, the input range and the visual field range of the encoder are adjusted on the basis of a pre-training model, and an LEDT model is formed by combining the prefix tree form;
The step 3 specifically includes:
Step 3.1, word segmentation is carried out on the code description, the code description is converted into an ids sequence in a pre-training language model, ids of a start symbol used in the generation process of the pre-training language model are added in front of the ids sequence, ids of an end symbol used in the generation process of the model are added at the tail of the ids sequence, and an object code ids sequence generated by the model is constructed;
Step 3.2, performing the operation on the ids sequences described by all codes, and constructing a prefix tree;
Step 3.3, expanding the range of input data which can be processed by an encoder in the pre-training model, setting the visual field range of attention of the encoder, and forming an LEDT model by combining a prefix tree form;
Step 4, dividing the ICD automatic coding data set into a training set and a verification set;
Step 5, respectively segmenting clinical texts in the training set and the verification set and corresponding ICD codes thereof to obtain text sequences and corresponding ICD code sequences thereof, and obtaining corresponding code descriptions through the mapping set by the ICD code sequences to form a seq2seq training data set and a seq2seq verification data set according to the text sequences and the corresponding ICD code sequences;
Step 6, training the LEDT model in a teacher-forcing manner using the seq2seq training data set and updating the model parameters; after each training round is finished, inputting the seq2seq verification data set into the LEDT model and recording the model parameters when the loss is minimum;
step 7, selecting the model parameters with the smallest loss on the seq2seq verification data set to obtain a target model, inputting an input text of the data set to be encoded into the target model, and in the decoding generation process of the target model, using a prefix tree to constrain the generated characters, so that the character strings generated by the LEDT model are subsets of the code descriptions, simultaneously using a beam search algorithm to retain the k prediction descriptions with the highest output scores, and finally using the mapping set to convert the k output prediction descriptions into corresponding ICD codes as the prediction output;
The step 7 specifically includes:
Step 7.1, inputting an input text in a data set to be encoded into a target model, segmenting words of the input text by a word segmentation device of the target model, converting the words into an ids sequence, and encoding the ids sequence by an encoder of the target model to obtain a context vector of the input text;
step 7.2, setting the first character generated by the decoder as a start character;
Step 7.3, the decoder hidden state and the prediction description of the previous time step are used as decoder input to update the decoder hidden state;
Step 7.4, sending the updated decoder hidden state into a linear layer network, and calculating the output of the current time step as a prediction description;
step 7.5, querying the prefix tree, and setting to zero the score probability of any token of the prediction description that does not belong to the prefix tree;
Step 7.6, retaining the k highest-scoring prediction descriptions using a beam search algorithm;
And 7.7, repeating the steps 7.3 to 7.6 until all time step predictions are completed, obtaining k prediction descriptions, and converting the output k prediction descriptions into corresponding ICD codes by using a mapping set to serve as prediction output.
2. The method according to claim 1, wherein the step 6 specifically comprises:
Step 6.1, using ICD codes of the seq2seq training data set as input of the LEDT model at the current moment in each time step, using a word segmentation device to segment the clinical texts and the code descriptions corresponding to the ICD codes, converting them into corresponding ids sequences, and using the ids sequences as input of the encoder and decoder in the LEDT model;
step 6.2, the encoder encodes the ids sequence of the clinical text to obtain a context encoding vector corresponding to each time step of the clinical text;
step 6.3, inputting the corresponding context coding vector and ids of the code description of each time step into a decoder to obtain a prediction description, calculating a loss value between the prediction description and the code description, and updating model parameters of the LEDT model through back propagation;
At step 6.4, after each training round has ended, the seq2seq verification dataset is entered into the LEDT model, recording the model parameters at which the loss is minimal.
3. The method according to claim 2, wherein the step 6.3 specifically comprises:
step 6.3.1, initializing the hidden state of the decoder in the previous time step to be the value of the context coding vector corresponding to the current time step;
Step 6.3.2, using the decoder hidden state and the code description word-segmented symbol in the previous time step as decoder input to update the decoder hidden state;
Step 6.3.3, sending the updated decoder hidden state into a linear layer network, and calculating the output of the current time step as a prediction description;
step 6.3.4, calculating a loss value between the prediction description of the current time step and the code description of the next time step, updating the model parameters of the LEDT model through back propagation, and returning to step 6.3.1 to execute the next time step until the prediction of all the time steps is completed, and judging that the current training round is finished;
And 6.3.5, after the current training round is completed, inputting the seq2seq verification data set into the LEDT model, and recording model parameters when the loss is minimum.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410165651.XA CN117708339B (en) | 2024-02-05 | 2024-02-05 | ICD automatic coding method based on pre-training language model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410165651.XA CN117708339B (en) | 2024-02-05 | 2024-02-05 | ICD automatic coding method based on pre-training language model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117708339A CN117708339A (en) | 2024-03-15 |
CN117708339B true CN117708339B (en) | 2024-04-23 |
Family
ID=90153845
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410165651.XA Active CN117708339B (en) | 2024-02-05 | 2024-02-05 | ICD automatic coding method based on pre-training language model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117708339B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110895580A (en) * | 2019-12-12 | 2020-03-20 | 山东众阳健康科技集团有限公司 | ICD operation and operation code automatic matching method based on deep learning |
CN115270715A (en) * | 2021-12-17 | 2022-11-01 | 郑州大学第一附属医院 | Intelligent auxiliary ICD automatic coding method and system for electronic medical record |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20190108175A1 (en) * | 2016-04-08 | 2019-04-11 | Koninklijke Philips N.V. | Automated contextual determination of icd code relevance for ranking and efficient consumption |
US20210343410A1 (en) * | 2020-05-02 | 2021-11-04 | Petuum Inc. | Method to the automatic International Classification of Diseases (ICD) coding for clinical records |
-
2024
- 2024-02-05 CN CN202410165651.XA patent/CN117708339B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110895580A (en) * | 2019-12-12 | 2020-03-20 | 山东众阳健康科技集团有限公司 | ICD operation and operation code automatic matching method based on deep learning |
CN115270715A (en) * | 2021-12-17 | 2022-11-01 | 郑州大学第一附属医院 | Intelligent auxiliary ICD automatic coding method and system for electronic medical record |
Non-Patent Citations (2)
Title |
---|
An automatic ICD coding method for clinical records based on deep neural networks; Du Yichao; Xu Tong; Ma Jianhui; Chen Enhong; Zheng Yi; Liu Tongzhu; Tong Guixian; Big Data (Issue 05); full text *
Du Yichao; Xu Tong; Ma Jianhui; Chen Enhong; Zheng Yi; Liu Tongzhu; Tong Guixian. An automatic ICD coding method for clinical records based on deep neural networks. Big Data. (Issue 05), full text. *
Also Published As
Publication number | Publication date |
---|---|
CN117708339A (en) | 2024-03-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112242187B (en) | Medical scheme recommendation system and method based on knowledge graph characterization learning | |
CN112214995B (en) | Hierarchical multitasking term embedded learning for synonym prediction | |
WO2021139424A1 (en) | Text content quality evaluation method, apparatus and device, and storage medium | |
CN111897949B (en) | Guided text abstract generation method based on Transformer | |
CN110032739B (en) | Method and system for extracting named entities of Chinese electronic medical record | |
CN109670179B (en) | Medical record text named entity identification method based on iterative expansion convolutional neural network | |
CN109871538A (en) | A kind of Chinese electronic health record name entity recognition method | |
CN112364174A (en) | Patient medical record similarity evaluation method and system based on knowledge graph | |
Zhao et al. | Cross-domain image captioning via cross-modal retrieval and model adaptation | |
CN112256828B (en) | Medical entity relation extraction method, device, computer equipment and readable storage medium | |
CN106980609A (en) | A kind of name entity recognition method of the condition random field of word-based vector representation | |
US20220129632A1 (en) | Normalized processing method and apparatus of named entity, and electronic device | |
CN111966810B (en) | Question-answer pair ordering method for question-answer system | |
CN116682553A (en) | Diagnosis recommendation system integrating knowledge and patient representation | |
CN111177375B (en) | Electronic document classification method and device | |
CN113657105A (en) | Medical entity extraction method, device, equipment and medium based on vocabulary enhancement | |
CN115879473A (en) | Chinese medical named entity recognition method based on improved graph attention network | |
CN111145914A (en) | Method and device for determining lung cancer clinical disease library text entity | |
CN113469163A (en) | Medical information recording method and device based on intelligent paper pen | |
CN117708339B (en) | ICD automatic coding method based on pre-training language model | |
CN110516234A (en) | Chinese medicine text segmenting method, system, equipment and medium based on GRU | |
CN115964475A (en) | Dialogue abstract generation method for medical inquiry | |
CN111128390B (en) | Text processing method based on orthopedic symptom feature selection | |
CN117854715B (en) | Intelligent diagnosis assisting system based on inquiry analysis | |
CN115658935B (en) | Personalized comment generation method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |