CN117708339A - ICD automatic coding method based on pre-training language model

ICD automatic coding method based on pre-training language model

Info

Publication number
CN117708339A
Authority
CN
China
Prior art keywords
model
icd
training
prediction
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202410165651.XA
Other languages
Chinese (zh)
Other versions
CN117708339B (en)
Inventor
陈先来
黄伟斌
黄金彩
陈翔
安莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN202410165651.XA
Publication of CN117708339A
Application granted
Publication of CN117708339B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A 90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A 90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The embodiment of the invention provides an ICD automatic coding method based on a pre-training language model, belonging to the technical field of data processing and specifically comprising the following steps: constructing an ICD automatic coding data set; forming a mapping set; constructing a prefix tree and combining it with an adapted pre-training model to form an LEDT model; dividing the ICD automatic coding data set into a training set and a verification set; segmenting the clinical texts and their corresponding ICD codes in the training set and the verification set respectively; training the LEDT model with the seq2seq training data set; inputting an input text from the data set to be encoded into the target model, using the prefix tree to constrain the generated tokens during decoding, retaining the k prediction descriptions with the highest output scores with a beam search algorithm, and finally converting the k output prediction descriptions into the corresponding ICD codes through the mapping set as the prediction output. The scheme of the invention improves coding efficiency, accuracy and adaptability.

Description

ICD automatic coding method based on pre-training language model
Technical Field
The embodiment of the invention relates to the technical field of data processing, in particular to an ICD automatic coding method based on a pre-training language model.
Background
Currently, the coding task of the International Classification of Diseases (ICD) refers to assigning the ICD codes corresponding to the medical entities mentioned in medical texts. It is of great significance for promoting the automation of medical research and practice, improving coding quality, reducing the influence of human error and subjective factors, assisting diagnosis and treatment, supporting Diagnosis Related Groups (DRGs), and realizing intelligent medical-insurance cost control systems.
Early coding tasks relied primarily on manual completion. According to statistics, an average coding expert needs more than 30 minutes to complete the coding of one electronic medical record, which cannot meet the demands created by rapidly growing medical data. Furthermore, manual coding requires an expert with sufficient background knowledge to read the medical record carefully while reviewing related information, a process that is costly, inefficient and prone to error.
Therefore, an ICD automatic coding method based on a pre-training language model, which is efficient, accurate and high in adaptability, is needed.
Disclosure of Invention
In view of this, the embodiment of the invention provides an ICD automatic coding method based on a pre-training language model, which at least partially solves the problems of poor coding efficiency, accuracy and adaptability in the prior art.
The embodiment of the invention provides an ICD automatic coding method based on a pre-training language model, which comprises the following steps:
step 1, an ICD automatic coding data set is constructed according to an electronic medical record, wherein the ICD automatic coding data set comprises clinical texts and corresponding ICD codes;
step 2, obtaining the code description corresponding to the ICD code from the ICD code description library, and forming a mapping set;
step 3, word segmentation is carried out on the code descriptions to obtain ids sequences, a prefix tree is constructed from the ids sequences, the input range and the attention field of view of the encoder are adjusted on the basis of a pre-training model, and the adjusted model is combined with the prefix tree to form the LEDT model;
step 4, dividing the ICD automatic coding data set into a training set and a verification set;
step 5, the clinical texts and their corresponding ICD codes in the training set and the verification set are segmented respectively to obtain text sequences and corresponding ICD code sequences, the ICD code sequences are converted into the corresponding code descriptions through the mapping set, and a seq2seq training data set and a seq2seq verification data set are formed from the text sequences and the corresponding code descriptions;
step 6, the LEDT model is trained on the seq2seq training data set in teacher-forcing mode and the model parameters are updated; after each training round is finished, the seq2seq verification data set is input into the LEDT model and the model parameters with the minimum loss are recorded;
and step 7, the model parameters with the minimum loss on the seq2seq verification data set are selected to obtain a target model, an input text from the data set to be encoded is input into the target model, the prefix tree is used during the decoding generation process of the target model to constrain the generated tokens so that the character strings generated by the LEDT model are restricted to the set of code descriptions, a beam search algorithm is used at the same time to retain the k prediction descriptions with the highest output scores, and finally the mapping set is used to convert the k output prediction descriptions into the corresponding ICD codes as the prediction output.
According to a specific implementation manner of the embodiment of the present invention, the step 3 specifically includes:
step 3.1, word segmentation is carried out on the code description and the code description is converted into an ids sequence of the pre-training language model; the id of the start symbol used in the generation process of the pre-training language model is added in front of the ids sequence and the id of the end symbol used in the generation process of the model is added at the tail of the ids sequence, constructing the target-code ids sequence to be generated by the model;
step 3.2, performing the operation on the ids sequences described by all codes, and constructing a prefix tree;
and step 3.3, the range of input data that the encoder of the pre-training model can process is expanded, the attention field of view of the encoder is set, and the resulting model is combined with the prefix tree to form the LEDT model.
According to a specific implementation manner of the embodiment of the present invention, the step 6 specifically includes:
step 6.1, in each time step, the ground-truth ICD code descriptions of the seq2seq training data set are used as the decoder input of the LEDT model at the current time; a tokenizer segments the clinical texts and the code descriptions corresponding to the ICD codes and converts them into the corresponding ids sequences, which are used as the input of the encoder and the decoder in the LEDT model respectively;
step 6.2, the encoder encodes the ids sequence of the clinical text to obtain a context encoding vector corresponding to each time step of the clinical text;
step 6.3, inputting the corresponding context coding vector and ids of the code description of each time step into a decoder to obtain a prediction description, calculating a loss value between the prediction description and the code description, and updating model parameters of the LEDT model through back propagation;
at step 6.4, after each training round has ended, the seq2seq verification dataset is entered into the LEDT model, recording the model parameters at which the loss is minimal.
According to a specific implementation manner of the embodiment of the present invention, the step 6.3 specifically includes:
step 6.3.1, initializing the hidden state of the decoder in the previous time step to be the value of the context coding vector corresponding to the current time step;
step 6.3.2, using the decoder hidden state and the code description word-segmented symbol in the previous time step as decoder input to update the decoder hidden state;
step 6.3.3, sending the updated decoder hidden state into a linear layer network, and calculating the output of the current time step as a prediction description;
step 6.3.4, calculating a loss value between the predicted description of the current time step and the code description of the next time step, updating model parameters of the LEDT model through back propagation, and returning to step 6.3.1 to execute the next time step until the prediction of all the time steps is completed, and judging that the current training round is finished;
and 6.3.5, after the current training round is completed, inputting the seq2seq verification data set into the LEDT model, and recording model parameters when the loss is minimum.
According to a specific implementation manner of the embodiment of the present invention, the step 7 specifically includes:
step 7.1, inputting an input text in a data set to be encoded into a target model, segmenting words of the input text by a word segmentation device of the target model, converting the words into an ids sequence, and encoding the ids sequence by an encoder of the target model to obtain a context vector of the input text;
step 7.2, setting the first character generated by the decoder as a start character;
step 7.3, the decoder hidden state and the prediction description of the previous time step are used as decoder input to update the decoder hidden state;
step 7.4, sending the updated decoder hidden state into a linear layer network, and calculating the output of the current time step as a prediction description;
step 7.5, inquiring the prefix tree, and setting the symbol score probability of the predictive description which does not belong to the prefix tree to zero;
step 7.6, retaining the k prediction descriptions with the highest scores by using a beam search algorithm;
and 7.7, repeating the steps 7.3 to 7.6 until all time step predictions are completed, obtaining k prediction descriptions, and converting the output k prediction descriptions into corresponding ICD codes by using a mapping set to serve as prediction output.
The embodiment of the invention has the beneficial effects that:
1. The pre-training language model (the LEDT model) is used to perform the ICD automatic coding task, so the morphological variation of words in electronic medical records can be handled better.
2. The ICD coding task is converted from a multi-label classification task into a generation task. On the basis of the efficient denoising ability of the generation model, the ICD code descriptions are used effectively and the interaction between the input text and the ICD code descriptions is enhanced; finally, the generation process of the LEDT model is constrained with the prefix tree, which avoids generating out-of-vocabulary (OOV) words.
3. Longformer is used as the architecture of the pre-training language model, and window-based local attention replaces the global attention of the basic Transformer, so the model reaches a balance among computational complexity, memory overhead and model performance in the ICD coding task, improving coding efficiency, accuracy and adaptability.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic flow chart of an ICD automatic coding method based on a pre-training language model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a specific implementation flow of an ICD automatic coding method based on a pre-training language model according to an embodiment of the present invention;
fig. 3 is a schematic diagram of local and global attention in the LEDT model according to an embodiment of the present invention;
fig. 4 is a schematic diagram of the generation process of the LEDT model combined with a prefix tree according to an embodiment of the present invention;
FIG. 5 is a diagram illustrating statistics of data set lengths used in an embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings.
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following disclosure, which describes the embodiments of the present invention with reference to specific examples. It will be apparent that the described embodiments are only some, but not all, embodiments of the invention. The invention may also be practiced or carried out in other embodiments, and the details of the present description may be modified or varied without departing from the spirit and scope of the present invention. It should be noted that the following embodiments and the features in the embodiments may be combined with each other without conflict. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without making any inventive effort are intended to be within the scope of the invention.
It is noted that various aspects of the embodiments are described below within the scope of the following claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the present disclosure, one skilled in the art will appreciate that one aspect described herein may be implemented independently of any other aspect, and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. In addition, such apparatus may be implemented and/or such methods practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.
It should also be noted that the illustrations provided in the following embodiments merely illustrate the basic concept of the present invention by way of illustration, and only the components related to the present invention are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.
In addition, in the following description, specific details are provided in order to provide a thorough understanding of the examples. However, it will be understood by those skilled in the art that the aspects may be practiced without these specific details.
Currently, the following challenges and opportunities exist in the ICD auto-coding field:
(1) Multi-label classification task: most researchers treat the ICD automatic coding task as a multi-label classification task, but the huge label space of ICD codes makes it difficult for a model to accurately capture the relationship between ICD codes and input text fragments.
(2) Characteristics of medical natural language processing: the ICD automatic coding task belongs to the field of medical natural language processing and involves problems such as inconsistent word morphology and differences in writing style; medical text may contain a large number of abbreviations, colloquialisms, paraphrases, polysemy, synonymy and even misspellings. Pre-training on a large-scale general corpus can improve a model's ability to understand textual context.
(3) Challenges of long text processing: the clinical texts input in ICD automatic coding tasks often exceed the processing range of basic pre-trained language models, whose computational complexity and memory overhead grow rapidly with text length because of the fully connected attention mechanism and deep network structure. Thus, current ICD automatic coding work still tends to process the input text with convolutional neural networks (Convolutional Neural Network, CNN), recurrent neural networks (Recurrent Neural Network, RNN) or their variants, and rarely uses pre-trained language models. However, CNN-based models have difficulty accurately capturing the complex relationships between input text and labels, while RNN-based models can suffer from forgetting when processing long text, which leaves them room for improvement in ICD automatic coding tasks.
(4) Challenges of noise handling: the input text in an ICD automatic coding task often contains a lot of noise, which poses challenges to the coding task. Generative pre-training language models of the Transformer encoder-decoder class (such as BART) inject noise into the original text during pre-training and then restore the original text with the decoder, thereby learning the knowledge contained in the text. The BART model shows excellent performance in tasks such as question answering, abstractive summarization and machine translation, and its autoregressive text generation can better capture the interaction between the input text and the target sentence.
The embodiment of the invention provides an ICD automatic coding method based on a pre-training language model, which can be applied to the coding process of electronic medical records in medical scenes.
Referring to fig. 1, a flowchart of an ICD automatic coding method based on a pre-training language model according to an embodiment of the present invention is shown. As shown in fig. 1 and 2, the method mainly comprises the following steps:
step 1, an ICD automatic coding data set is constructed according to an electronic medical record, wherein the ICD automatic coding data set comprises clinical texts and corresponding ICD codes;
In specific implementation, electronic medical record information is collected. For one visit record i of a patient, the relevant diagnostic information comprises: admission diagnosis (Admission Diagnosis) AD_i, discharge diagnosis (Discharge Diagnosis) DD_i, procedure name (Procedure Name) PN_i, preoperative diagnosis (Preoperative Diagnosis) PreD_i, postoperative diagnosis (Postoperative Diagnosis) PostD_i, medical order (Medical Order) MO_i, etc. The relevant diagnostic information of the i-th visit is combined into one visit document (Document):
D_i = [AD_i; DD_i; PN_i; PreD_i; PostD_i; MO_i]
At the same time, the ICD codes mentioned in D_i are collected as a label vector Y_i = (y_{i,1}, ..., y_{i,|A|}), where y_{i,j} = 1 when D_i mentions the j-th ICD code and y_{i,j} = 0 otherwise. A denotes the ICD code space of this task (hereinafter, unless otherwise indicated, |A| refers to the number of elements in A). Therefore, the ICD automatic coding task in this embodiment can be described as: for an input text D_i, predict all the ICD codes it mentions, i.e. learn a mapping f: D_i → Y_i.
step 2, obtaining the code description corresponding to the ICD code from the ICD code description library, and forming a mapping set;
In particular implementations, the textual descriptions of the ICD codes involved in the data set can be obtained from an ICD code description library, and a one-to-one (ICD code, ICD code description) mapping set is constructed; because the correspondence between ICD codes and code descriptions is one-to-one, the mapping is invertible.
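As a concrete illustration of this step, the following minimal Python sketch builds the invertible mapping set from a code-description file; the file name and its tab-separated format are assumptions for illustration, not details given in the patent.

```python
# Minimal sketch: build the invertible (ICD code <-> description) mapping set.
# "icd_code_descriptions.tsv" and its "code<TAB>description" format are assumptions.
code2desc = {}
with open("icd_code_descriptions.tsv", encoding="utf-8") as f:
    for line in f:
        code, desc = line.rstrip("\n").split("\t", maxsplit=1)
        code2desc[code] = desc

# Invertible because the code/description correspondence is one-to-one.
desc2code = {d: c for c, d in code2desc.items()}
```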
Step 3, word segmentation is carried out on the code descriptions to obtain ids sequences, a prefix tree is constructed from the ids sequences, the input range and the attention field of view of the encoder are adjusted on the basis of a pre-training model, and the adjusted model is combined with the prefix tree to form the LEDT model;
on the basis of the above embodiment, the step 3 specifically includes:
step 3.1, word segmentation is carried out on the code description and the code description is converted into an ids sequence of the pre-training language model; the id of the start symbol used in the generation process of the pre-training language model is added in front of the ids sequence and the id of the end symbol used in the generation process of the model is added at the tail of the ids sequence, constructing the target-code ids sequence to be generated by the model;
step 3.2, performing the operation on the ids sequences described by all codes, and constructing a prefix tree;
and step 3.3, the range of input data that the encoder of the pre-training model can process is expanded, the attention field of view of the encoder is set, and the resulting model is combined with the prefix tree to form the LEDT model.
In specific implementation, all the code descriptions obtained in step 2 are segmented and converted into ids sequences of the pre-training language model; the id of the start symbol used in the generation process of the pre-training language model is added in front of each ids sequence and the id of the end symbol used in the generation process is appended at its end, constructing the target-code ids sequences to be generated by the model. This operation is performed on the ids sequences of all code descriptions, and a prefix tree (Trie) is constructed from them.
On this basis, since no checkpoint of an LED model for the Chinese medical domain has been found so far, the modification can be made on a Chinese BART pre-training model: the range of input data that the encoder can process is expanded and a limited attention field of view is set to reduce computational complexity, forming an LED; combining it with the prefix tree finally yields the LEDT model. As shown in FIG. 3, Longformer serves as the architecture of the pre-trained model and gives global attention to the token [CLS], so that it can attend to all other tokens and all other tokens can attend to it, while the remaining tokens can only attend to nearby tokens. Because Longformer blocks are stacked, when enough layers are stacked the receptive field of the top-level block is large enough even for tokens that are given only local attention; this attention pattern allows the pre-training language model to reach a balance among computational complexity, memory overhead and model performance when performing the ICD coding task.
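To make the window-based local attention plus [CLS] global attention described above concrete, the following toy NumPy sketch builds such an attention mask; it only illustrates the pattern and is not the actual LEDT implementation.

```python
import numpy as np

def longformer_style_mask(seq_len: int, window: int, global_positions=(0,)) -> np.ndarray:
    """Toy illustration of Longformer-style attention: 1 = may attend, 0 = masked."""
    mask = np.zeros((seq_len, seq_len), dtype=np.int8)
    half = window // 2
    for i in range(seq_len):
        lo, hi = max(0, i - half), min(seq_len, i + half + 1)
        mask[i, lo:hi] = 1                      # sliding-window local attention
    for g in global_positions:                  # e.g. the [CLS] token at position 0
        mask[g, :] = 1                          # the global token sees every position
        mask[:, g] = 1                          # and every position sees it
    return mask

print(longformer_style_mask(seq_len=8, window=4))
```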
Take the ICD code description "secondary malignant tumor of gastric lymph nodes" as an example: the description is first segmented by the tokenizer and converted into the corresponding ids sequence.
In the pre-trained language model used here, the start symbol and the end symbol of the generation process are the same token, whose id is 2. Therefore, the target-code ids sequence for "secondary malignant tumor of gastric lymph nodes" is its tokenized ids sequence with this id prepended and appended.
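A minimal sketch of steps 3.1-3.2 follows: each tokenized code description, wrapped with the start/end ids, is inserted into a prefix tree, and a helper returns the tokens allowed to follow a given prefix. The token ids shown are placeholders, not the real vocabulary of the model.

```python
from collections import defaultdict

def make_trie():
    return defaultdict(make_trie)

def insert_ids(trie, ids):
    node = trie
    for tok in ids:          # walk/create one node per token id
        node = node[tok]

def allowed_next_tokens(trie, prefix_ids):
    node = trie
    for tok in prefix_ids:
        if tok not in node:
            return []        # prefix does not match any code description
        node = node[tok]
    return list(node.keys()) # tokens that keep the sequence inside the trie

# Illustrative only: bos_id/eos_id and the tokenized descriptions are placeholders.
bos_id, eos_id = 0, 2
trie = make_trie()
for ids in [[101, 102, 103], [101, 104]]:         # pretend tokenized code descriptions
    insert_ids(trie, [bos_id] + ids + [eos_id])

print(allowed_next_tokens(trie, [bos_id, 101]))   # -> [102, 104]
```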
Step 4, dividing the ICD automatic coding data set into a training set and a verification set;
In particular, the data set collected in step 1 can be randomly divided into a training set (Train set) and a verification set (Dev set) at a ratio of 9:1, and the corresponding index sets (Train set index and Dev set index) are established for the subsequent operation flow.
Step 5, the clinical texts and their corresponding ICD codes in the training set and the verification set are segmented respectively to obtain text sequences and corresponding ICD code sequences, the ICD code sequences are converted into the corresponding code descriptions through the mapping set, and a seq2seq training data set and a seq2seq verification data set are formed from the text sequences and the corresponding code descriptions;
In particular, the data in the training set and the verification set are converted into a seq2seq training data set and a seq2seq verification data set. Taking the training set as an example, for each piece of data (D_i, Y_i) in the training set the following operation is performed: D_i is split according to its ICD codes, producing one sample per ICD code and forming a training data set divided by ICD code.
Each ICD code with y_{i,j} = 1 is then converted into its ICD code description through the mapping set, forming the final seq2seq training data set (seq2seqTrain set), whose samples take the form (clinical text, ICD code description).
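The conversion just described can be sketched as follows; the function and variable names are illustrative, and the toy record only demonstrates the (clinical text, code description) pair format of the seq2seq data sets.

```python
def build_seq2seq_samples(records, code2desc):
    """records: iterable of (clinical_text, icd_codes); returns (source, target) pairs."""
    samples = []
    for text, codes in records:
        for code in codes:
            # One seq2seq sample per ICD code of the visit, target = its code description.
            samples.append((text, code2desc[code]))
    return samples

# Toy record with a placeholder code; real data comes from the ICD coding data set of step 1.
toy = [("腹痛3天，胃淋巴结肿大……", ["ICD_X"])]
print(build_seq2seq_samples(toy, {"ICD_X": "胃淋巴结继发恶性肿瘤"}))
```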
Step 6, the LEDT model is trained on the seq2seq training data set in teacher-forcing mode and the model parameters are updated; after each training round is finished, the seq2seq verification data set is input into the LEDT model and the model parameters with the minimum loss are recorded;
on the basis of the above embodiment, the step 6 specifically includes:
step 6.1, in each time step, the ground-truth ICD code descriptions of the seq2seq training data set are used as the decoder input of the LEDT model at the current time; a tokenizer segments the clinical texts and the code descriptions corresponding to the ICD codes and converts them into the corresponding ids sequences, which are used as the input of the encoder and the decoder in the LEDT model respectively;
step 6.2, the encoder encodes the ids sequence of the clinical text to obtain a context encoding vector corresponding to each time step of the clinical text;
step 6.3, inputting the corresponding context coding vector and ids of the code description of each time step into a decoder to obtain a prediction description, calculating a loss value between the prediction description and the code description, and updating model parameters of the LEDT model through back propagation;
at step 6.4, after each training round has ended, the seq2seq verification dataset is entered into the LEDT model, recording the model parameters at which the loss is minimal.
Further, the step 6.3 specifically includes:
step 6.3.1, initializing the hidden state of the decoder in the previous time step to be the value of the context coding vector corresponding to the current time step;
step 6.3.2, using the decoder hidden state and the code description word-segmented symbol in the previous time step as decoder input to update the decoder hidden state;
step 6.3.3, sending the updated decoder hidden state into a linear layer network, and calculating the output of the current time step as a prediction description;
step 6.3.4, calculating a loss value between the predicted description of the current time step and the code description of the next time step, updating model parameters of the LEDT model through back propagation, and returning to step 6.3.1 to execute the next time step until the prediction of all the time steps is completed, and judging that the current training round is finished;
and 6.3.5, after the current training round is completed, inputting the seq2seq verification data set into the LEDT model, and recording model parameters when the loss is minimum.
In specific implementation, in the model training stage the LEDT model is trained in the seq2seq manner. In this embodiment, teacher forcing is used: during training, each time step does not take the output of the previous time step as the current input, but directly takes the ground-truth token of the training data as the current input, so that the model learns the mapping from an input clinical document DOC to the ICD code description (desc) corresponding to that document. For each sample, the document DOC and the ICD code description desc are segmented by the tokenizer, converted into ids, and used as the input of the LED encoder and the LED decoder respectively. The LED encoder encodes the tokenized document and obtains the context encoding vector of the DOC:
C = LED_Encoder(ids(DOC))
where ids(·) denotes first segmenting the text with the tokenizer and then converting it into ids. During decoding, the generated tokens are aligned with the ids of the segmented description, and training uses the teacher-forcing form. The hidden state of the decoder is first initialized to the value of the context vector:
h_0 = C
Assume the result of segmenting desc and converting it into ids is (desc_1, desc_2, ..., desc_T). For each time step t, the following operations are performed:
Step a: update the hidden state of the LED decoder, where the decoder hidden state of the previous time step and the ground-truth description token are taken as the decoder input:
h_t = LED_Decoder(h_{t-1}, desc_{t-1})
Step b: feed the decoder hidden state into a linear layer network and calculate the output of the current time step:
o_t = Linear(h_t)
Step c: calculate the loss between o_t and desc_t using the cross-entropy function.
When the seq2seq training data set is fed into the model, after the loss between o_t and desc_t is calculated in Step c, the model parameters are updated by error back-propagation to optimize training. After each training round ends, the seq2seq verification data set is fed into the model; after the loss between o_t and desc_t is calculated in Step c, the loss of the model over the whole seq2seq verification data set is computed, and the model parameters with the minimum loss are saved.
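A hedged sketch of this teacher-forcing training loop, written against a generic Hugging Face-style seq2seq interface, is given below. The checkpoint path, batch size, learning rate and epoch count are placeholders, and train_pairs/dev_pairs are the (clinical text, code description) pair lists built in step 5; passing `labels` to the model applies teacher forcing and cross-entropy internally, standing in for Steps a-c above.

```python
import torch
from torch.utils.data import DataLoader
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Placeholder checkpoint: the patent builds its own Chinese LED ("LEDT"); no public checkpoint is assumed.
ckpt = "path/to/chinese-led-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForSeq2SeqLM.from_pretrained(ckpt)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def collate(batch):
    texts, descs = zip(*batch)                       # (clinical text, ICD code description) pairs
    enc = tokenizer(list(texts), truncation=True, max_length=2048,
                    padding=True, return_tensors="pt")
    tgt = tokenizer(list(descs), truncation=True, max_length=64,
                    padding=True, return_tensors="pt")
    # Padding positions are set to -100 so the cross-entropy loss ignores them.
    enc["labels"] = tgt["input_ids"].masked_fill(tgt["attention_mask"] == 0, -100)
    return enc

best_val = float("inf")
for epoch in range(3):
    model.train()
    for batch in DataLoader(train_pairs, batch_size=2, shuffle=True, collate_fn=collate):
        # Passing `labels` shifts them right internally (teacher forcing) and
        # returns the cross-entropy loss between predictions and ground-truth descriptions.
        loss = model(**batch).loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    model.eval()
    with torch.no_grad():
        val_loss = sum(model(**b).loss.item()
                       for b in DataLoader(dev_pairs, batch_size=2, collate_fn=collate))
    if val_loss < best_val:                          # keep the parameters with the lowest validation loss
        best_val = val_loss
        model.save_pretrained("best_ledt")
```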
And step 7, the model parameters with the minimum loss on the seq2seq verification data set are selected to obtain a target model, and an input text from the data set to be encoded (whose index set is denoted Test set index) is input into the target model. During the decoding generation process of the target model, the prefix tree is used to constrain the generated tokens so that the character strings generated by the LEDT model are restricted to the set of code descriptions; meanwhile, a beam search algorithm retains the k prediction descriptions with the highest output scores, and finally the mapping set converts the k output prediction descriptions into the corresponding ICD codes as the prediction output.
On the basis of the above embodiment, the step 7 specifically includes:
step 7.1, inputting an input text in a data set to be encoded into a target model, segmenting words of the input text by a word segmentation device of the target model, converting the words into an ids sequence, and encoding the ids sequence by an encoder of the target model to obtain a context vector of the input text;
step 7.2, setting the first character generated by the decoder as a start character;
step 7.3, the decoder hidden state and the prediction description of the previous time step are used as decoder input to update the decoder hidden state;
step 7.4, sending the updated decoder hidden state into a linear layer network, and calculating the output of the current time step as a prediction description;
step 7.5, inquiring the prefix tree, and setting the symbol score probability of the predictive description which does not belong to the prefix tree to zero;
step 7.6, retaining the k prediction descriptions with the highest scores by using a beam search algorithm;
and 7.7, repeating the steps 7.3 to 7.6 until all time step predictions are completed, obtaining k prediction descriptions, and converting the output k prediction descriptions into corresponding ICD codes by using a mapping set to serve as prediction output.
In implementation, as shown in fig. 4, the model loads the parameters saved in step 6 when the verification set loss was minimal. In the model testing and generation stage, the LEDT model is combined with the prefix tree Trie to generate ICD code descriptions, and a beam search algorithm is combined to realize the multi-label classification task. Each document DOC in the data set to be encoded is segmented by the tokenizer, converted into ids and fed as input to the encoder of the LEDT model, which encodes it to obtain the context encoding vector of the DOC:
C = LED_Encoder(ids(DOC))
The hidden state of the decoder is initialized to the value of the context vector, and the first token generated by the LEDT model is set to the start symbol. For each time step t, the following operations are performed:
Step a: update the decoder hidden state, with the decoder hidden state and the prediction of the previous time step taken as the decoder input:
h_t = LED_Decoder(h_{t-1}, pred_{t-1})
Step b: feed the decoder hidden state into a linear layer network (Linear) and calculate the output of the current time step:
o_t = Linear(h_t)
Step c: query the prefix tree Trie and set the score probability of tokens that do not belong to the prefix tree to zero.
Step d: retain the most likely top-k output sequences using the beam search algorithm.
After Steps a-d are repeated for all time steps, the top-k ICD code descriptions for DOC are obtained, and they are converted through the mapping set into the top-k ICD code predictions for DOC.
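The constrained generation stage can be sketched with the generic `generate` API and a `prefix_allowed_tokens_fn` callback, reusing the trie helper and mapping set from the earlier sketches; `clinical_text` is one document to be encoded. This is an approximation of Steps a-d: the exact alignment between the trie root and the decoder's start tokens depends on the checkpoint's special-token configuration and may require an offset.

```python
import torch

def allowed_fn(batch_id, input_ids):
    # Only tokens that keep the partial output inside the code-description trie may be generated.
    allowed = allowed_next_tokens(trie, input_ids.tolist())
    return allowed if allowed else [tokenizer.eos_token_id]

k = 5
enc = tokenizer(clinical_text, truncation=True, max_length=2048, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**enc,
                         num_beams=k,
                         num_return_sequences=k,            # keep the k highest-scoring beams
                         prefix_allowed_tokens_fn=allowed_fn,
                         max_new_tokens=64)

# Decode the k descriptions and map them back to ICD codes through the mapping set.
pred_descs = [tokenizer.decode(seq, skip_special_tokens=True).replace(" ", "") for seq in out]
pred_codes = [desc2code.get(d) for d in pred_descs]
```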
According to the ICD automatic coding method based on a pre-training language model provided by the embodiment of the invention, the pre-training language model LEDT is used to perform the ICD automatic coding task, so the morphological variation of words in electronic medical records can be handled better; the ICD coding task is converted from a multi-label classification task into a generation task, the ICD code descriptions are used effectively on the basis of the efficient denoising ability of the generation model, the interaction between the input text and the ICD code descriptions is enhanced, and the generation process of the LEDT model is finally constrained with a prefix tree, avoiding out-of-vocabulary words during generation; Longformer is used as the architecture of the pre-training language model, and window-based local attention replaces the global attention of the basic Transformer, so that the model reaches a balance among computational complexity, memory overhead and model performance in the ICD coding task, improving coding efficiency, accuracy and adaptability.
The invention will be further described with reference to a specific example.
1, data description: the data set used to verify this experiment comes from the clinical diagnosis coding task (evaluation task five) of the CHIP2022 evaluation, and its length statistics are shown in fig. 5. Given the relevant diagnostic information of one visit (including admission diagnosis, preoperative diagnosis, postoperative diagnosis and discharge diagnosis) as well as the procedure names, drug names and order names, the task requires giving the corresponding ICD codes from the vocabulary of "Classification and Codes of Diseases, National Clinical Version 2.0". All records come from real medical data and are annotated with this vocabulary as the standard.
1.1, the numbers of documents in the training set, test set and validation set, and the maximum, minimum and average numbers of labels per document (i.e. the number of ICD codes per document) are shown in Table 1:
TABLE 1
1.2, the ICD code descriptions in this example come from the vocabulary of "Classification and Codes of Diseases, National Clinical Version 2.0".
2, sources of pre-training language model parameters and LED construction: since no publicly released LED pre-training language model parameters for Chinese were found, this embodiment uses a script to transform a Chinese BART (which is also a seq2seq model), expanding the longest text length it can process and adding a local attention mechanism to it, thereby forming an LED. The pre-trained language model in this embodiment comes from a pre-set model database.
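As a rough illustration of the first part of this conversion (extending the encoder's positional range), the sketch below tiles the pretrained position embeddings of a Chinese BART to a longer maximum length; the checkpoint name is an assumption, and replacing full self-attention with windowed local attention is a separate step that is not shown.

```python
import torch
from transformers import BartForConditionalGeneration

# Assumed checkpoint; the patent only says "a Chinese BART pre-training model".
model = BartForConditionalGeneration.from_pretrained("fnlp/bart-base-chinese")

new_max_pos, offset = 2048, 2           # BART's learned position embeddings use an offset of 2
old = model.model.encoder.embed_positions.weight.data
old_len, d_model = old.size(0) - offset, old.size(1)

new = old.new_empty(new_max_pos + offset, d_model)
new[:offset] = old[:offset]
for start in range(0, new_max_pos, old_len):          # tile the pretrained positions
    length = min(old_len, new_max_pos - start)
    new[offset + start: offset + start + length] = old[offset: offset + length]

model.model.encoder.embed_positions.weight.data = new
model.config.max_position_embeddings = new_max_pos
# Swapping the encoder's full self-attention for Longformer-style windowed attention is omitted here.
```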
3, evaluation indices: the ICD automatic coding task is a multi-label classification task, and its commonly used evaluation indices are adopted: Micro_F1, Macro_F1, Micro_AUC, Macro_AUC and P@k, with k = 1, 2, 5 in this experiment. In particular, in the ablation experiments, since a generation model is used as the framework basis, which may lead to OOV problems, len(OOV) is added as an evaluation index; len(OOV) represents the total number of out-of-vocabulary words produced during generation.
3.1, micro_f1 and macro_f1: the ICD encoding task is a multi-tag classification task, and micro_F1 and macro_F1 are performance indicators of the evaluation model in the multi-tag classification task. Micro_f1 considers the accuracy and recall of all tags and unifies them into one overall evaluation index. Macro_f1 calculates the average of each tag accuracy and recall. These two metrics may measure the overall classification performance of the model for all ICD tags and the classification performance of the individual tags.
3.2, Micro_AUC and Macro_AUC: AUC (Area Under the ROC Curve) is a commonly used index for measuring classification quality. In the ICD coding task, Micro_AUC merges the predictions for all labels into one whole and calculates its AUC value, while Macro_AUC calculates the AUC of each label from its true labels and predicted probabilities and averages the AUCs over all labels. These two indices evaluate the overall prediction quality of the model and the differences between individual labels.
3.3, P@K: for ICD encoding tasks, P@K is used to evaluate the average accuracy of the model in top-k prediction. The method evaluates the ranking effect of the model in the prediction set and helps to select the prediction result with the top ranking. This is useful for ICD coding, as it can provide a reference to the ICD code most relevant to the practitioner.
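A sketch of how these indices can be computed over multi-hot label matrices is shown below; it assumes a per-label score matrix (as produced by the baseline classifiers), whereas the generative model's scores come from beam search, and the variable names are illustrative.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

def evaluate(y_true, y_score, ks=(1, 2, 5)):
    """y_true: (n, L) binary label matrix; y_score: (n, L) predicted scores/probabilities."""
    y_pred = (y_score >= 0.5).astype(int)
    out = {
        "micro_f1": f1_score(y_true, y_pred, average="micro", zero_division=0),
        "macro_f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "micro_auc": roc_auc_score(y_true, y_score, average="micro"),
        "macro_auc": roc_auc_score(y_true, y_score, average="macro"),
    }
    for k in ks:                                      # P@k: precision among the k top-scored codes
        topk = np.argsort(-y_score, axis=1)[:, :k]
        hits = np.take_along_axis(y_true, topk, axis=1)
        out[f"P@{k}"] = hits.mean()
    return out
```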
4, comparison with other methods:
the experiment selects advanced models in a plurality of ICD automatic coding tasks as a baseline model:
4.1, CAML: uses a CNN to extract information from the input text and performs the ICD automatic coding task with a label attention mechanism.
4.2, LAAT: to alleviate the inconsistent correspondence between input text fragments and different ICD codes, LAAT uses a bidirectional LSTM to extract features of the input text and a new label attention mechanism to associate text fragments with the corresponding ICD codes.
4.3, MVC-LDA: the model uses multi-view convolution to extract features of text from multiple angles and introduces descriptive information constraints to improve the prediction accuracy of the model.
4.4, KAICD: KAICD uses a multi-scale CNN to extract input text features, processes the ICD code descriptions with a bidirectional GRU to build a code description knowledge base, and introduces the knowledge of this knowledge base into the prediction process of the model through an attention mechanism; the model achieves excellent results on both the English MIMIC-III data set and the Chinese Xiangya data set in the ICD automatic coding task.
4.5, LD-PLAM: the model proposes a new label attention mechanism, the pseudo-label attention mechanism, for the ICD automatic coding task, using the same attention pattern for similar ICD codes to reduce the computational cost and improve prediction performance; the model also achieves excellent results on both the English MIMIC-III data set and the Chinese Xiangya data set.
TABLE 2
As can be seen from Table 2, the overall performance of the proposed LEDT_ICD model is superior to the current baseline models, which verifies the effectiveness of using a generative pre-training language model for ICD automatic coding as done in the present invention.
5, ablation experiment
TABLE 3
MSL_1024 (Max Source Length 1024) means that the input text is truncated and only the first 1024 characters are fed into the model. As can be seen from Table 3, under the same Trie setting, MSL_2048 outperforms MSL_1024 on every evaluation index, because truncating the model's input text causes information loss; this also verifies the effectiveness of the local attention mechanism in Longformer for modeling long text. Compared with the LEDT_ICD (MSL_2048) model in row 2, since the average number of labels per record in this data set is not large, when the model takes the top-k generated labels (for small k) as the prediction result to compute the F1 score, the Macro_F1 and Micro_F1 results of the two differ little; nevertheless, LEDT_ICD (MSL_2048) combined with the prefix tree performs better than the model that generates without the prefix tree. In addition, compared with generation without the prefix tree, the model combined with the prefix tree has a prominent advantage in the multi-label classification task: OOV problems do not occur.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof.
The foregoing is merely illustrative of the present invention, and the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the present invention. Therefore, the protection scope of the invention is subject to the protection scope of the claims.

Claims (5)

1. An ICD automatic coding method based on a pre-training language model, comprising:
step 1, an ICD automatic coding data set is constructed according to an electronic medical record, wherein the ICD automatic coding data set comprises clinical texts and corresponding ICD codes;
step 2, obtaining the code description corresponding to the ICD code from the ICD code description library, and forming a mapping set;
step 3, word segmentation is carried out on the code descriptions to obtain ids sequences, a prefix tree is constructed from the ids sequences, the input range and the attention field of view of the encoder are adjusted on the basis of a pre-training model, and the adjusted model is combined with the prefix tree to form the LEDT model;
step 4, dividing the ICD automatic coding data set into a training set and a verification set;
step 5, the clinical texts and their corresponding ICD codes in the training set and the verification set are segmented respectively to obtain text sequences and corresponding ICD code sequences, the ICD code sequences are converted into the corresponding code descriptions through the mapping set, and a seq2seq training data set and a seq2seq verification data set are formed from the text sequences and the corresponding code descriptions;
step 6, the LEDT model is trained on the seq2seq training data set in teacher-forcing mode and the model parameters are updated; after each training round is finished, the seq2seq verification data set is input into the LEDT model and the model parameters with the minimum loss are recorded;
and step 7, the model parameters with the minimum loss on the seq2seq verification data set are selected to obtain a target model, an input text from the data set to be encoded is input into the target model, the prefix tree is used during the decoding generation process of the target model to constrain the generated tokens so that the character strings generated by the LEDT model are restricted to the set of code descriptions, a beam search algorithm is used at the same time to retain the k prediction descriptions with the highest output scores, and finally the mapping set is used to convert the k output prediction descriptions into the corresponding ICD codes as the prediction output.
2. The method according to claim 1, wherein the step 3 specifically comprises:
step 3.1, word segmentation is carried out on the code description and the code description is converted into an ids sequence of the pre-training language model; the id of the start symbol used in the generation process of the pre-training language model is added in front of the ids sequence and the id of the end symbol used in the generation process of the model is added at the tail of the ids sequence, constructing the target-code ids sequence to be generated by the model;
step 3.2, performing the operation on the ids sequences described by all codes, and constructing a prefix tree;
and step 3.3, the range of input data that the encoder of the pre-training model can process is expanded, the attention field of view of the encoder is set, and the resulting model is combined with the prefix tree to form the LEDT model.
3. The method according to claim 2, wherein the step 6 specifically comprises:
step 6.1, in each time step, the ground-truth ICD code descriptions of the seq2seq training data set are used as the decoder input of the LEDT model at the current time; a tokenizer segments the clinical texts and the code descriptions corresponding to the ICD codes and converts them into the corresponding ids sequences, which are used as the input of the encoder and the decoder in the LEDT model respectively;
step 6.2, the encoder encodes the ids sequence of the clinical text to obtain a context encoding vector corresponding to each time step of the clinical text;
step 6.3, inputting the corresponding context coding vector and ids of the code description of each time step into a decoder to obtain a prediction description, calculating a loss value between the prediction description and the code description, and updating model parameters of the LEDT model through back propagation;
at step 6.4, after each training round has ended, the seq2seq verification dataset is entered into the LEDT model, recording the model parameters at which the loss is minimal.
4. A method according to claim 3, wherein said step 6.3 comprises:
step 6.3.1, initializing the hidden state of the decoder in the previous time step to be the value of the context coding vector corresponding to the current time step;
step 6.3.2, using the decoder hidden state and the code description word-segmented symbol in the previous time step as decoder input to update the decoder hidden state;
step 6.3.3, sending the updated decoder hidden state into a linear layer network, and calculating the output of the current time step as a prediction description;
step 6.3.4, calculating a loss value between the predicted description of the current time step and the code description of the next time step, updating model parameters of the LEDT model through back propagation, and returning to step 6.3.1 to execute the next time step until the prediction of all the time steps is completed, and judging that the current training round is finished;
and 6.3.5, after the current training round is completed, inputting the seq2seq verification data set into the LEDT model, and recording model parameters when the loss is minimum.
5. The method according to claim 4, wherein the step 7 specifically includes:
step 7.1, inputting an input text in a data set to be encoded into a target model, segmenting words of the input text by a word segmentation device of the target model, converting the words into an ids sequence, and encoding the ids sequence by an encoder of the target model to obtain a context vector of the input text;
step 7.2, setting the first character generated by the decoder as a start character;
step 7.3, the decoder hidden state and the prediction description of the previous time step are used as decoder input to update the decoder hidden state;
step 7.4, sending the updated decoder hidden state into a linear layer network, and calculating the output of the current time step as a prediction description;
step 7.5, inquiring the prefix tree, and setting the symbol score probability of the predictive description which does not belong to the prefix tree to zero;
step 7.6, retaining the k prediction descriptions with the highest scores by using a beam search algorithm;
and 7.7, repeating the steps 7.3 to 7.6 until all time step predictions are completed, obtaining k prediction descriptions, and converting the output k prediction descriptions into corresponding ICD codes by using a mapping set to serve as prediction output.
CN202410165651.XA 2024-02-05 2024-02-05 ICD automatic coding method based on pre-training language model Active CN117708339B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410165651.XA CN117708339B (en) 2024-02-05 2024-02-05 ICD automatic coding method based on pre-training language model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410165651.XA CN117708339B (en) 2024-02-05 2024-02-05 ICD automatic coding method based on pre-training language model

Publications (2)

Publication Number Publication Date
CN117708339A true CN117708339A (en) 2024-03-15
CN117708339B CN117708339B (en) 2024-04-23

Family

ID=90153845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410165651.XA Active CN117708339B (en) 2024-02-05 2024-02-05 ICD automatic coding method based on pre-training language model

Country Status (1)

Country Link
CN (1) CN117708339B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190108175A1 (en) * 2016-04-08 2019-04-11 Koninklijke Philips N.V. Automated contextual determination of icd code relevance for ranking and efficient consumption
CN110895580A (en) * 2019-12-12 2020-03-20 山东众阳健康科技集团有限公司 ICD operation and operation code automatic matching method based on deep learning
US20210343410A1 (en) * 2020-05-02 2021-11-04 Petuum Inc. Method to the automatic International Classification of Diseases (ICD) coding for clinical records
CN115270715A (en) * 2021-12-17 2022-11-01 郑州大学第一附属医院 Intelligent auxiliary ICD automatic coding method and system for electronic medical record

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
杜逸超; 徐童; 马建辉; 陈恩红; 郑毅; 刘同柱; 童贵显: "An automatic ICD coding method for clinical records based on deep neural networks" (一种基于深度神经网络的临床记录ICD自动编码方法), 大数据 (Big Data), no. 05 *

Also Published As

Publication number Publication date
CN117708339B (en) 2024-04-23


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant