CN114048320B

CN114048320B - Multi-label international disease classification training method based on course learning

Info

Publication number: CN114048320B
Application number: CN202210029712.0A
Authority: CN
Inventors: 王亚强; 韩旭; 郝学超; 舒红平; 朱涛
Original assignee: Chengdu University of Information Technology; West China Hospital of Sichuan University
Current assignee: Chengdu University of Information Technology; West China Hospital of Sichuan University
Priority date: 2022-01-12
Filing date: 2022-01-12
Publication date: 2022-03-29
Anticipated expiration: 2042-01-12
Also published as: CN114048320A

Abstract

The invention discloses a multi-label international disease classification training method based on course learning, which controls label distribution through three different small batch sample sampling methods when classifying and automatically coding an international disease large-scale data set. Firstly, obtaining a multi-label international disease training sample set, and dividing the multi-label international disease training sample set into a plurality of training sample subsets; the first stage training, carrying out iterative layered sampling and gradient calculation on a training sample subset and updating model parameters in a first round; performing second-stage training, namely performing iterative scrambling segmentation and gradient calculation on the training sample set and updating model parameters in a second round; and in the third stage of training, performing iterative probability sampling and gradient calculation on the training sample subset, and updating the model parameters in the third round. The invention improves the training phase of the current mainstream model, and the improved model greatly improves the model precision and generalization capability in the ICD coding multi-label classification task.

Description

Multi-label international disease classification training method based on course learning

Technical Field

The invention relates to the technical field of natural language processing, in particular to a multi-label international disease classification training method based on course learning.

Background

International Classification of Diseases (ICD) codes are important labels for electronic health records. ICD codes are characterized by the etiology, pathology, clinical manifestations and the like of a disease, and the same class of diseases are grouped into an ordered code combination. The codes are used for quantifying important statistical data, are convenient for finding patient queues with similar diagnosis, and are also used as a standardized information exchange means among hospitals, so that the codes have important value and significance.

The automatic classification of ICD codes of electronic medical records is a significant task. On one hand, the cost of a large amount of manual classification is saved by automatic classification, and on the other hand, accurate classification can effectively assist diagnosis of doctors. In recent years, machine learning-based techniques have been demonstrated to learn classification models using electronic medical records, which require multi-label classification because each electronic case often involves multiple diseases.

The automatic classification model of ICD codes requires high accuracy and generalization capability, however, the ICD codes of electronic cases in some large-scale data sets are extremely unbalanced, which results in low accuracy and poor generalization capability of the model. Taking a famous electronic medical record set MIMIC-III (a set of 55000 electronic medical records provided by a computational physiology laboratory of the Massachusetts institute of technology), the MIMIC-III has 6000 tags distributed in a long tail, and ICD coding multi-tag classification is carried out on the MIMIC-III by three mainstream deep learning models (TextCNN, TextRNN and TextRCNN), and the classification result is shown in Table 1. From the experimental results, it can be observed that the Fscore on the test set is generally low, while the Fscore on the training set is high, and the generalization capability and the accuracy of the three models are not ideal.

TABLE 1 Experimental results of three mainstream models on MIMIC-III

Disclosure of Invention

The invention aims to provide a curriculum learning-based multi-label international disease classification training method, which divides the traditional training process into three stages with different difficulty degrees and improves the precision and generalization capability of a classification model.

In order to achieve the purpose, the technical scheme of the invention is as follows:

a multi-label international disease classification training method based on course learning comprises the following steps:

acquiring a multi-label international disease training sample set, and dividing the multi-label international disease training sample set into a plurality of training sample subsets to form a training sample subset set;

carrying out layered sampling on the training sample subset set to obtain a first-stage small batch sample subset, inputting the first-stage small batch sample subset into a course learning model to carry out gradient calculation, updating parameters of the course learning model, repeating the processes of layered sampling, gradient calculation and parameter updating until reaching a preset iteration number to obtain updated parameters of the course learning model, and configuring to obtain a first-stage training model;

scrambling and segmenting the multi-label international disease training sample set to obtain a second-stage small-batch sample subset, inputting the second-stage small-batch sample subset into a first-stage training model to perform gradient calculation, updating parameters of the first-stage training model, repeating the scrambling and segmenting, gradient calculation and parameter updating processes until preset iteration times are reached to obtain updated second-stage training model parameters, and configuring to obtain a second-stage training model;

and performing probability sampling on the training sample subset set to obtain a third-stage small batch sample subset, inputting the third-stage small batch sample subset into a second-stage training model to perform gradient calculation, updating parameters of the second-stage training model, repeating the processes of probability sampling, gradient calculation and parameter updating until reaching a preset iteration number to obtain updated third-stage training model parameters, and configuring to obtain a trained course learning model.

Further, a multi-label international disease training sample set is obtained and divided into a plurality of training sample subsets to form a training sample subset set, and the method specifically comprises the following steps:

acquiring an electronic medical record, and preprocessing the electronic medical record to obtain a multi-label international disease training sample set;

counting a multi-label international disease training sample set to obtain the probability distribution of international disease labels;

dividing the multi-label international disease training sample set into a plurality of training sample subsets based on international disease labels in the multi-label international disease training sample set to form a training sample subset set.

Further, preprocessing the electronic medical record to obtain a multi-label international disease training sample set, and specifically comprises the following steps:

deleting punctuation marks, numbers, common words and meaningless fields in the electronic medical record to obtain an initial training sample set;

performing word segmentation on the initial training sample set to generate a word segmentation dictionary;

calculating TF-IDF scores of all the participles in the participle dictionary, setting a TF-IDF score threshold range, reserving the participles of which the TF-IDF scores are in the TF-IDF score threshold range, and obtaining the multi-label international disease training sample set.

Further, counting a multi-label international disease training sample set to obtain the probability distribution of the international disease labels, and specifically comprising the following steps:

counting a multi-label international disease training sample set to obtain an international disease label set of the multi-label international disease training sample set;

and calculating the distribution condition of each international disease label according to the international disease label set to obtain the probability distribution of the international disease labels.

Further, the multi-label international disease training sample set is divided into a plurality of training sample subsets according to the international disease labels, and the number of the training sample subsets is the same as that of the international disease labels.

Further, the hierarchical sampling of the training sample subset set specifically includes the following steps:

setting the size of a small-batch sample subset in the first stage as m, and randomly sampling k international disease labels from the international disease label set;

and selecting k training sample subsets corresponding to k international disease labels from the training sample subset set, and respectively randomly sampling m samples for each training sample subset in the k training sample subsets to form k first-stage small-batch sample subsets.

Further, the scrambling and segmenting the multi-label international disease training sample set comprises the following specific steps:

and fully shuffling the multi-label international disease training sample set, and dividing the multi-label international disease training sample set into k non-overlapping equal parts with the size of m to form k second-stage small-batch sample subsets.

Further, performing probabilistic sampling on the training sample subset set includes the specific steps of:

probability sampling k international disease labels based on the probability distribution of the international disease labels in the multi-label international disease training sample set;

and selecting k training sample subsets corresponding to k international disease labels from the training sample subset set, and respectively randomly sampling one sample for each of the k training sample subsets to form k third-stage small-batch sample subsets with the size of m.

Further, in one iteration, the first-stage small-batch sample subset, the second-stage small-batch sample subset, and the third-stage small-batch sample subset are input into the training model for gradient calculation and updating of the parameters of the training model in the same manner, and the method for gradient calculation and updating of the parameters of the training model specifically includes the following steps:

s1, acquiring a training model of the current stage;

s2, sequentially inputting samples in the small batch sample subset of the current stage into a training model, calculating through a loss function to obtain loss values with the same number as the samples, and averaging the loss values to obtain the loss of the iteration of the current stage;

s3, estimating gradient updating parameters based on the training model, the loss sample subset and the small batch sample subset, and updating the training model according to the gradient updating parameters;

and S4, repeating the steps S1-S3 until all the small batch sample subsets are input into the training model to perform gradient calculation and update the parameters of the training model, and ending the iteration.

Compared with the prior art, the invention has the following advantages:

the multi-label international disease classification automatic coding method based on course learning improves three current mainstream models of textCNN, textRNN and textRCNN. The label distribution of the small-batch sample subset is controlled by three different small-batch sample sampling methods, and the label distribution represents the difficulty level of the stage. For multi-label classification of ICD codes, labels with more samples are easy to learn, and labels with less samples are difficult to learn in the training process, so that a hierarchical sampling method is used in the first stage of the training process to ensure that a small batch of sample subsets contain a large number of samples with different labels, and a model learns all kinds of labels quickly in the early stage of training; in the second stage of the training process, a small-batch sample subset is obtained by using a conventional scrambling and slicing algorithm; and the third training stage uses a guided probabilistic sampling method to make the distribution of the small batch sample subset closer to that of the original data set. The improved course learning model greatly improves the model precision and generalization capability on the ICD coding multi-label classification task.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive exercise.

FIG. 1 is a schematic flow chart of three training phases of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The examples are given for the purpose of better illustration of the invention, but the invention is not limited to the examples. Therefore, those skilled in the art should make insubstantial modifications and adaptations to the embodiments of the present invention in light of the above teachings and remain within the scope of the invention.

Example 1

The Medical Information Mart for Intensive Care (MIMIC) data set is a Medical development data set based on Intensive Care patient monitoring. The method aims to promote medical research and improve ICU decision support level. In this embodiment, the Discharge record (Discharge summary) in the text record event table (notes) in the MIMIC is used as an electronic case, and the ICD-9 code corresponding to the electronic case is predicted.

In this embodiment, data cleansing work is performed on the original electronic case. After punctuation, numbers, stop words and some meaningless fields like "advice Date" in the case are removed, the entire dataset is tokenized and a tokenization dictionary is generated. And then calculating the TF-IDF score of each participle in the dictionary, wherein the TF-IDF can evaluate the importance degree of the participle to a corpus. 10000 word segmentations with TF-IDF scores within the preset threshold range are reserved, while the word segmentations with TF-IDF scores not within the preset threshold range are deleted, and the detailed statistical information of the processed data set is shown in Table 2.

TABLE 2 detailed statistics of data set MIMIC

MIMIC contains 55177 electronic cases in total, including 6919 ICD-9 codes. The processed data set has 898 participles and 11 labels per sample on average, and all samples form a training sample set

Where n is the total number of training samples.

Statistical training sample set

Obtaining a training sample label set

And c is the total label number of the sample. Dividing a sample set D into c subsets according to the labels of the samples:

obtaining a sample subset set layered according to labels

。

There are two types of evaluation indicators for the multi-label classification task, sample-based measurements and label-based measurements, respectively. Evaluating the performance of a model using tag-based measurements, including

、

And

the calculation method is as follows:

，

，

。

the MIMIC data set is randomly divided into a training data set, a verification data set and a test data set in a ratio of 7:1: 2. Using Adam optimization program, the learning rate was set to 0.008 and the total number of iterations was 150 rounds.

Using three models of TextCNN, TextRNN, and TextRCNN, 150 rounds of ICD-encoded multi-label classification are performed on the MIMIC dataset multiple times, and the average result is taken as a baseline. And then, the three models are trained and improved by using the method of the invention, and multi-label classification of multi-group ICD codes is carried out.

In the multi-label international disease classification training method based on curriculum learning in this embodiment, the traditional training process is divided into three stages with different difficulty levels, and the flow diagrams of the three training stages are shown in fig. 1.

Training in the first stage:

setting the size of the small-batch sample subset as m, firstly randomly sampling k labels from a label set L, and then randomly sampling k labels from the label set L

Selecting k training sample subsets corresponding to k international disease labels, and respectively randomly sampling m samples for each training sample subset in the k training sample subsets to form k first small-batch sample subsets;

inputting the first small batch sample subset into a course learning model to sequentially calculate gradients, and updating model parameters;

and repeating the processes of layered sampling, gradient calculation and parameter updating until reaching a preset iteration number to obtain updated course learning model parameters, and configuring to obtain a first-stage training model.

And (3) training in the second stage:

collecting the samples

Fully shuffling the samples and dividing the samples into k non-overlapping equal parts with the size of m to form k second-stage small-batch sample subsets;

inputting the second small batch sample subset into the first-stage training model to sequentially calculate the gradient, and updating the parameters of the first-stage training model;

and repeating the steps of scrambling and segmenting, gradient calculating and model parameter updating until reaching the preset iteration times, and configuring the second-stage training model according to the updated model parameters obtained by the second-stage training.

And (3) training in a third stage:

a. computing a sample set

The probability distribution P of each label.

b. Firstly, under the guidance of P, k labels are probabilistically sampled, and then

K training sample subsets corresponding to k international disease labels are selected, one sample is randomly sampled for each training sample subset in the k training sample subsets, and k third-stage small-batch sample subsets with the size of m are formed.

c. And inputting the third small batch sample subset into the second-stage training model to sequentially calculate the gradient, and updating the parameters of the second-stage training model.

d. And c, repeating the steps b and c until a preset iteration number is reached, and configuring a third-stage training model according to the updated model parameters obtained in the third-stage training stage, namely the course learning model with the finally obtained parameters updated.

In the first stage, the second stage and the third stage training process, after a small batch of sample subsets are input into the model in each iteration, the steps of calculating the gradient and updating the model parameters are consistent, and the steps of calculating the gradient and updating the model parameters specifically comprise the following steps:

s1, obtaining k small batch sample subsets B through the current-stage sampling method, and forming a small batch sample subset set required by one iteration

Where k is equal to the training set size | D | divided by the mini-batch sample subset size m.

S2, obtaining the model θ of the current stage, the initial state of θ is generally randomly drawn from a distribution (e.g., a uniform distribution).

S3, mixing

M samples in the middle and small batch sample subset B are sequentially sent into a model theta and finally obtained through loss function calculationmAveraging the loss values to obtain the loss of the iteration of the current roundl。

S4, according to the model theta, the loss l and the small batch sample subset B, the gradient update Δ theta can be estimated, and the model is updated: θ = θ - μ Δ θ, where μ is the over-parameter learning rate.

S5, repeating the steps S2-S4k times until the use

And (5) completing model updating on all small batch sample subsets, and ending the iteration.

The difficulty of the three sampling methods determines the training phase in which they are located, and the simpler the method should be performed earlier in the training. In the course learning-based multi-label international disease classification training method in the embodiment, the model is expected to be contacted with the small-batch sample subset with balanced label types in the early training stage, and the model is contacted with the small-batch sample subset with the label distribution close to the training sample in the later training stage. In the embodiment, a method for measuring the class balance of the small-batch sample subset and two methods for measuring the distribution difference of the sample labels are selected to measure the difficulty level.

Standard deviation of number of various labels

Whether the distribution of the labels of the small-batch sample subset is balanced or not can be well measured, the smaller the numerical value is, the more balanced the label distribution is, and the higher the label diversity is. The calculation formula is as follows:

where c is the number of label categories,

for the number of samples containing the ith label,

is c number

Average of (d).

The K _ L distance is used to measure the closeness between two probability distributions,

representing the distance of the distribution p compared with the distribution q, the larger the numerical value is, the larger the distribution gap is, when p = q

The formula is as follows:

wherein, in the step (A),

and

are respectively provided withRepresenting the probability that the ith element occurs in probability distributions p and q. The K _ L distance has an asymmetry,

in most cases not equal to

。

The J _ S distance is an optimization of the K _ L distance, and compared with the K _ L distance, the J _ S distance has symmetry and can better measure two probability distribution distances, namely

. The formula is as follows:

the K _ L distance and the J _ S distance can well measure the distance of label distribution between the small batch sample subset and the original training data set.

The difficulty measurements for the three methods are shown in table 3.

TABLE 3 Small batch sample subset difficulty measurement

According to the results, the label class balance in the small-batch sample subset can be well kept by using the hierarchical sampling strategy, and compared with the two methods, the probability sampling method can better keep the sample label distribution of the training set. By combining two factors, courses are trained according to the sequence of the layering sampling (2), the scrambling segmentation (0) and the probability sampling (1) from easy to difficult through the difficulty measurer of the experiment.

And inputting the test data set into three course learning models of TextCNN, TextRNN and TextRCNN after the parameters are updated, and comparing the classification result with the classification result obtained by the three course learning models which are not updated.

Table 4 shows the experimental results of the three models, and the lesson learning-based multi-label international disease classification automatic encoding method improves the Fscore of the three models TextCNN, TextRNN, and TextRCNN by 33.4%, 37.4%, and 45%, respectively, and the improvement effect is very obvious. The classification method in the embodiment is proved to be capable of improving the performance of ICD coding multi-label classification of the model on large-scale data with unbalanced label distribution.

TABLE 4 Classification results of course learning model after three parameters update

Table 5 shows the generalization ability of the classification method in this example on MIMIC-III. It can be seen that the training set Fscore and the test set Fscore of the three models are very close. By comparing the results with the results in table 4, it can be proved that the classification method in the embodiment can effectively improve the generalization capability of the model.

TABLE 5 generalization ability of the classification method in this example on MIMIC-III

Table 6 shows the results of textCNN and textRCNN under the same conditions (150 runs at 7000 for minilot size) in various other sequences for the three sampling methods, all averaged over multiple experimental knots. The two models simultaneously achieved the best results in the 2-0-1 sampling mode, which is consistent with the expectation, demonstrating the correctness of the difficulty measurement method used in this example.

TABLE 6 results of course learning model under all sampling modes

While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. A multi-label international disease classification training method based on course learning is characterized by comprising the following steps:

2. The international multi-label disease classification training method according to claim 1, wherein a multi-label international disease training sample set is obtained and divided into a plurality of training sample subsets to form a training sample subset set, and the method specifically comprises the following steps:

3. The multi-label international disease classification training method according to claim 2, wherein the electronic medical record is preprocessed to obtain a multi-label international disease training sample set, and the method specifically comprises the following steps:

4. The multi-label international disease classification training method according to claim 2, wherein the method for obtaining the international disease label probability distribution by counting the multi-label international disease training sample set specifically comprises the following steps:

5. The multi-label international disease classification training method of claim 2, characterized in that: dividing the multi-label international disease training sample set into a plurality of training sample subsets according to international disease labels, wherein the number of the training sample subsets is the same as that of the international disease labels.

6. The multi-label international disease classification training method according to claim 4, wherein the hierarchical sampling of the training sample subset set specifically comprises the steps of:

7. The multi-label international disease classification training method according to claim 1, wherein the scrambling and segmentation of the multi-label international disease training sample set comprises the specific steps of:

8. The multi-label international disease classification training method according to claim 2, characterized in that the probability sampling of the training sample subset set comprises the specific steps of:

9. The multi-label international disease classification training method of claim 1, wherein in one iteration, the first-stage small-batch sample subset, the second-stage small-batch sample subset and the third-stage small-batch sample subset are input into a training model to perform gradient calculation and update training model parameters in the same way, and the gradient calculation and the update training model parameters specifically include the following steps:

s1, acquiring a training model of the current stage;