CN114048320B - Multi-label international disease classification training method based on curriculum learning - Google Patents
- Publication number: CN114048320B
- Application number: CN202210029712A
- Authority: CN (China)
- Legal status: Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H10/00—ICT specially adapted for the handling or processing of patient-related medical or healthcare data
- G16H10/60—ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records
Abstract
The invention discloses a multi-label international disease classification training method based on curriculum learning, which controls the label distribution of mini-batches through three different sampling methods when automatically coding large-scale international disease data sets. First, a multi-label international disease training sample set is obtained and divided into a number of training sample subsets. In the first training stage, stratified sampling, gradient calculation and a first round of parameter updates are performed iteratively on the training sample subsets. In the second stage, the training sample set is iteratively shuffled and split, with gradient calculation and a second round of parameter updates. In the third stage, probabilistic sampling, gradient calculation and a third round of parameter updates are performed iteratively on the training sample subsets. The invention improves the training phase of the current mainstream models; on the ICD-coding multi-label classification task the improved models show greatly increased accuracy and generalization ability.
Description
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a multi-label international disease classification training method based on curriculum learning.
Background
International Classification of Diseases (ICD) codes are important labels for electronic health records. ICD codes characterize the etiology, pathology and clinical manifestations of a disease, grouping diseases of the same class into an ordered code combination. The codes are used to compute important statistics, make it easy to find cohorts of patients with similar diagnoses, and serve as a standardized means of information exchange between hospitals, so they have great value and significance.
Automatic classification of the ICD codes of electronic medical records is a significant task. On the one hand, automatic classification saves the cost of large-scale manual coding; on the other hand, accurate classification can effectively assist physicians' diagnoses. In recent years, machine-learning techniques have been shown to learn classification models from electronic medical records; because each record often involves multiple diseases, this requires multi-label classification.
An automatic ICD-code classification model requires high accuracy and generalization ability; however, the ICD codes of records in some large-scale data sets are extremely imbalanced, which leads to low accuracy and poor generalization. Take the well-known electronic medical record set MIMIC-III (a set of about 55,000 records provided by the MIT Laboratory for Computational Physiology): MIMIC-III has roughly 6,000 labels in a long-tailed distribution. Running ICD-coding multi-label classification on MIMIC-III with three mainstream deep learning models (TextCNN, TextRNN and TextRCNN) gives the results in Table 1. The Fscore on the test set is generally low while the Fscore on the training set is high, so neither the generalization ability nor the accuracy of the three models is satisfactory.
TABLE 1 Experimental results of three mainstream models on MIMIC-III
Disclosure of Invention
The invention aims to provide a curriculum learning-based multi-label international disease classification training method, which divides the traditional training process into three stages of increasing difficulty and improves the accuracy and generalization ability of the classification model.
In order to achieve the purpose, the technical scheme of the invention is as follows:
a multi-label international disease classification training method based on curriculum learning comprises the following steps:
obtaining a multi-label international disease training sample set, and dividing it into a plurality of training sample subsets to form a set of training sample subsets;
performing stratified sampling on the training sample subsets to obtain first-stage mini-batches, inputting the first-stage mini-batches into a curriculum learning model for gradient calculation, updating the parameters of the curriculum learning model, and repeating the stratified sampling, gradient calculation and parameter updating until a preset number of iterations is reached, the updated parameters configuring a first-stage training model;
shuffling and splitting the multi-label international disease training sample set to obtain second-stage mini-batches, inputting the second-stage mini-batches into the first-stage training model for gradient calculation, updating its parameters, and repeating the shuffle-and-split, gradient calculation and parameter updating until a preset number of iterations is reached, the updated parameters configuring a second-stage training model;
and performing probabilistic sampling on the training sample subsets to obtain third-stage mini-batches, inputting the third-stage mini-batches into the second-stage training model for gradient calculation, updating its parameters, and repeating the probabilistic sampling, gradient calculation and parameter updating until a preset number of iterations is reached, the updated parameters configuring the trained curriculum learning model.
Further, a multi-label international disease training sample set is obtained and divided into a plurality of training sample subsets to form a training sample subset set, and the method specifically comprises the following steps:
acquiring an electronic medical record, and preprocessing the electronic medical record to obtain a multi-label international disease training sample set;
counting a multi-label international disease training sample set to obtain the probability distribution of international disease labels;
dividing the multi-label international disease training sample set into a plurality of training sample subsets based on international disease labels in the multi-label international disease training sample set to form a training sample subset set.
Further, preprocessing the electronic medical records to obtain the multi-label international disease training sample set specifically comprises the following steps:
deleting punctuation marks, numbers, stop words and meaningless fields in the electronic medical records to obtain an initial training sample set;
tokenizing the initial training sample set to generate a token dictionary;
calculating the TF-IDF score of each token in the dictionary, setting a TF-IDF score threshold range, and retaining the tokens whose TF-IDF scores fall within the threshold range, obtaining the multi-label international disease training sample set.
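The TF-IDF filtering step described above can be sketched in pure Python. This is a minimal illustration, not the patent's implementation: the tokenizer, the corpus-level TF-IDF formula and the `keep_range` threshold values are all assumptions.

```python
import math
import re
from collections import Counter

def tfidf_filter(documents, keep_range=(0.1, 10.0)):
    """Tokenize documents, score each token by corpus-level TF-IDF
    (tf * log(N / df)) and keep only tokens whose score falls inside
    keep_range, mirroring the threshold-range filtering step."""
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    n_docs = len(tokenized)
    tf = Counter(t for doc in tokenized for t in doc)       # corpus term frequency
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    lo, hi = keep_range
    vocab = {t for t, s in scores.items() if lo <= s <= hi}
    # drop out-of-range tokens, keeping the original order within each document
    return [[t for t in doc if t in vocab] for doc in tokenized]
```

A token appearing in every document (e.g. "heart" below) gets score 0 and is filtered out, which is the intended effect of the lower threshold.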
Further, counting a multi-label international disease training sample set to obtain the probability distribution of the international disease labels, and specifically comprising the following steps:
counting a multi-label international disease training sample set to obtain an international disease label set of the multi-label international disease training sample set;
and calculating the distribution condition of each international disease label according to the international disease label set to obtain the probability distribution of the international disease labels.
Further, the multi-label international disease training sample set is divided into a plurality of training sample subsets according to the international disease labels, and the number of the training sample subsets is the same as that of the international disease labels.
Further, the stratified sampling of the training sample subsets specifically comprises the following steps:
setting the first-stage mini-batch size to m, and randomly sampling k international disease labels from the international disease label set;
selecting the k training sample subsets corresponding to the k labels from the set of training sample subsets, and randomly sampling m samples from each of the k subsets, forming k first-stage mini-batches.
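A literal reading of the two steps above can be sketched as follows; `subsets_by_label` (label → list of samples) and the with-replacement fallback for small subsets are illustrative assumptions, not details given in the patent.

```python
import random

def stratified_minibatches(subsets_by_label, k, m, rng=random):
    """First-stage sampling: draw k labels uniformly at random, then
    m samples from each drawn label's subset, giving k mini-batches."""
    labels = rng.sample(list(subsets_by_label), k)
    batches = []
    for lab in labels:
        pool = subsets_by_label[lab]
        # assumption: sample with replacement when the subset has fewer than m samples
        batch = rng.choices(pool, k=m) if len(pool) < m else rng.sample(pool, m)
        batches.append(batch)
    return batches
```

Because the k labels are drawn uniformly rather than by frequency, rare labels appear in mini-batches as often as common ones, which is what makes this the "easy", balanced stage.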
Further, shuffling and splitting the multi-label international disease training sample set comprises the following step:
fully shuffling the multi-label international disease training sample set and dividing it into k non-overlapping equal parts of size m, forming k second-stage mini-batches.
Further, the probabilistic sampling of the training sample subsets comprises the following steps:
probabilistically sampling international disease labels according to the probability distribution of the labels in the multi-label international disease training sample set;
selecting the training sample subsets corresponding to the sampled labels from the set of training sample subsets and randomly drawing one sample from each, forming k third-stage mini-batches of size m.
Further, within one iteration the first-stage, second-stage and third-stage mini-batches are input into the training model for gradient calculation and parameter updating in the same manner, which specifically comprises the following steps:
S1, obtaining the training model of the current stage;
S2, feeding the samples of the current mini-batch into the training model in turn, computing one loss value per sample through the loss function, and averaging the loss values to obtain the loss of the current iteration;
S3, estimating the gradient update from the training model, the loss and the mini-batch, and updating the training model with the estimated gradient;
S4, repeating steps S1-S3 until all mini-batches have been input into the training model for gradient calculation and parameter updating, ending the iteration.
Compared with the prior art, the invention has the following advantages:
the multi-label international disease classification automatic coding method based on course learning improves three current mainstream models of textCNN, textRNN and textRCNN. The label distribution of the small-batch sample subset is controlled by three different small-batch sample sampling methods, and the label distribution represents the difficulty level of the stage. For multi-label classification of ICD codes, labels with more samples are easy to learn, and labels with less samples are difficult to learn in the training process, so that a hierarchical sampling method is used in the first stage of the training process to ensure that a small batch of sample subsets contain a large number of samples with different labels, and a model learns all kinds of labels quickly in the early stage of training; in the second stage of the training process, a small-batch sample subset is obtained by using a conventional scrambling and slicing algorithm; and the third training stage uses a guided probabilistic sampling method to make the distribution of the small batch sample subset closer to that of the original data set. The improved course learning model greatly improves the model precision and generalization capability on the ICD coding multi-label classification task.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive exercise.
FIG. 1 is a schematic flow chart of three training phases of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The examples are given for the purpose of better illustration of the invention, but the invention is not limited to the examples. Therefore, those skilled in the art should make insubstantial modifications and adaptations to the embodiments of the present invention in light of the above teachings and remain within the scope of the invention.
Example 1
The Medical Information Mart for Intensive Care (MIMIC) data set is an open medical data set built from intensive-care patient monitoring. It aims to promote medical research and improve ICU decision support. In this embodiment, the discharge summaries in the text record event table (NOTEEVENTS) of MIMIC are used as electronic medical records, and the ICD-9 codes corresponding to each record are predicted.
In this embodiment, the original electronic medical records are first cleaned. After punctuation, numbers, stop words and some meaningless fields such as "advice Date" are removed, the entire data set is tokenized and a token dictionary is generated. The TF-IDF score of each token in the dictionary is then calculated; TF-IDF evaluates how important a token is to the corpus. The 10,000 tokens whose TF-IDF scores fall within the preset threshold range are retained, while the remaining tokens are deleted. Detailed statistics of the processed data set are shown in Table 2.
TABLE 2 detailed statistics of data set MIMIC
MIMIC contains 55,177 electronic medical records in total, covering 6,919 ICD-9 codes. The processed data set has on average 898 tokens and 11 labels per sample, and all samples form a training sample set D = {(x_1, Y_1), (x_2, Y_2), ..., (x_n, Y_n)}, where n is the total number of training samples.
Counting the training sample set yields the label set L = {l_1, l_2, ..., l_c}, where c is the total number of labels. According to the labels of the samples, the sample set D is divided into c subsets D_1, D_2, ..., D_c, giving the set of sample subsets S = {D_1, D_2, ..., D_c} stratified by label.
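The partition of D into per-label subsets can be sketched as follows; since each sample carries several labels, it must appear in the subset of every label it carries, so the subsets overlap. The sample representation `(text, labels)` is an illustrative assumption.

```python
from collections import defaultdict

def split_by_label(samples):
    """Partition the training set D into per-label subsets D_1..D_c.
    A multi-label sample (text, labels) is placed into the subset of
    every label it carries, so the subsets may overlap."""
    subsets = defaultdict(list)
    for text, labels in samples:
        for lab in labels:
            subsets[lab].append((text, labels))
    return dict(subsets)
```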
There are two types of evaluation metrics for the multi-label classification task: sample-based measures and label-based measures. The performance of a model is evaluated here with label-based measures, including precision, recall and Fscore, calculated as follows: Precision = TP / (TP + FP), Recall = TP / (TP + FN), Fscore = 2 · Precision · Recall / (Precision + Recall).
the MIMIC data set is randomly divided into a training data set, a verification data set and a test data set in a ratio of 7:1: 2. Using Adam optimization program, the learning rate was set to 0.008 and the total number of iterations was 150 rounds.
Using the TextCNN, TextRNN and TextRCNN models, 150 rounds of ICD-code multi-label classification are run on the MIMIC data set several times, and the average result is taken as the baseline. The three models are then trained with the method of the invention, and several further groups of ICD-code multi-label classification runs are performed.
The curriculum learning-based multi-label international disease classification training method of this embodiment divides the traditional training process into three stages of increasing difficulty; the flow of the three training stages is shown in FIG. 1.
First-stage training:
setting the mini-batch size to m, k labels are first randomly sampled from the label set L; then the k training sample subsets corresponding to the k sampled labels are selected, and m samples are randomly drawn from each of them, forming k first-stage mini-batches;
the first-stage mini-batches are input into the curriculum learning model, gradients are calculated in turn, and the model parameters are updated;
the stratified sampling, gradient calculation and parameter updating are repeated until the preset number of iterations is reached, and the updated parameters configure the first-stage training model.
Second-stage training:
the sample set D is fully shuffled and divided into k non-overlapping equal parts of size m, forming k second-stage mini-batches;
the second-stage mini-batches are input into the first-stage training model, gradients are calculated in turn, and its parameters are updated;
the shuffle-and-split, gradient calculation and parameter updating are repeated until the preset number of iterations is reached, and the updated parameters configure the second-stage training model.
Third-stage training:
b. under the guidance of the label probability distribution P, k labels are first probabilistically sampled; then the training sample subsets corresponding to the sampled labels are selected from S, one sample is randomly drawn per sampled label, and k third-stage mini-batches of size m are formed.
c. The third-stage mini-batches are input into the second-stage training model, gradients are calculated in turn, and its parameters are updated.
d. Steps b and c are repeated until the preset number of iterations is reached; the updated parameters obtained in the third stage configure the third-stage training model, i.e. the final trained curriculum learning model.
In the first-, second- and third-stage training, after the mini-batches are input into the model, each iteration calculates gradients and updates the model parameters in the same way, specifically comprising the following steps:
S1, k mini-batches B_1, ..., B_k are obtained with the sampling method of the current stage, forming the mini-batch set required for one iteration, where k equals the training-set size |D| divided by the mini-batch size m.
S2, the model θ of the current stage is obtained; the initial state of θ is generally drawn at random from a distribution (e.g. a uniform distribution).
S3, the m samples of mini-batch B are fed into the model θ in turn; the loss function yields m loss values, which are averaged to obtain the loss l of the current iteration.
S4, from the model θ, the loss l and the mini-batch B, the gradient update Δθ is estimated and the model is updated: θ = θ − μΔθ, where μ is the learning-rate hyperparameter.
S5, steps S2-S4 are repeated k times, until the model has been updated on all k mini-batches, ending the iteration.
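Steps S1-S5 can be sketched in pure Python. The toy squared-error model below is an illustrative stand-in for the TextCNN/TextRNN/TextRCNN networks, and all names and hyperparameters are assumptions; only the loop structure (average the per-sample losses, estimate a gradient, apply θ = θ − μΔθ for each of the k mini-batches) follows the text.

```python
import random

def run_stage(theta, batches, loss_grad, lr=0.008):
    """One iteration of a training stage (steps S2-S5): for each of the
    k mini-batches, per-sample losses are averaged, a gradient is
    estimated, and the parameters are updated theta = theta - lr * grad."""
    for batch in batches:                                # S5: loop over all k mini-batches
        pairs = [loss_grad(theta, sample) for sample in batch]
        loss = sum(l for l, _ in pairs) / len(pairs)     # S3: averaged loss l (tracked per iteration)
        grad = sum(g for _, g in pairs) / len(pairs)     # S4: estimated gradient
        theta = theta - lr * grad                        # update with learning rate mu
    return theta

def sq_loss_grad(theta, target):
    """Toy per-sample loss: squared error of a scalar parameter."""
    return (theta - target) ** 2, 2 * (theta - target)
```

For example, `run_stage(0.0, [[1.0]] * 50, sq_loss_grad, lr=0.1)` moves theta close to the target 1.0, each mini-batch contributing one update.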
The difficulty of the three sampling methods determines the stage in which each is used: the simpler the method, the earlier in training it should be applied. In the curriculum learning-based method of this embodiment, the model should see mini-batches with balanced label classes early in training, and mini-batches whose label distribution is close to that of the training samples late in training. This embodiment therefore uses one measure of mini-batch class balance and two measures of label-distribution difference to quantify difficulty.
The standard deviation of the per-label sample counts measures whether the label distribution of a mini-batch is balanced: the smaller the value, the more balanced the distribution and the higher the label diversity. It is calculated as σ = sqrt((1/c) · Σ_{i=1..c} (n_i − n̄)²), where c is the number of label classes, n_i is the number of samples containing the i-th label, and n̄ is the mean of the c values n_i.
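The balance measure above is a plain population standard deviation over the per-label counts and can be sketched directly:

```python
import math

def label_balance_std(label_counts):
    """Standard deviation of the per-label sample counts n_1..n_c of a
    mini-batch; smaller values mean a more balanced label distribution."""
    c = len(label_counts)
    mean = sum(label_counts) / c
    return math.sqrt(sum((n - mean) ** 2 for n in label_counts) / c)
```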
The K-L distance measures the closeness of two probability distributions: KL(p‖q) denotes the distance of distribution p from distribution q, and the larger the value, the larger the gap; when p = q, KL(p‖q) = 0. The formula is KL(p‖q) = Σ_i p_i · log(p_i / q_i), where p_i and q_i denote the probability of the i-th element under p and q respectively. The K-L distance is asymmetric: KL(p‖q) is in most cases not equal to KL(q‖p).
The J-S distance is a refinement of the K-L distance; unlike it, the J-S distance is symmetric, JS(p‖q) = JS(q‖p), and better measures the distance between two probability distributions. The formula is JS(p‖q) = ½·KL(p‖m) + ½·KL(q‖m), where m = (p + q)/2. Both the K-L and J-S distances measure the distance between the label distribution of a mini-batch and that of the original training data set.
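The two distribution distances can be sketched over probability vectors as follows (a minimal version: it assumes q_i > 0 wherever p_i > 0, and uses the natural logarithm):

```python
import math

def kl_distance(p, q):
    """KL(p || q) = sum_i p_i * log(p_i / q_i); zero iff p == q, asymmetric."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """JS(p || q) = 0.5*KL(p||m) + 0.5*KL(q||m) with m = (p + q)/2; symmetric."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_distance(p, m) + 0.5 * kl_distance(q, m)
```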
The difficulty measurements for the three methods are shown in table 3.
TABLE 3 Mini-batch difficulty measures
According to these results, the stratified sampling strategy keeps the label classes within a mini-batch well balanced, while the probabilistic sampling method best preserves the label distribution of the training set. Combining the two factors, the difficulty measures of this experiment order the curriculum from easy to hard as stratified sampling (mode 2), shuffle-and-split (mode 0) and probabilistic sampling (mode 1).
The test data set is input into the TextCNN, TextRNN and TextRCNN curriculum learning models with updated parameters, and the classification results are compared with those of the three models trained without the method.
Table 4 shows the experimental results of the three models: the curriculum learning-based multi-label international disease classification method improves the Fscore of TextCNN, TextRNN and TextRCNN by 33.4%, 37.4% and 45% respectively, a very substantial improvement. This shows that the classification method of this embodiment improves the ICD-coding multi-label classification performance of a model on large-scale data with an imbalanced label distribution.
TABLE 4 Classification results of the three curriculum learning models after parameter updating
Table 5 shows the generalization ability of the classification method of this embodiment on MIMIC-III. The training-set Fscore and test-set Fscore of the three models are very close. Compared with the results in Table 4, this shows that the method effectively improves the generalization ability of the models.
TABLE 5 generalization ability of the classification method in this example on MIMIC-III
Table 6 shows the results of TextCNN and TextRCNN under the same conditions (mini-batch size 7000, 150 rounds) for the other orderings of the three sampling methods, all averaged over several experiments. Both models achieve their best results with the 2-0-1 sampling order, which matches the expectation and confirms the correctness of the difficulty measures used in this embodiment.
TABLE 6 Results of the curriculum learning models under all sampling orders
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.
Claims (9)
1. A multi-label international disease classification training method based on course learning is characterized by comprising the following steps:
acquiring a multi-label international disease training sample set, and dividing the multi-label international disease training sample set into a plurality of training sample subsets to form a training sample subset set;
carrying out layered sampling on the training sample subset set to obtain a first-stage small batch sample subset, inputting the first-stage small batch sample subset into a course learning model to carry out gradient calculation, updating parameters of the course learning model, repeating the processes of layered sampling, gradient calculation and parameter updating until reaching a preset iteration number to obtain updated parameters of the course learning model, and configuring to obtain a first-stage training model;
shuffling and segmenting the multi-label international disease training sample set to obtain second-stage small-batch sample subsets, inputting the second-stage small-batch sample subsets into the first-stage training model for gradient calculation, and updating parameters of the first-stage training model; repeating the shuffling and segmenting, gradient calculation and parameter updating until a preset number of iterations is reached to obtain updated second-stage training model parameters, and configuring them to obtain a second-stage training model;
and performing probability sampling on the training sample subset set to obtain third-stage small-batch sample subsets, inputting the third-stage small-batch sample subsets into the second-stage training model for gradient calculation, and updating parameters of the second-stage training model; repeating the probability sampling, gradient calculation and parameter updating until a preset number of iterations is reached to obtain updated third-stage training model parameters, and configuring them to obtain a trained course learning model.
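The three-stage schedule of claim 1 can be sketched in a few lines of framework-agnostic Python. This is an illustrative assumption, not the patent's implementation: `model.update`, the three sampler callables and the iteration counts are hypothetical placeholders for the stratified, shuffled and probability-weighted batch builders described in the later claims.

```python
# Illustrative sketch of the three-stage curriculum schedule in claim 1.
# `model` is assumed to expose an update(batch) method that performs one
# gradient calculation and parameter update; each sampler is assumed to
# return a fresh list of mini-batches on every call.
def run_stage(model, make_batches, iterations):
    for _ in range(iterations):
        for batch in make_batches():   # stage-specific mini-batches
            model.update(batch)        # gradient step + parameter update
    return model

def curriculum_train(model, stratified, shuffled, weighted, iters=(5, 5, 5)):
    model = run_stage(model, stratified, iters[0])  # stage 1: balanced label view
    model = run_stage(model, shuffled, iters[1])    # stage 2: true data order
    model = run_stage(model, weighted, iters[2])    # stage 3: label-frequency weighted
    return model
```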
2. The multi-label international disease classification training method according to claim 1, wherein acquiring a multi-label international disease training sample set and dividing it into a plurality of training sample subsets to form a training sample subset set specifically comprises the following steps:
acquiring an electronic medical record, and preprocessing the electronic medical record to obtain a multi-label international disease training sample set;
counting a multi-label international disease training sample set to obtain the probability distribution of international disease labels;
dividing the multi-label international disease training sample set into a plurality of training sample subsets based on international disease labels in the multi-label international disease training sample set to form a training sample subset set.
3. The multi-label international disease classification training method according to claim 2, wherein the electronic medical record is preprocessed to obtain a multi-label international disease training sample set, and the method specifically comprises the following steps:
deleting punctuation marks, numbers, common words and meaningless fields in the electronic medical record to obtain an initial training sample set;
performing word segmentation on the initial training sample set to generate a word segmentation dictionary;
calculating TF-IDF scores for all tokens in the word segmentation dictionary, setting a TF-IDF score threshold range, and retaining the tokens whose TF-IDF scores fall within the threshold range to obtain the multi-label international disease training sample set.
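The TF-IDF filtering step of claim 3 can be sketched as follows; the scoring formula (term frequency times log inverse document frequency, keeping each token's highest score over the corpus) and the threshold values are illustrative assumptions, since the claim does not fix a particular TF-IDF variant.

```python
# Hypothetical sketch of the preprocessing in claim 3: score every token
# over a word-segmented corpus and keep only tokens whose TF-IDF score
# falls inside a chosen threshold range.
import math
from collections import Counter

def tfidf_filter(docs, low, high):
    """docs: list of token lists. Keeps tokens whose best TF-IDF score
    (over the corpus) lies in [low, high]."""
    n_docs = len(docs)
    df = Counter()                       # document frequency per token
    for doc in docs:
        df.update(set(doc))
    best = {}                            # highest TF-IDF seen per token
    for doc in docs:
        tf = Counter(doc)
        for tok, cnt in tf.items():
            score = (cnt / len(doc)) * math.log(n_docs / df[tok])
            best[tok] = max(best.get(tok, 0.0), score)
    keep = {t for t, s in best.items() if low <= s <= high}
    return [[t for t in doc if t in keep] for doc in docs]
```

Ubiquitous tokens get an IDF of zero, so a small positive lower bound already removes them, mirroring the claim's removal of uninformative words.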
4. The multi-label international disease classification training method according to claim 2, wherein the method for obtaining the international disease label probability distribution by counting the multi-label international disease training sample set specifically comprises the following steps:
counting a multi-label international disease training sample set to obtain an international disease label set of the multi-label international disease training sample set;
and calculating the distribution condition of each international disease label according to the international disease label set to obtain the probability distribution of the international disease labels.
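Counting label occurrences and normalizing them into a probability distribution (claim 4) might look like the following sketch; the representation of samples as (text, labels) pairs is an assumption for illustration.

```python
# Illustrative sketch of claim 4: count international disease labels
# across a multi-label training set and normalize the counts into a
# probability distribution.
from collections import Counter

def label_distribution(samples):
    """samples: list of (text, labels) pairs, labels being a list of
    ICD codes. Returns {label: probability} over all label occurrences."""
    counts = Counter()
    for _text, labels in samples:
        counts.update(labels)
    total = sum(counts.values())
    return {lab: c / total for lab, c in counts.items()}
```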
5. The multi-label international disease classification training method of claim 2, characterized in that: dividing the multi-label international disease training sample set into a plurality of training sample subsets according to international disease labels, wherein the number of the training sample subsets is the same as that of the international disease labels.
6. The multi-label international disease classification training method according to claim 4, wherein the stratified sampling of the training sample subset set specifically comprises the following steps:
setting the size of a small-batch sample subset in the first stage as m, and randomly sampling k international disease labels from the international disease label set;
and selecting the k training sample subsets corresponding to the k international disease labels from the training sample subset set, and randomly sampling m samples from each of the k training sample subsets to form k first-stage small-batch sample subsets.
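A possible sketch of the stage-1 stratified sampler of claim 6, assuming sampling with replacement within each label subset (the claim does not specify whether replacement is used):

```python
# Hypothetical sketch of claim 6: draw k labels uniformly at random,
# then build one size-m mini-batch from each chosen label's subset.
import random

def stratified_batches(subsets, k, m, rng=random):
    """subsets: {label: [samples]}. Returns k mini-batches of size m."""
    labels = rng.sample(sorted(subsets), k)   # k distinct labels, uniform
    return [[rng.choice(subsets[lab]) for _ in range(m)]  # with replacement
            for lab in labels]
```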
7. The multi-label international disease classification training method according to claim 1, wherein the scrambling and segmentation of the multi-label international disease training sample set comprises the specific steps of:
and fully shuffling the multi-label international disease training sample set, and dividing the multi-label international disease training sample set into k non-overlapping equal parts with the size of m to form k second-stage small-batch sample subsets.
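Stage 2 (claim 7) reduces to a shuffle followed by an even split; a minimal sketch, assuming any remainder smaller than m is dropped:

```python
# Sketch of claim 7: shuffle the full training set and cut it into
# k non-overlapping mini-batches of size m.
import random

def shuffle_split(samples, m, rng=random):
    pool = list(samples)
    rng.shuffle(pool)                     # full shuffle of the sample set
    k = len(pool) // m                    # drop any remainder smaller than m
    return [pool[i * m:(i + 1) * m] for i in range(k)]
```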
8. The multi-label international disease classification training method according to claim 2, characterized in that the probability sampling of the training sample subset set comprises the specific steps of:
sampling k international disease labels according to the international disease label probability distribution of the multi-label international disease training sample set;
and selecting the k training sample subsets corresponding to the k international disease labels from the training sample subset set, and randomly sampling m samples from each of the k training sample subsets to form k third-stage small-batch sample subsets of size m.
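The stage-3 sampler of claim 8 differs from stage 1 only in that labels are drawn according to their probability distribution rather than uniformly; a sketch under the same with-replacement assumption:

```python
# Hypothetical sketch of claim 8: draw k labels weighted by the label
# probability distribution, then build one size-m mini-batch per label.
import random

def probability_batches(subsets, dist, k, m, rng=random):
    """subsets: {label: [samples]}; dist: {label: probability}."""
    labels = list(dist)
    weights = [dist[lab] for lab in labels]
    picked = rng.choices(labels, weights=weights, k=k)  # weighted, with replacement
    return [[rng.choice(subsets[lab]) for _ in range(m)] for lab in picked]
```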
9. The multi-label international disease classification training method according to claim 1, wherein, within one iteration, the first-stage, second-stage and third-stage small-batch sample subsets are input into their respective training models for gradient calculation and parameter updating in the same way, specifically comprising the following steps:
S1, acquiring the training model of the current stage;
S2, sequentially inputting the samples of the current-stage small-batch sample subset into the training model, obtaining one loss value per sample through the loss function, and averaging the loss values to obtain the loss of the current iteration;
S3, estimating gradient update parameters based on the training model, the loss of the current iteration and the small-batch sample subset, and updating the training model according to the gradient update parameters;
and S4, repeating steps S1-S3 until all small-batch sample subsets have been input into the training model for gradient calculation and parameter updating, ending the iteration.
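Steps S1-S4 of claim 9 can be illustrated with a scalar model and a squared-error loss; both are placeholder assumptions, since the patent leaves the loss function and the network architecture unspecified.

```python
# Illustrative single update step for claim 9: per-sample losses are
# averaged into one iteration loss (S2), and a gradient estimated from
# the batch drives the parameter update (S3). Model: y_hat = w * x.
def sgd_step(w, batch, lr=0.1):
    """batch: list of (x, y) pairs. Returns (updated w, iteration loss)."""
    losses = [(w * x - y) ** 2 for x, y in batch]            # one loss per sample
    loss = sum(losses) / len(losses)                         # averaged iteration loss
    grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    return w - lr * grad, loss                               # parameter update
```

Looping `sgd_step` over every mini-batch of the current stage corresponds to step S4's "repeat until all small-batch sample subsets have been input".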
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210029712.0A CN114048320B (en) | 2022-01-12 | 2022-01-12 | Multi-label international disease classification training method based on course learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114048320A CN114048320A (en) | 2022-02-15 |
CN114048320B true CN114048320B (en) | 2022-03-29 |
Family
ID=80196261
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210029712.0A Active CN114048320B (en) | 2022-01-12 | 2022-01-12 | Multi-label international disease classification training method based on course learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114048320B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116844717B (en) * | 2023-09-01 | 2023-12-22 | 中国人民解放军总医院第一医学中心 | Medical advice recommendation method, system and equipment based on hierarchical multi-label model |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108519978A (en) * | 2018-04-10 | 2018-09-11 | 成都信息工程大学 | A kind of Chinese document segmenting method based on Active Learning |
CN108537270A (en) * | 2018-04-04 | 2018-09-14 | 厦门理工学院 | Image labeling method, terminal device and storage medium based on multi-tag study |
CN109460473A (en) * | 2018-11-21 | 2019-03-12 | 中南大学 | The electronic health record multi-tag classification method with character representation is extracted based on symptom |
US10553319B1 (en) * | 2019-03-14 | 2020-02-04 | Kpn Innovations, Llc | Artificial intelligence systems and methods for vibrant constitutional guidance |
CN111192680A (en) * | 2019-12-25 | 2020-05-22 | 山东众阳健康科技集团有限公司 | Intelligent auxiliary diagnosis method based on deep learning and collective classification |
CN111241301A (en) * | 2020-01-09 | 2020-06-05 | 天津大学 | Knowledge graph representation learning-oriented distributed framework construction method |
CN111460091A (en) * | 2020-03-09 | 2020-07-28 | 杭州麦歌算法科技有限公司 | Medical short text data negative sample sampling method and medical diagnosis standard term mapping model training method |
CN112560900A (en) * | 2020-09-08 | 2021-03-26 | 同济大学 | Multi-disease classifier design method for sample imbalance |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108305158B (en) * | 2017-12-27 | 2020-06-09 | 阿里巴巴集团控股有限公司 | Method, device and equipment for training wind control model and wind control |
Non-Patent Citations (4)
Title |
---|
Improving disease prediction using ICD-9 ontological features;Mihail Popescu et al.;《2011 IEEE International Conference on Fuzzy Systems》;20110901;1-8 * |
Research on class imbalance problems based on Spark; Zhu Wenjing; China Masters' Theses Full-text Database, Information Science and Technology; 20210215; I138-714 *
A part-of-speech-tagged corpus of traditional Chinese medicine syndrome names; You Zhengyang et al.; Electronic Technology & Software Engineering; 20171107; 177-178 *
Research on deep learning methods for ICD disease classification; Zhang Shurui et al.; Computer Engineering and Applications; 20211231; 172-180 *
Also Published As
Publication number | Publication date |
---|---|
CN114048320A (en) | 2022-02-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106777891B (en) | A kind of selection of data characteristics and prediction technique and device | |
CN112256828B (en) | Medical entity relation extraction method, device, computer equipment and readable storage medium | |
CN106934235A (en) | Patient's similarity measurement migratory system between a kind of disease areas based on transfer learning | |
CN109770903A (en) | The classification prediction technique of functional magnetic resonance imaging, system, device | |
CN111343147B (en) | Network attack detection device and method based on deep learning | |
CN113053535B (en) | Medical information prediction system and medical information prediction method | |
CN111243736A (en) | Survival risk assessment method and system | |
WO2021179514A1 (en) | Novel coronavirus patient condition classification system based on artificial intelligence | |
CN112201330A (en) | Medical quality monitoring and evaluating method combining DRGs tool and Bayesian model | |
JP2020047234A (en) | Data evaluation method, device, apparatus, and readable storage media | |
CN114048320B (en) | Multi-label international disease classification training method based on course learning | |
CN112084330A (en) | Incremental relation extraction method based on course planning meta-learning | |
CN116564409A (en) | Machine learning-based identification method for sequencing data of transcriptome of metastatic breast cancer | |
CN112259232B (en) | VTE risk automatic evaluation system based on deep learning | |
US11961204B2 (en) | State visualization device, state visualization method, and state visualization program | |
MacCallum et al. | Modeling multivariate change | |
CN111310792B (en) | Drug sensitivity experiment result identification method and system based on decision tree | |
CN112071431B (en) | Clinical path automatic generation method and system based on deep learning and knowledge graph | |
Özkan et al. | Effect of data preprocessing on ensemble learning for classification in disease diagnosis | |
Lafit et al. | Enabling analytical power calculations for multilevel models with autocorrelated errors through deriving and approximating the precision matrix | |
CN116959585A (en) | Deep learning-based whole genome prediction method | |
CN113889274B (en) | Method and device for constructing risk prediction model of autism spectrum disorder | |
Praserttitipong et al. | Elective course recommendation model for higher education program. | |
CN113673609B (en) | Questionnaire data analysis method based on linear hidden variables | |
Wheadon | Classification accuracy and consistency under item response theory models using the package classify |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||