CN114048320B - Multi-label international disease classification training method based on course learning - Google Patents


Info

Publication number
CN114048320B
CN114048320B (application CN202210029712.0A)
Authority
CN
China
Prior art keywords
training
label
international disease
training sample
stage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210029712.0A
Other languages
Chinese (zh)
Other versions
CN114048320A (en)
Inventor
王亚强
韩旭
郝学超
舒红平
朱涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu University of Information Technology
West China Hospital of Sichuan University
Original Assignee
Chengdu University of Information Technology
West China Hospital of Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu University of Information Technology and West China Hospital of Sichuan University
Priority to CN202210029712.0A
Publication of CN114048320A
Application granted
Publication of CN114048320B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16H: HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H10/00: ICT specially adapted for the handling or processing of patient-related medical or healthcare data
    • G16H10/60: ICT specially adapted for the handling or processing of patient-related medical or healthcare data for patient-specific data, e.g. for electronic patient records

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a multi-label International Classification of Diseases (ICD) training method based on curriculum learning, which controls label distribution through three different mini-batch sampling methods when automatically coding a large-scale ICD data set. First, a multi-label international disease training sample set is obtained and divided into a plurality of training sample subsets. In the first training stage, iterative stratified sampling, gradient computation and a first round of model parameter updates are performed on the training sample subsets. In the second stage, iterative shuffle-and-split, gradient computation and a second round of parameter updates are performed on the training sample set. In the third stage, iterative probability sampling, gradient computation and a third round of parameter updates are performed on the training sample subsets. The invention improves the training phase of current mainstream models; on the ICD multi-label classification task, the improved models show greatly increased accuracy and generalization ability.

Description

Multi-label international disease classification training method based on course learning
Technical Field
The invention relates to the technical field of natural language processing, in particular to a multi-label international disease classification training method based on course learning.
Background
International Classification of Diseases (ICD) codes are important labels for electronic health records. ICD codes are characterized by the etiology, pathology, clinical manifestations and the like of a disease, and the same class of diseases are grouped into an ordered code combination. The codes are used for quantifying important statistical data, are convenient for finding patient queues with similar diagnosis, and are also used as a standardized information exchange means among hospitals, so that the codes have important value and significance.
The automatic classification of ICD codes of electronic medical records is a significant task. On one hand, the cost of a large amount of manual classification is saved by automatic classification, and on the other hand, accurate classification can effectively assist diagnosis of doctors. In recent years, machine learning-based techniques have been demonstrated to learn classification models using electronic medical records, which require multi-label classification because each electronic case often involves multiple diseases.
An automatic ICD coding model requires high accuracy and generalization ability; however, the ICD codes of electronic records in some large-scale data sets are extremely imbalanced, which leads to low accuracy and poor generalization. Take the well-known electronic medical record set MIMIC-III (a set of roughly 55,000 electronic medical records provided by the MIT Laboratory for Computational Physiology) as an example: MIMIC-III contains about 6,000 labels in a long-tailed distribution. ICD multi-label classification was performed on MIMIC-III with three mainstream deep learning models (TextCNN, TextRNN and TextRCNN); the results are shown in Table 1. From the experimental results it can be observed that the Fscore on the test set is generally low while the Fscore on the training set is high, so neither the generalization ability nor the accuracy of the three models is satisfactory.
TABLE 1 Experimental results of three mainstream models on MIMIC-III
Disclosure of Invention
The invention aims to provide a curriculum learning-based multi-label international disease classification training method, which divides the traditional training process into three stages with different difficulty degrees and improves the precision and generalization capability of a classification model.
In order to achieve the purpose, the technical scheme of the invention is as follows:
a multi-label international disease classification training method based on course learning comprises the following steps:
acquiring a multi-label international disease training sample set, and dividing the multi-label international disease training sample set into a plurality of training sample subsets to form a training sample subset set;
carrying out layered sampling on the training sample subset set to obtain a first-stage small batch sample subset, inputting the first-stage small batch sample subset into a course learning model to carry out gradient calculation, updating parameters of the course learning model, repeating the processes of layered sampling, gradient calculation and parameter updating until reaching a preset iteration number to obtain updated parameters of the course learning model, and configuring to obtain a first-stage training model;
scrambling and segmenting the multi-label international disease training sample set to obtain a second-stage small-batch sample subset, inputting the second-stage small-batch sample subset into a first-stage training model to perform gradient calculation, updating parameters of the first-stage training model, repeating the scrambling and segmenting, gradient calculation and parameter updating processes until preset iteration times are reached to obtain updated second-stage training model parameters, and configuring to obtain a second-stage training model;
and performing probability sampling on the training sample subset set to obtain a third-stage small batch sample subset, inputting the third-stage small batch sample subset into a second-stage training model to perform gradient calculation, updating parameters of the second-stage training model, repeating the processes of probability sampling, gradient calculation and parameter updating until reaching a preset iteration number to obtain updated third-stage training model parameters, and configuring to obtain a trained course learning model.
Further, a multi-label international disease training sample set is obtained and divided into a plurality of training sample subsets to form a training sample subset set, and the method specifically comprises the following steps:
acquiring an electronic medical record, and preprocessing the electronic medical record to obtain a multi-label international disease training sample set;
counting a multi-label international disease training sample set to obtain the probability distribution of international disease labels;
dividing the multi-label international disease training sample set into a plurality of training sample subsets based on international disease labels in the multi-label international disease training sample set to form a training sample subset set.
Further, preprocessing the electronic medical record to obtain a multi-label international disease training sample set, and specifically comprises the following steps:
deleting punctuation marks, numbers, stop words and meaningless fields in the electronic medical record to obtain an initial training sample set;
tokenizing the initial training sample set to generate a token dictionary;
calculating the TF-IDF score of every token in the token dictionary, setting a TF-IDF score threshold range, and retaining the tokens whose TF-IDF scores fall within the threshold range, obtaining the multi-label international disease training sample set.
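The TF-IDF filtering step above can be sketched as follows. This is a minimal illustration, not the patented implementation; the function names and the toy threshold range are hypothetical, and a real system would score the full corpus vocabulary.

```python
import math
from collections import Counter

def tfidf_scores(docs):
    """Compute one TF-IDF score per token over a list of tokenized documents.

    For each token we keep its maximum (tf * idf) over all documents;
    this pooling choice is an assumption, not from the patent."""
    n_docs = len(docs)
    df = Counter()                      # document frequency of each token
    for doc in docs:
        df.update(set(doc))
    scores = {}
    for doc in docs:
        tf = Counter(doc)
        for tok, cnt in tf.items():
            # term frequency in this document * inverse document frequency
            s = (cnt / len(doc)) * math.log(n_docs / df[tok])
            scores[tok] = max(scores.get(tok, 0.0), s)
    return scores

def filter_vocabulary(docs, lo, hi):
    """Keep only tokens whose TF-IDF score lies inside [lo, hi]."""
    scores = tfidf_scores(docs)
    keep = {t for t, s in scores.items() if lo <= s <= hi}
    return [[t for t in doc if t in keep] for doc in docs]
```

A token appearing in every document gets idf = log(1) = 0 and is dropped by any positive lower threshold, which mirrors how uninformative fields are filtered out.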
Further, counting a multi-label international disease training sample set to obtain the probability distribution of the international disease labels, and specifically comprising the following steps:
counting a multi-label international disease training sample set to obtain an international disease label set of the multi-label international disease training sample set;
and calculating the distribution condition of each international disease label according to the international disease label set to obtain the probability distribution of the international disease labels.
Further, the multi-label international disease training sample set is divided into a plurality of training sample subsets according to the international disease labels, and the number of the training sample subsets is the same as that of the international disease labels.
Further, the hierarchical sampling of the training sample subset set specifically includes the following steps:
setting the size of a small-batch sample subset in the first stage as m, and randomly sampling k international disease labels from the international disease label set;
and selecting k training sample subsets corresponding to k international disease labels from the training sample subset set, and respectively randomly sampling m samples for each training sample subset in the k training sample subsets to form k first-stage small-batch sample subsets.
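A minimal sketch of this stage-1 stratified sampling. Drawing samples with replacement (so that small label subsets can still fill a batch of size m) is an assumption made here; the patent does not specify that detail, and the names are illustrative.

```python
import random

def stratified_minibatches(subsets_by_label, k, m, rng=random):
    """Stage 1: uniformly sample k distinct labels, then draw m examples
    from each sampled label's subset, yielding k mini-batches of size m.
    Sampling is with replacement so small subsets can fill a batch."""
    labels = rng.sample(list(subsets_by_label), k)   # k distinct labels
    return [[rng.choice(subsets_by_label[lab]) for _ in range(m)]
            for lab in labels]
```

Because every batch is built from a single uniformly chosen label, rare labels appear in batches as often as frequent ones, which is exactly the balanced exposure the first stage aims for.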
Further, the scrambling and segmenting the multi-label international disease training sample set comprises the following specific steps:
and fully shuffling the multi-label international disease training sample set, and dividing the multi-label international disease training sample set into k non-overlapping equal parts with the size of m to form k second-stage small-batch sample subsets.
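The shuffle-and-split step can be sketched as below; dropping a final partial batch is a simplifying assumption made for this illustration.

```python
import random

def shuffle_split(dataset, m, rng=random):
    """Stage 2: fully shuffle the training set and cut it into k
    non-overlapping mini-batches of size m (any trailing partial
    batch is dropped here for simplicity)."""
    data = list(dataset)
    rng.shuffle(data)
    k = len(data) // m
    return [data[i * m:(i + 1) * m] for i in range(k)]
```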
Further, performing probabilistic sampling on the training sample subset set includes the specific steps of:
probability sampling k international disease labels based on the probability distribution of the international disease labels in the multi-label international disease training sample set;
and selecting k training sample subsets corresponding to k international disease labels from the training sample subset set, and respectively randomly sampling one sample for each of the k training sample subsets to form k third-stage small-batch sample subsets with the size of m.
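One possible reading of this stage-3 procedure, sketched below: each mini-batch of size m is built from m label draws weighted by the training-set label probability distribution, with one sample drawn per label. Both this interpretation and the names are assumptions.

```python
import random

def probability_minibatches(subsets_by_label, label_probs, k, m, rng=random):
    """Stage 3: fill each slot of a batch by first drawing a label
    according to the empirical label distribution, then drawing one
    sample from that label's subset; repeat for k batches of size m."""
    labels = list(label_probs)
    weights = [label_probs[lab] for lab in labels]
    batches = []
    for _ in range(k):
        drawn = rng.choices(labels, weights=weights, k=m)  # m weighted label draws
        batches.append([rng.choice(subsets_by_label[lab]) for lab in drawn])
    return batches
```

Under this reading the expected label frequencies inside a batch match the original data set, which is the "hardest" curriculum stage the text describes.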
Further, in one iteration, the first-stage small-batch sample subset, the second-stage small-batch sample subset, and the third-stage small-batch sample subset are input into the training model for gradient calculation and updating of the parameters of the training model in the same manner, and the method for gradient calculation and updating of the parameters of the training model specifically includes the following steps:
s1, acquiring a training model of the current stage;
s2, sequentially inputting samples in the small batch sample subset of the current stage into a training model, calculating through a loss function to obtain loss values with the same number as the samples, and averaging the loss values to obtain the loss of the iteration of the current stage;
s3, estimating gradient updating parameters based on the training model, the loss sample subset and the small batch sample subset, and updating the training model according to the gradient updating parameters;
and S4, repeating the steps S1-S3 until all the small batch sample subsets are input into the training model to perform gradient calculation and update the parameters of the training model, and ending the iteration.
Compared with the prior art, the invention has the following advantages:
the multi-label international disease classification automatic coding method based on course learning improves three current mainstream models of textCNN, textRNN and textRCNN. The label distribution of the small-batch sample subset is controlled by three different small-batch sample sampling methods, and the label distribution represents the difficulty level of the stage. For multi-label classification of ICD codes, labels with more samples are easy to learn, and labels with less samples are difficult to learn in the training process, so that a hierarchical sampling method is used in the first stage of the training process to ensure that a small batch of sample subsets contain a large number of samples with different labels, and a model learns all kinds of labels quickly in the early stage of training; in the second stage of the training process, a small-batch sample subset is obtained by using a conventional scrambling and slicing algorithm; and the third training stage uses a guided probabilistic sampling method to make the distribution of the small batch sample subset closer to that of the original data set. The improved course learning model greatly improves the model precision and generalization capability on the ICD coding multi-label classification task.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It is obvious that the drawings in the following description are some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive exercise.
FIG. 1 is a schematic flow chart of three training phases of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention. It is to be understood that the embodiments described are only a few embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The examples are given for the purpose of better illustration of the invention, but the invention is not limited to the examples. Therefore, those skilled in the art should make insubstantial modifications and adaptations to the embodiments of the present invention in light of the above teachings and remain within the scope of the invention.
Example 1
The Medical Information Mart for Intensive Care (MIMIC) data set is a public medical data set built from intensive-care patient monitoring, intended to promote medical research and improve the level of ICU decision support. In this embodiment, the discharge summaries in the text record event table (notes) of MIMIC are used as electronic cases, and the ICD-9 codes corresponding to each case are predicted.
In this embodiment, data cleaning is performed on the original electronic cases. After punctuation, numbers, stop words and some meaningless fields such as "advice Date" are removed from the cases, the entire data set is tokenized and a token dictionary is generated. The TF-IDF score of each token in the dictionary is then computed; TF-IDF evaluates how important a token is to a corpus. The 10,000 tokens whose TF-IDF scores fall within the preset threshold range are retained, while the tokens outside the range are deleted. Detailed statistics of the processed data set are shown in Table 2.
TABLE 2 detailed statistics of data set MIMIC
MIMIC contains 55,177 electronic cases in total, covering 6,919 ICD-9 codes. In the processed data set, each sample contains 898 tokens and 11 labels on average, and all samples form a training sample set D = {d_1, d_2, …, d_n}, where n is the total number of training samples.
The training sample set D is counted to obtain the training sample label set L = {l_1, l_2, …, l_c}, where c is the total number of labels over the samples. The sample set D is divided into c subsets according to the labels of the samples, D_i = {d ∈ D | the i-th label l_i appears among the labels of d}, yielding the set of sample subsets stratified by label, S = {D_1, D_2, …, D_c}.
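The label-based partition described above can be sketched as follows; a sample carrying several labels is placed into each corresponding subset, and the (text, label set) data layout is an assumption made for the illustration.

```python
def partition_by_label(samples):
    """Split sample set D into one subset per label. `samples` is a
    list of (text, label_set) pairs; a multi-label sample appears in
    every subset matching one of its labels."""
    subsets = {}
    for text, labels in samples:
        for lab in labels:
            subsets.setdefault(lab, []).append((text, labels))
    return subsets
```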
There are two types of evaluation indicators for the multi-label classification task: sample-based measurements and label-based measurements. The performance of a model is evaluated here with label-based measurements, namely Precision, Recall and Fscore, calculated as follows:

Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
Fscore = 2 * Precision * Recall / (Precision + Recall)

where TP, FP and FN denote the numbers of true positives, false positives and false negatives counted over all labels.
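A sketch of the label-based Precision/Recall/Fscore measurements; micro-averaging (pooling the counts over all labels) is assumed here, since the exact variant is not spelled out in the text.

```python
def micro_prf(true_sets, pred_sets):
    """Micro-averaged precision, recall and F-score for multi-label
    predictions given as parallel lists of label sets: pool TP, FP
    and FN over all labels, then apply the standard formulas."""
    tp = fp = fn = 0
    for t, p in zip(true_sets, pred_sets):
        tp += len(t & p)   # labels correctly predicted
        fp += len(p - t)   # labels predicted but absent
        fn += len(t - p)   # labels present but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f = (2 * precision * recall / (precision + recall)
         if precision + recall else 0.0)
    return precision, recall, f
```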
the MIMIC data set is randomly divided into a training data set, a verification data set and a test data set in a ratio of 7:1: 2. Using Adam optimization program, the learning rate was set to 0.008 and the total number of iterations was 150 rounds.
Using the three models TextCNN, TextRNN and TextRCNN, 150 rounds of ICD multi-label classification are performed multiple times on the MIMIC data set, and the averaged result is taken as the baseline. The three models are then trained with the improved method of the invention, and multiple groups of ICD multi-label classification are performed.
In the multi-label international disease classification training method based on curriculum learning in this embodiment, the traditional training process is divided into three stages with different difficulty levels, and the flow diagrams of the three training stages are shown in fig. 1.
Training in the first stage:
setting the size of the small-batch sample subset as m, firstly randomly sampling k labels from a label set L, and then randomly sampling k labels from the label set L
Figure 95915DEST_PATH_IMAGE006
Selecting k training sample subsets corresponding to k international disease labels, and respectively randomly sampling m samples for each training sample subset in the k training sample subsets to form k first small-batch sample subsets;
inputting the first small batch sample subset into a course learning model to sequentially calculate gradients, and updating model parameters;
and repeating the processes of layered sampling, gradient calculation and parameter updating until reaching a preset iteration number to obtain updated course learning model parameters, and configuring to obtain a first-stage training model.
And (3) training in the second stage:
collecting the samples
Figure 868699DEST_PATH_IMAGE003
Fully shuffling the samples and dividing the samples into k non-overlapping equal parts with the size of m to form k second-stage small-batch sample subsets;
inputting the second small batch sample subset into the first-stage training model to sequentially calculate the gradient, and updating the parameters of the first-stage training model;
and repeating the steps of scrambling and segmenting, gradient calculating and model parameter updating until reaching the preset iteration times, and configuring the second-stage training model according to the updated model parameters obtained by the second-stage training.
And (3) training in a third stage:
a. The probability distribution P of each label over the training sample set D is computed.
b. Under the guidance of P, k labels are probabilistically sampled; then the k training sample subsets corresponding to the k sampled labels are selected from the stratified subset set S, and one sample is randomly drawn from each of the k subsets, forming k third-stage mini-batch sample subsets of size m.
c. And inputting the third small batch sample subset into the second-stage training model to sequentially calculate the gradient, and updating the parameters of the second-stage training model.
d. And c, repeating the steps b and c until a preset iteration number is reached, and configuring a third-stage training model according to the updated model parameters obtained in the third-stage training stage, namely the course learning model with the finally obtained parameters updated.
In the first stage, the second stage and the third stage training process, after a small batch of sample subsets are input into the model in each iteration, the steps of calculating the gradient and updating the model parameters are consistent, and the steps of calculating the gradient and updating the model parameters specifically comprise the following steps:
s1, obtaining k small batch sample subsets B through the current-stage sampling method, and forming a small batch sample subset set required by one iteration
Figure 247094DEST_PATH_IMAGE013
Where k is equal to the training set size | D | divided by the mini-batch sample subset size m.
S2, obtaining the model θ of the current stage, the initial state of θ is generally randomly drawn from a distribution (e.g., a uniform distribution).
S3, mixing
Figure 897387DEST_PATH_IMAGE013
M samples in the middle and small batch sample subset B are sequentially sent into a model theta and finally obtained through loss function calculationmAveraging the loss values to obtain the loss of the iteration of the current roundl
S4, according to the model theta, the loss l and the small batch sample subset B, the gradient update Δ theta can be estimated, and the model is updated: θ = θ - μ Δ θ, where μ is the over-parameter learning rate.
S5, repeating the steps S2-S4k times until the use
Figure 648305DEST_PATH_IMAGE013
And (5) completing model updating on all small batch sample subsets, and ending the iteration.
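These per-iteration steps reduce to an ordinary mini-batch SGD loop. The sketch below shows the update rule θ = θ - μΔθ on a deliberately tiny toy problem (fitting a scalar mean under squared loss); the actual models are of course full neural networks, and all names here are illustrative.

```python
def run_epoch(theta, minibatches, grad_fn, mu):
    """One iteration over all k mini-batches: for each batch, estimate
    the gradient of the averaged loss and apply theta = theta - mu * grad."""
    for batch in minibatches:
        grad = grad_fn(theta, batch)
        theta = theta - mu * grad
    return theta

def grad_fn(theta, batch):
    """Toy instance: squared loss l(theta, x) = (theta - x)^2, whose
    batch-averaged gradient is 2 * (theta - mean(batch))."""
    return sum(2 * (theta - x) for x in batch) / len(batch)
```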
The difficulty of each of the three sampling methods determines the training stage in which it is used: the simpler the method, the earlier it should appear in training. In the curriculum-learning-based multi-label international disease classification training method of this embodiment, the model is meant to encounter mini-batch subsets with balanced label categories early in training, and mini-batch subsets whose label distribution is close to that of the training samples later in training. This embodiment uses one method for measuring the class balance of a mini-batch subset and two methods for measuring the difference between sample label distributions to quantify the difficulty level.
The standard deviation σ of the numbers of the various labels measures well whether the label distribution of a mini-batch sample subset is balanced: the smaller the value, the more balanced the label distribution and the higher the label diversity. The calculation formula is:

σ = sqrt( (1/c) * Σ_{i=1..c} (n_i - n̄)² )

where c is the number of label categories, n_i is the number of samples containing the i-th label, and n̄ is the average of the c values n_i.
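The standard deviation of per-label counts can be computed directly from a mini-batch's label sets, as sketched here (function name and input layout are illustrative):

```python
import math

def label_count_std(batch_label_sets, all_labels):
    """Standard deviation of per-label sample counts inside a mini-batch;
    smaller values mean a more balanced (more diverse) label mix."""
    counts = {lab: 0 for lab in all_labels}
    for labels in batch_label_sets:
        for lab in labels:
            counts[lab] += 1
    c = len(all_labels)
    mean = sum(counts.values()) / c
    return math.sqrt(sum((n - mean) ** 2 for n in counts.values()) / c)
```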
The K-L distance is used to measure the closeness between two probability distributions. KL(p‖q) represents the distance of distribution p from distribution q; the larger the value, the larger the gap between the distributions, and KL(p‖q) = 0 when p = q. The formula is:

KL(p‖q) = Σ_i p_i * log(p_i / q_i)

where p_i and q_i respectively represent the probability of the i-th element under the probability distributions p and q. The K-L distance is asymmetric: KL(p‖q) is in most cases not equal to KL(q‖p).
The J-S distance is an optimization of the K-L distance. Compared with the K-L distance it is symmetric, JS(p‖q) = JS(q‖p), and it measures the distance between two probability distributions better. The formula is:

JS(p‖q) = (1/2) * KL(p ‖ (p+q)/2) + (1/2) * KL(q ‖ (p+q)/2)
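Both distances can be computed in a few lines. The sketch below assumes discrete distributions given as aligned probability vectors and skips zero-probability terms of p in the K-L sum (the usual 0 * log 0 = 0 convention).

```python
import math

def kl_distance(p, q):
    """K-L distance KL(p||q) = sum_i p_i * log(p_i / q_i); zero when
    p == q and asymmetric in general. Assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_distance(p, q):
    """J-S distance: symmetrised K-L against the mixture (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_distance(p, m) + 0.5 * kl_distance(q, m)
```

Inside js_distance, the mixture entries are positive whenever either input is, so the K-L terms are always well-defined; that is the design reason the J-S distance is preferred for comparing batch and data set label distributions.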
the K _ L distance and the J _ S distance can well measure the distance of label distribution between the small batch sample subset and the original training data set.
The difficulty measurements for the three methods are shown in table 3.
TABLE 3 Small batch sample subset difficulty measurement
According to these results, the stratified sampling strategy best preserves label-class balance within a mini-batch subset, while, compared with the other two methods, probability sampling best preserves the sample label distribution of the training set. Combining the two factors, the difficulty measures of this experiment order the curriculum from easy to hard as stratified sampling (2), shuffle-and-split (0), then probability sampling (1).
The test data set is input into the three parameter-updated curriculum learning models based on TextCNN, TextRNN and TextRCNN, and the classification results are compared with those obtained by the three models trained without the curriculum method.
Table 4 shows the experimental results of the three models. The curriculum-learning-based multi-label international disease classification automatic coding method improves the Fscore of TextCNN, TextRNN and TextRCNN by 33.4%, 37.4% and 45% respectively, a very marked improvement. This demonstrates that the classification method of this embodiment can improve ICD multi-label classification performance on large-scale data with an imbalanced label distribution.
TABLE 4 Classification results of course learning model after three parameters update
Table 5 shows the generalization ability of the classification method in this example on MIMIC-III. It can be seen that the training set Fscore and the test set Fscore of the three models are very close. By comparing the results with the results in table 4, it can be proved that the classification method in the embodiment can effectively improve the generalization capability of the model.
TABLE 5 generalization ability of the classification method in this example on MIMIC-III
Table 6 shows the results of TextCNN and TextRCNN under the same conditions (150 rounds with a mini-batch size of 7000) for the other orderings of the three sampling methods, all averaged over multiple experimental results. Both models achieve their best results under the 2-0-1 sampling order, which matches expectation and demonstrates the correctness of the difficulty measurement method used in this embodiment.
TABLE 6 results of course learning model under all sampling modes
While the present invention has been described with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, which are illustrative and not restrictive, and it will be apparent to those skilled in the art that various changes and modifications can be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (9)

1. A multi-label international disease classification training method based on course learning is characterized by comprising the following steps:
acquiring a multi-label international disease training sample set, and dividing the multi-label international disease training sample set into a plurality of training sample subsets to form a training sample subset set;
performing hierarchical sampling on the training sample subset set to obtain first-stage small-batch sample subsets, inputting the first-stage small-batch sample subsets into a course learning model for gradient calculation, updating the parameters of the course learning model, and repeating the hierarchical sampling, gradient calculation, and parameter updating until a preset number of iterations is reached, so as to obtain updated course learning model parameters and configure a first-stage training model;
scrambling and segmenting the multi-label international disease training sample set to obtain second-stage small-batch sample subsets, inputting the second-stage small-batch sample subsets into the first-stage training model for gradient calculation, updating the parameters of the first-stage training model, and repeating the scrambling and segmentation, gradient calculation, and parameter updating until a preset number of iterations is reached, so as to obtain updated second-stage training model parameters and configure a second-stage training model;
and performing probability sampling on the training sample subset set to obtain third-stage small-batch sample subsets, inputting the third-stage small-batch sample subsets into the second-stage training model for gradient calculation, updating the parameters of the second-stage training model, and repeating the probability sampling, gradient calculation, and parameter updating until a preset number of iterations is reached, so as to obtain updated third-stage training model parameters and configure the trained course learning model.
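For illustration only (not part of the claims), the three training stages above amount to one loop body fed by three different minibatch sources applied in sequence. A minimal Python sketch under assumed names; `train_step` and the samplers are placeholders, not the patent's networks:

```python
def train_step(model, batch):
    # Placeholder for one gradient step: in the real method this
    # computes the loss on the minibatch, backpropagates, and
    # updates the model parameters. Here we only count updates.
    model["updates"] += 1
    return model

def three_stage_training(model, samplers, iters_per_stage):
    """Run the three curriculum stages in order.

    samplers: three minibatch generators (hierarchical sampling,
    shuffle-and-split, probability sampling); each call returns
    the minibatches for one pass.
    """
    for sampler in samplers:              # stage 1, 2, 3 in sequence
        for _ in range(iters_per_stage):  # preset iteration count
            for batch in sampler():       # every minibatch of the pass
                model = train_step(model, batch)
    return model

# Toy demonstration with dummy samplers yielding 2 minibatches each.
model = {"updates": 0}
dummy = lambda: [[1, 2], [3, 4]]
model = three_stage_training(model, [dummy, dummy, dummy], iters_per_stage=5)
print(model["updates"])  # 3 stages * 5 iterations * 2 batches = 30
```

The sketch only fixes the control flow; each stage differs solely in how its sampler builds minibatches.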
2. The multi-label international disease classification training method according to claim 1, wherein the acquiring of a multi-label international disease training sample set and its division into a plurality of training sample subsets to form a training sample subset set specifically comprises the following steps:
acquiring an electronic medical record, and preprocessing the electronic medical record to obtain a multi-label international disease training sample set;
counting a multi-label international disease training sample set to obtain the probability distribution of international disease labels;
dividing the multi-label international disease training sample set into a plurality of training sample subsets based on international disease labels in the multi-label international disease training sample set to form a training sample subset set.
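For illustration only (not part of the claims), one natural reading of the division in claim 2 is that each multi-label sample is placed into the subset of every label it bears, so the subsets may overlap. A sketch under that assumption; all names are illustrative:

```python
from collections import defaultdict

def split_by_label(samples):
    """Group (text, labels) samples into one subset per ICD label.

    A multi-label sample appears in the subset of each of its
    labels, so the resulting subsets may overlap.
    """
    subsets = defaultdict(list)
    for text, labels in samples:
        for label in labels:
            subsets[label].append((text, labels))
    return dict(subsets)

# Toy records with hypothetical ICD-10 codes.
samples = [
    ("note a", {"I10", "E11"}),
    ("note b", {"I10"}),
    ("note c", {"E11", "J45"}),
]
subsets = split_by_label(samples)
print(sorted(subsets))      # ['E11', 'I10', 'J45']
print(len(subsets["I10"]))  # 2
```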
3. The multi-label international disease classification training method according to claim 2, wherein the electronic medical record is preprocessed to obtain a multi-label international disease training sample set, and the method specifically comprises the following steps:
deleting punctuation marks, numbers, common words and meaningless fields in the electronic medical record to obtain an initial training sample set;
performing word segmentation on the initial training sample set to generate a word segmentation dictionary;
calculating TF-IDF scores of all the participles in the participle dictionary, setting a TF-IDF score threshold range, reserving the participles of which the TF-IDF scores are in the TF-IDF score threshold range, and obtaining the multi-label international disease training sample set.
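For illustration only (not part of the claims), the TF-IDF filtering in claim 3 can be sketched as follows. This uses a simplified corpus-level TF and whitespace tokenization; the thresholds, tokenizer, and exact TF-IDF formula are assumptions, since the patent does not specify them:

```python
import math
from collections import Counter

def tfidf_vocabulary(docs, low, high):
    """Score every token by a corpus-level TF-IDF and keep tokens
    whose score lies inside the [low, high] window, filtering out
    tokens that are too rare (or, with a lower `high`, too common)."""
    n_docs = len(docs)
    tokenized = [doc.split() for doc in docs]
    df = Counter(t for doc in tokenized for t in set(doc))  # document frequency
    tf = Counter(t for doc in tokenized for t in doc)       # corpus term frequency
    total = sum(tf.values())
    scores = {
        t: (tf[t] / total) * math.log(n_docs / df[t] + 1)
        for t in tf
    }
    return {t for t, s in scores.items() if low <= s <= high}

docs = ["cough fever cough", "fever rash", "cough"]
vocab = tfidf_vocabulary(docs, low=0.25, high=1.0)
print(sorted(vocab))  # ['cough', 'fever'] -- 'rash' scores below the window
```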
4. The multi-label international disease classification training method according to claim 2, wherein the method for obtaining the international disease label probability distribution by counting the multi-label international disease training sample set specifically comprises the following steps:
counting a multi-label international disease training sample set to obtain an international disease label set of the multi-label international disease training sample set;
and calculating the distribution condition of each international disease label according to the international disease label set to obtain the probability distribution of the international disease labels.
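For illustration only (not part of the claims), the label probability distribution of claim 4 can be computed by counting label occurrences and normalizing; a sketch with illustrative names:

```python
from collections import Counter

def label_distribution(samples):
    """Count label occurrences across (text, labels) samples and
    normalise the counts into a probability distribution over
    international disease labels."""
    counts = Counter(l for _, labels in samples for l in labels)
    total = sum(counts.values())
    return {label: c / total for label, c in counts.items()}

samples = [("a", {"I10", "E11"}), ("b", {"I10"}), ("c", {"I10"})]
dist = label_distribution(samples)
print(dist["I10"])  # 3 of 4 label occurrences -> 0.75
```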
5. The multi-label international disease classification training method of claim 2, characterized in that: dividing the multi-label international disease training sample set into a plurality of training sample subsets according to international disease labels, wherein the number of the training sample subsets is the same as that of the international disease labels.
6. The multi-label international disease classification training method according to claim 4, wherein the hierarchical sampling of the training sample subset set specifically comprises the steps of:
setting the size of a small-batch sample subset in the first stage as m, and randomly sampling k international disease labels from the international disease label set;
and selecting k training sample subsets corresponding to k international disease labels from the training sample subset set, and respectively randomly sampling m samples for each training sample subset in the k training sample subsets to form k first-stage small-batch sample subsets.
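For illustration only (not part of the claims), the hierarchical sampling of claim 6 can be sketched as below. Sampling with replacement when a subset holds fewer than m samples is an assumption, since the claim does not say how small subsets are handled:

```python
import random

def hierarchical_sample(subsets, k, m, rng=random):
    """Draw k international disease labels uniformly at random, then
    m samples from each drawn label's subset, yielding k first-stage
    small-batch sample subsets of size m."""
    labels = rng.sample(sorted(subsets), k)
    batches = []
    for label in labels:
        pool = subsets[label]
        if len(pool) >= m:
            batches.append(rng.sample(pool, m))          # without replacement
        else:
            batches.append([rng.choice(pool) for _ in range(m)])  # with replacement
    return batches

subsets = {"I10": list(range(10)), "E11": list(range(5)), "J45": list(range(8))}
batches = hierarchical_sample(subsets, k=2, m=4)
print(len(batches), len(batches[0]))  # 2 4
```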
7. The multi-label international disease classification training method according to claim 1, wherein the scrambling and segmentation of the multi-label international disease training sample set comprises the specific steps of:
and fully shuffling the multi-label international disease training sample set and dividing it into k non-overlapping equal parts of size m, to form k second-stage small-batch sample subsets.
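For illustration only (not part of the claims), the shuffle-and-split of claim 7 is straightforward; dropping a trailing remainder smaller than m is an assumption, since the claim speaks only of equal parts:

```python
import random

def shuffle_and_split(samples, m, rng=random):
    """Fully shuffle the training set and cut it into k
    non-overlapping minibatches of size m; a trailing remainder
    smaller than m is dropped in this sketch."""
    pool = list(samples)
    rng.shuffle(pool)
    k = len(pool) // m
    return [pool[i * m:(i + 1) * m] for i in range(k)]

batches = shuffle_and_split(range(10), m=3)
print(len(batches))  # 3 full batches of size 3; one sample is dropped
```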
8. The multi-label international disease classification training method according to claim 2, characterized in that the probability sampling of the training sample subset set comprises the specific steps of:
probability sampling k international disease labels based on the probability distribution of the international disease labels in the multi-label international disease training sample set;
and selecting the k training sample subsets corresponding to the k sampled international disease labels from the training sample subset set, and repeatedly sampling one sample at random from each of the k training sample subsets, to form k third-stage small-batch sample subsets of size m.
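For illustration only (not part of the claims), the probability sampling of claim 8 can be sketched as below. Building each minibatch by drawing m samples from the drawn label's subset is one interpretation of the claim; names and shapes are assumptions:

```python
import random

def probability_sample(subsets, dist, k, m, rng=random):
    """Draw k international disease labels according to the label
    probability distribution `dist`, then build one minibatch of
    size m per drawn label by sampling (with replacement) from
    that label's subset."""
    labels = sorted(dist)
    weights = [dist[l] for l in labels]
    drawn = rng.choices(labels, weights=weights, k=k)  # frequent labels drawn more often
    return [[rng.choice(subsets[l]) for _ in range(m)] for l in drawn]

# Degenerate distribution so the outcome is deterministic.
subsets = {"I10": [1, 2, 3], "E11": [4, 5]}
dist = {"I10": 1.0, "E11": 0.0}
batches = probability_sample(subsets, dist, k=2, m=3)
print(len(batches), len(batches[0]))  # 2 3
```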
9. The multi-label international disease classification training method of claim 1, wherein, in one iteration, the first-stage, second-stage, and third-stage small-batch sample subsets are input into the training model for gradient calculation and parameter updating in the same way, which specifically comprises the following steps:
s1, acquiring a training model of the current stage;
s2, sequentially inputting the samples in the current-stage small-batch sample subset into the training model, calculating one loss value per sample through the loss function, and averaging the loss values to obtain the loss for the current iteration;
s3, estimating the gradient based on the training model, the loss, and the small-batch sample subset, and updating the training model parameters according to the estimated gradient;
and S4, repeating the steps S1-S3 until all the small batch sample subsets are input into the training model to perform gradient calculation and update the parameters of the training model, and ending the iteration.
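For illustration only (not part of the claims), steps S1 to S4 describe an ordinary minibatch loop: average the per-sample losses, then take one parameter update per minibatch. A numeric sketch on a one-parameter least-squares model; the model, loss, and learning rate are illustrative, not the patent's networks:

```python
def run_iteration(w, minibatches, lr=0.05):
    """One iteration over all minibatches (steps S1-S4): for each
    minibatch, average the per-sample squared-error losses of the
    model y = w * x, estimate the gradient, and update w."""
    for batch in minibatches:                    # S4: every minibatch
        # S2: gradient of the per-sample loss (w*x - y)^2, averaged
        grads = [2 * (w * x - y) * x for x, y in batch]
        grad = sum(grads) / len(grads)           # S3: gradient estimate
        w -= lr * grad                           # S3: parameter update
    return w

# Samples obey y = 2x, so w should move toward 2 from its start at 0.
data = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
w = run_iteration(0.0, data)
print(w)  # approximately 1.85 after two minibatch updates
```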
CN202210029712.0A 2022-01-12 2022-01-12 Multi-label international disease classification training method based on course learning Active CN114048320B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210029712.0A CN114048320B (en) 2022-01-12 2022-01-12 Multi-label international disease classification training method based on course learning


Publications (2)

Publication Number Publication Date
CN114048320A CN114048320A (en) 2022-02-15
CN114048320B true CN114048320B (en) 2022-03-29

Family

ID=80196261

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210029712.0A Active CN114048320B (en) 2022-01-12 2022-01-12 Multi-label international disease classification training method based on course learning

Country Status (1)

Country Link
CN (1) CN114048320B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116844717B (en) * 2023-09-01 2023-12-22 中国人民解放军总医院第一医学中心 Medical advice recommendation method, system and equipment based on hierarchical multi-label model

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108519978A (en) * 2018-04-10 2018-09-11 成都信息工程大学 A kind of Chinese document segmenting method based on Active Learning
CN108537270A (en) * 2018-04-04 2018-09-14 厦门理工学院 Image labeling method, terminal device and storage medium based on multi-tag study
CN109460473A (en) * 2018-11-21 2019-03-12 中南大学 The electronic health record multi-tag classification method with character representation is extracted based on symptom
US10553319B1 (en) * 2019-03-14 2020-02-04 Kpn Innovations, Llc Artificial intelligence systems and methods for vibrant constitutional guidance
CN111192680A (en) * 2019-12-25 2020-05-22 山东众阳健康科技集团有限公司 Intelligent auxiliary diagnosis method based on deep learning and collective classification
CN111241301A (en) * 2020-01-09 2020-06-05 天津大学 Knowledge graph representation learning-oriented distributed framework construction method
CN111460091A (en) * 2020-03-09 2020-07-28 杭州麦歌算法科技有限公司 Medical short text data negative sample sampling method and medical diagnosis standard term mapping model training method
CN112560900A (en) * 2020-09-08 2021-03-26 同济大学 Multi-disease classifier design method for sample imbalance

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108305158B (en) * 2017-12-27 2020-06-09 阿里巴巴集团控股有限公司 Method, device and equipment for training wind control model and wind control


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Improving disease prediction using ICD-9 ontological features;Mihail Popescu et al.;《2011 IEEE International Conference on Fuzzy Systems》;20110901;1-8 *
Research on the class imbalance problem based on Spark; Zhu Wenjing; China Masters' Theses Full-text Database, Information Science and Technology; 20210215; I138-714 *
A corpus of traditional Chinese medicine syndrome names based on part-of-speech tagging; You Zhengyang et al.; Electronic Technology & Software Engineering; 20171107; 177-178 *
Research on deep learning methods for ICD disease classification; Zhang Shurui et al.; Computer Engineering and Applications; 20211231; 172-180 *


Similar Documents

Publication Publication Date Title
CN106777891B (en) Data feature selection and prediction method and device
CN112256828B (en) Medical entity relation extraction method, device, computer equipment and readable storage medium
CN106934235A (en) Patient's similarity measurement migratory system between a kind of disease areas based on transfer learning
CN109770903A (en) The classification prediction technique of functional magnetic resonance imaging, system, device
CN111343147B (en) Network attack detection device and method based on deep learning
CN113053535B (en) Medical information prediction system and medical information prediction method
CN111243736A (en) Survival risk assessment method and system
WO2021179514A1 (en) Novel coronavirus patient condition classification system based on artificial intelligence
CN112201330A (en) Medical quality monitoring and evaluating method combining DRGs tool and Bayesian model
JP2020047234A (en) Data evaluation method, device, apparatus, and readable storage media
CN114048320B (en) Multi-label international disease classification training method based on course learning
CN112084330A (en) Incremental relation extraction method based on course planning meta-learning
CN116564409A (en) Machine learning-based identification method for sequencing data of transcriptome of metastatic breast cancer
CN112259232B (en) VTE risk automatic evaluation system based on deep learning
US11961204B2 (en) State visualization device, state visualization method, and state visualization program
MacCallum et al. Modeling multivariate change
CN111310792B (en) Drug sensitivity experiment result identification method and system based on decision tree
CN112071431B (en) Clinical path automatic generation method and system based on deep learning and knowledge graph
Özkan et al. Effect of data preprocessing on ensemble learning for classification in disease diagnosis
Lafit et al. Enabling analytical power calculations for multilevel models with autocorrelated errors through deriving and approximating the precision matrix
CN116959585A (en) Deep learning-based whole genome prediction method
CN113889274B (en) Method and device for constructing risk prediction model of autism spectrum disorder
Praserttitipong et al. Elective course recommendation model for higher education program.
CN113673609B (en) Questionnaire data analysis method based on linear hidden variables
Wheadon Classification accuracy and consistency under item response theory models using the package classify

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant