CN116110582A

CN116110582A - Health risk assessment method based on pre-training and multitasking bidirectional regulation mechanism

Info

Publication number: CN116110582A
Application number: CN202310119327.XA
Authority: CN
Inventors: 林绍福; 王梦真; 陈建辉
Original assignee: Beijing University of Technology
Current assignee: Beijing University of Technology
Priority date: 2023-02-15
Filing date: 2023-02-15
Publication date: 2023-05-12

Abstract

The invention discloses a health risk assessment method based on a pre-training and multi-task bidirectional regulation mechanism, relating to intelligent healthcare and deep learning. The method makes use of the large amount of single-visit physical examination data that is usually discarded, incorporates a pre-trained model to learn text representations and thereby reduces the need for a large training data set, and develops a multi-task learning framework with a bidirectional adjustment mechanism that fuses pre-training and fine-tuning and alleviates the upstream-downstream gap caused by the model forgetting its pre-trained parameters. Data from a medical institution in Hainan Province are used for training and testing, and experiments show that the model can effectively improve the accuracy of health risk assessment.

Description

Health risk assessment method based on pre-training and multitasking bidirectional regulation mechanism

Technical Field

The invention relates to the fields of intelligent healthcare, pre-training techniques, and multi-task learning, and in particular to a health risk assessment method based on a pre-training and multi-task bidirectional regulation mechanism.

Background

With the rapid development of internet-based healthcare, the variety and volume of medical data are growing at an unprecedented rate. This makes better health management possible and opens up new methods for preventing, predicting, and treating chronic diseases. Among internet applications built on medical data, developing an effective chronic disease risk prediction model is of great value for chronic disease management. Although medicine continues to advance and interest in precision medicine is growing, most diagnoses are still made only after patients begin to show obvious signs of disease. Early diagnosis and detection give patients and caregivers the opportunity for early intervention, better disease management, and efficient allocation of medical resources. For people in the preclinical stage of a chronic disease or at risk of one, lifestyle changes or effective drug treatment can significantly slow disease progression, so prevention of chronic disease is particularly important.

In recent years, deep learning models have been widely applied in fields such as finance, materials science, and environmental science, and digital medical records such as electronic health records (EHR) have become increasingly complete, opening up more possibilities for deep learning in medicine. A growing body of research focuses on electronic health records, building models from and analyzing real-world data, and has advanced clinical question-answering systems, health early-warning models, computer-aided diagnosis, and electronic prescription recommendation. However, medical systems are complex and not standardized, data quality is uneven, and cleaning and labeling are time-consuming and labor-intensive, so large amounts of well-organized EHR data are difficult to obtain. Transfer learning can address the data dependence of deep learning models and the shortage of training samples. The current trend in transfer learning is to pre-train on a large general-purpose data set and then transfer the acquired knowledge to a target task through fine-tuning, which improves the accuracy of models trained on small samples; many studies have shown that this pre-training and fine-tuning approach is effective on natural language processing tasks. However, pre-training large models is costly, so many studies build on publicly released pre-trained models. For specialized domains such as medicine, academic papers, and legal documents, general-purpose pre-trained language models can be less effective, because these texts contain many technical terms and expressions and differ from the general text used for pre-training.

In electronic health record data, homogeneous data from different institutions are difficult to obtain, and single-visit records in patients' physical examination histories far outnumber multi-visit records. Most current research focuses on longitudinal visit records and discards the single-visit data. Using this large amount of single-visit physical examination data for model pre-training keeps the pre-training and fine-tuning data sets similar, which can improve the transfer effect of the model. Existing pre-training schemes usually design several tasks in order to learn features of the data along different dimensions, but different tasks have different learning strategies and should carry different weight coefficients, a point that existing pre-trained models do not take into account. This study incorporates a pre-trained model to learn text representations and uses the large amount of discarded single-visit physical examination data to provide a better model initialization, which effectively mitigates overfitting on small samples in the subsequent fine-tuning task. At the same time, the need for a large training data set is reduced, potentially yielding a more general model. In addition, a dynamic adjustment mechanism for the loss weights is added to the pre-training tasks, so that each task is optimized as far as possible without interfering with the others.

To this end, this patent proposes a health risk assessment model based on a pre-training and multi-task bidirectional adjustment mechanism. The pre-training tasks of the Transformer model are adapted: a random masking operation is incorporated, and a self-composition prediction task and a target prediction task are designed, so that through these two pre-training tasks the model fully learns the relationships among the elements and the relationship between each element and the target element. A multi-task learning framework with a bidirectional adjustment mechanism is also developed; it fuses pre-training and fine-tuning and alleviates the upstream-downstream gap caused by the model forgetting its pre-trained parameters, thereby preventing the overfitting caused by local optima in small-sample settings.

Disclosure of Invention

The invention aims to provide a health risk assessment model based on a pre-training and multi-task bidirectional regulation mechanism that outputs an assessed probability of health risk. First, adapted pre-training tasks are proposed, and a Transformer model is used for model pre-training. Second, a multi-task framework with a bidirectional adjustment mechanism is proposed that fuses pre-training and fine-tuning and addresses the overfitting caused by the model forgetting its pre-trained parameters: the pre-trained parameters are adjusted through several gradient updates, and the whole model is then trained on the downstream prediction task using the updated parameters, which speeds up convergence.

The specific steps of the invention are as follows:

(1) Acquire a personal health record data set, preprocess the data, divide it into single-visit and multi-visit data, and store the single-visit and multi-visit data as PKL files.

(2) Build dictionary embedding representations from the PKL files of step (1), and pre-train on the single-visit PKL data with a Transformer to construct a pre-training module that fully mines the information in the single-visit data.

(3) Using the multi-visit PKL data, aggregate all visit embeddings, construct a health risk assessment model on top of the pre-trained model of step (2), and predict the current risk state.

(4) Connect the pre-training module and the health risk assessment module with a multi-task learning framework and perform bidirectional adjustment between the two modules. This prevents the overfitting caused by local optima in small-sample settings and ensures that, in the overall framework, risk assessment is the main task and the pre-training module is a sub-task that uses the discarded data to assist the main task.
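As an illustration of the data split in step (1), the following minimal sketch groups toy records by patient, separates single-visit from multi-visit patients, and serializes both groups with pickle. The record layout and the function name split_by_visit_count are assumptions made for this example and do not come from the source data.

```python
import pickle
from collections import defaultdict

# Hypothetical toy records: (patient_id, exam_id, age). The layout is illustrative only.
records = [
    ("p1", "e1", 52),
    ("p2", "e2", 47),
    ("p2", "e3", 48),
    ("p3", "e4", 60),
]


def split_by_visit_count(rows):
    """Group rows by patient ID and separate single-visit from multi-visit patients."""
    by_patient = defaultdict(list)
    for row in rows:
        by_patient[row[0]].append(row)
    single = [r for visits in by_patient.values() if len(visits) == 1 for r in visits]
    multi = [r for visits in by_patient.values() if len(visits) > 1 for r in visits]
    return single, multi


single_visits, multi_visits = split_by_visit_count(records)

# Serialize to pickle bytes; writing real PKL files would use pickle.dump on an open file.
single_blob = pickle.dumps(single_visits)
multi_blob = pickle.dumps(multi_visits)
```

Grouping by patient ID before splitting keeps all visits of one patient in the same file, which the later multi-visit aggregation relies on.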

In step (1), a data set is constructed from personal health examination data (person_health_exam) obtained from a medical institution in Hainan Province. The examination data contain various kinds of information, such as physical examination results, laboratory tests, and lifestyle habits. Hypertension is selected as the chronic disease for health risk assessment. According to the 2018 revision of the Chinese hypertension control guidelines, risk factors related to the onset of hypertension include overweight and obesity, excessive drinking, age, lack of physical activity, and dyslipidemia. The experiment therefore links the de-identified health examination records to the personal records and retains 20 attributes (e.g., age, waist circumference, BMI). Among these, height, weight, waist circumference, and BMI reflect overweight and obesity; total cholesterol (TCHO), triglycerides (TG), serum low-density lipoprotein cholesterol (LDLC), and serum high-density lipoprotein cholesterol (HDLC) reflect dyslipidemia; and exercise frequency (EXERCISE_FREQ_CODE) and drinking frequency (DRINKING_FREQ_CODE) reflect lack of physical activity and excessive drinking. In addition, body temperature, heart rate, pulse, systolic pressure, and diastolic pressure are included to reflect the patient's current physical state objectively. The model in this study relies mainly on objective anthropometric data and also incorporates lifestyle data, including smoking, drinking, and exercise frequency.

Whether a physical examination record was taken before the onset of illness and within three years of it is determined from two fields of the basic personal health record information (PERSON_INFO), "whether the patient has hypertension" and "date of hypertension diagnosis", combined with the "physical examination time" of the personal health examination data. A record taken within the three years before the patient became ill is a positive sample; records from patients never diagnosed with hypertension and records taken more than three years before the onset of illness are negative samples.
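The labeling rule above can be sketched as follows. This is a minimal illustration under stated assumptions: the function name label_exam is hypothetical, "three years" is approximated as 1095 days, and examinations taken on or after the diagnosis date return None because the text does not say how they are handled.

```python
from datetime import date

# Assumption: "within three years" is approximated as 3 * 365 days.
WINDOW_DAYS = 3 * 365


def label_exam(has_hypertension, diagnosis_date, exam_date):
    """Return 1 for a positive sample, 0 for a negative sample, None if undefined."""
    # Patients never diagnosed with hypertension contribute negative samples.
    if not has_hypertension or diagnosis_date is None:
        return 0
    days_before = (diagnosis_date - exam_date).days
    # Assumption: exams on or after the diagnosis date are outside the rule.
    if days_before <= 0:
        return None
    # Positive if the exam falls within the window before diagnosis, else negative.
    return 1 if days_before <= WINDOW_DAYS else 0


positive = label_exam(True, date(2020, 6, 1), date(2019, 1, 15))
too_early = label_exam(True, date(2020, 6, 1), date(2015, 1, 15))
healthy = label_exam(False, None, date(2019, 1, 15))
```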

In step (2), each electronic health record is converted into a set of multidimensional embeddings that serve as input to the subsequent module. Each record contains the patient ID, physical examination ID, age, body temperature, pulse rate, respiratory rate, left systolic pressure, left diastolic pressure, right systolic pressure, right diastolic pressure, height, weight, waist circumference, BMI, heart rate, total cholesterol, triglycerides, low-density lipoprotein cholesterol, high-density lipoprotein cholesterol, exercise frequency code, smoking status code, and drinking frequency code. The embedded representation of a record is built from the one-hot encoding over each dictionary: each column has its own dictionary, and the embedding of a value is looked up by its index in that dictionary for subsequent model pre-training.
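The per-column dictionaries can be sketched as follows. This is a minimal illustration on three toy rows with three of the listed columns; the helper names build_vocab and one_hot, the lower-case column names, and the first-seen index order are assumptions made for the example.

```python
def build_vocab(values):
    """Map each distinct value of one column to an integer index (first-seen order)."""
    vocab = {}
    for value in values:
        if value not in vocab:
            vocab[value] = len(vocab)
    return vocab


def one_hot(index, size):
    """Return a one-hot list of the given size with a 1 at the given index."""
    return [1 if i == index else 0 for i in range(size)]


# Hypothetical toy rows; the real records carry the full attribute list.
rows = [
    {"age": 52, "bmi": 24.1, "drinking_freq_code": "2"},
    {"age": 47, "bmi": 27.8, "drinking_freq_code": "1"},
    {"age": 52, "bmi": 22.0, "drinking_freq_code": "2"},
]
columns = ["age", "bmi", "drinking_freq_code"]

# One dictionary per column, then each record becomes a list of dictionary indices.
vocabs = {c: build_vocab(r[c] for r in rows) for c in columns}
encoded = [[vocabs[c][r[c]] for c in columns] for r in rows]
```

In a neural model, each index would typically select a row of a learned embedding matrix for that column, which is equivalent to multiplying the one-hot vector by the matrix.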

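Similar to BERT, we pre-train on single-visit data. First, a random masking operation is applied to the complete physical examination record $P_t^{(n)}$ of patient $n$. Because the data are single-visit data, the visit index $t$ can be dropped, and each attribute value of patient $n$ is obtained as:

$$\mathrm{adm}^{(n)}[i] = \mathrm{Transformer}\left(\mathrm{random\_mask}\left(P^{(n)}[i]\right)\right)$$

where $0 \le i \le |P|$, $i$ is an integer, and $|P|$ is the number of attributes in one physical examination record. random_mask() is a random masking function, which can be expressed as:

$$\mathrm{random\_mask}(x) = \begin{cases} [\mathrm{MASK}] & \text{with probability } 0.8 \\ x & \text{with probability } 0.1 \\ \text{a random value from the dictionary of } x & \text{with probability } 0.1 \end{cases}$$

The complete visit record admissions can thus be expressed as:

$$\mathrm{admissions}^{(n)} = \mathrm{adm}^{(n)}[0] \oplus \mathrm{adm}^{(n)}[1] \oplus \cdots \oplus \mathrm{adm}^{(n)}[|P|]$$

where $\oplus$ denotes the concatenation of $\mathrm{adm}[0], \mathrm{adm}[1], \ldots, \mathrm{adm}[|P|]$.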
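The random masking step can be sketched as below, assuming the BERT-style scheme stated in claim 3 (80% [MASK], 10% unchanged, 10% random replacement). The selection probability select_prob, the per-column vocab_sizes argument, and the function signature are assumptions for illustration; the text does not state what fraction of attributes is selected for masking.

```python
import random

MASK = "[MASK]"


def random_mask(codes, vocab_sizes, select_prob=0.15, rng=None):
    """Mask dictionary indices BERT-style; return (masked codes, selected positions)."""
    rng = rng or random.Random()
    masked, selected = [], []
    for i, code in enumerate(codes):
        # Leave the attribute untouched unless it is selected for masking.
        if rng.random() >= select_prob:
            masked.append(code)
            continue
        selected.append(i)
        roll = rng.random()
        if roll < 0.8:
            masked.append(MASK)  # 80%: replace with the mask token
        elif roll < 0.9:
            masked.append(code)  # 10%: keep the original index
        else:
            masked.append(rng.randrange(vocab_sizes[i]))  # 10%: random index
    return masked, selected


# Example: mask one encoded record with a fixed seed for repeatability.
example_masked, example_selected = random_mask(
    [3, 0, 7, 2], [10, 4, 12, 5], select_prob=0.5, rng=random.Random(0)
)
```

The returned positions identify which attributes the self-composition prediction task has to recover.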

Unlike the BERT model, position_embedding is removed because there is no ordering among the attribute values. In single-visit records, no visit record is related to any other, so segment_embedding has no practical meaning either. Only word_embedding is therefore retained, and the model's prediction depends on the attribute values. In this study, some patient sign data or lifestyle data were missing because the examination items of individual physical examination records were recorded incompletely or incorrectly. Although missing items are filled with neighboring values from the existing data, the model may still have difficulty adapting to downstream tasks when important examination records are absent. Therefore, building on the random masking operation, we designed two pre-training tasks to strengthen the model's self-prediction and target prediction.
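Filling missing items with neighboring values can be sketched as follows. This is one plausible reading, assumed here for illustration: a missing entry takes the nearest preceding value in the column, and leading gaps take the nearest following value. The function name fill_with_neighbor is hypothetical.

```python
def fill_with_neighbor(values):
    """Fill None entries with the previous known value, then the next known value."""
    filled = list(values)
    # Forward pass: carry the last known value into later gaps.
    last = None
    for i, value in enumerate(filled):
        if value is None:
            filled[i] = last
        else:
            last = value
    # Backward pass: fill any leading gaps from the first known value.
    nxt = None
    for i in range(len(filled) - 1, -1, -1):
        if filled[i] is None:
            filled[i] = nxt
        else:
            nxt = filled[i]
    return filled


body_temperature = fill_with_neighbor([None, 36.5, None, None, 37.0, None])
```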

Self-composition prediction task: because a random masking operation is applied to the patient visit record during pre-training, a self-composition prediction task is designed to give the model stronger self-prediction capability. The loss function of the self-composition prediction task is as follows:

$$\mathrm{loss\_self} = \sum_{i}\left[-\log p\left(P^{(n)}[i] \mid \mathrm{adm}^{(n)}[i]\right) + \sum_{p_i \in \mathrm{VOC}_i \setminus P^{(n)}[i]} \log p\left(p_i \mid \mathrm{adm}^{(n)}[i]\right)\right]$$

where $P^{(n)}[i]$ is the $i$-th attribute value in the visit record of the $n$-th patient; $\mathrm{adm}^{(n)}[i]$ is the embedding of the $i$-th attribute value of the $n$-th patient's visit record after random masking and the Transformer model; and $p_i \in \{\mathrm{VOC}_i \setminus P^{(n)}[i]\}$ is any value in the $i$-th dictionary other than $P^{(n)}[i]$. Minimizing $\mathrm{loss\_self}$ makes the predicted attribute value more likely to equal the true value and less likely to equal any other value.
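A minimal sketch of one way to realize this objective for a single attribute, assuming a softmax over the entries of the i-th dictionary and a loss that subtracts the log-probability of the true value and adds the log-probabilities of all other values. The function names and the use of raw logits as input are assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def self_prediction_loss(logits, true_index):
    """-log p(true value) plus the sum of log p(other values) for one attribute."""
    probs = softmax(logits)
    loss = -math.log(probs[true_index])
    loss += sum(math.log(p) for j, p in enumerate(probs) if j != true_index)
    return loss


# A confident, correct prediction gives a lower loss than a uniform guess.
uniform_loss = self_prediction_loss([0.0, 0.0, 0.0], true_index=0)
confident_loss = self_prediction_loss([3.0, 0.0, 0.0], true_index=0)
```

Taken literally, this form is unbounded below as the probabilities of the other values approach zero, so a practical implementation might prefer a bounded variant such as binary cross-entropy per dictionary entry; that substitution is not stated in the source.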

Target prediction task: the purpose of pre-training is ultimately to predict whether the patient has hypertension, so this task defines a loss function aimed at the final prediction target:

$$\mathrm{loss\_target} = -\sum_{i} \log p\left(\mathrm{is\_hyper}^{(n)} \mid \mathrm{adm}^{(n)}[i]\right)$$

where $p\left(\mathrm{is\_hyper}^{(n)} \mid \mathrm{adm}^{(n)}[i]\right)$ is the probability with which the embedding of the $i$-th attribute of the $n$-th patient, after random masking and the Transformer model, predicts the patient's "is_hyper" attribute. The task is trained to minimize this target prediction loss.
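The target prediction loss can be illustrated with the sketch below, assuming each attribute embedding yields one logit for the "is_hyper" label and that the per-attribute negative log-likelihoods are summed. The function names and the sigmoid parameterization are assumptions, not details given in the text.

```python
import math


def softplus(x):
    """Stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def target_prediction_loss(attr_logits, is_hyper):
    """Sum over attributes of -log p(is_hyper | attribute), with p from a sigmoid."""
    total = 0.0
    for z in attr_logits:
        # -log(sigmoid(z)) = softplus(-z); -log(1 - sigmoid(z)) = softplus(z)
        total += softplus(-z) if is_hyper == 1 else softplus(z)
    return total


neutral = target_prediction_loss([0.0, 0.0], is_hyper=1)
agreeing = target_prediction_loss([2.0, 2.0], is_hyper=1)
disagreeing = target_prediction_loss([-2.0, -2.0], is_hyper=1)
```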

The self-composition prediction task and the target prediction task are combined to form the final optimization objective. The weights of the total loss function are adjusted dynamically so that both pre-training tasks are optimized as far as possible without interfering with each other, preventing one task from dominating while the other cannot be fully optimized.

The total loss function is expressed as follows:

$$\mathrm{loss\_total\_pre} = a \cdot \mathrm{loss\_self} + b \cdot \mathrm{loss\_target}$$

where $a$ and $b$ are the dynamic adjustment parameters of the weights. Assume that the gradient of $\mathrm{loss\_self}$ is greater than that of $\mathrm{loss\_target}$, and that the value of $\mathrm{loss\_self}$ is likewise greater than that of $\mathrm{loss\_target}$. At the first parameter update, because the value of $\mathrm{loss\_self}$ is larger and the negative gradient with respect to $a$ is $-\mathrm{loss\_self}$, $a$ is reduced by a larger amount than $b$. In the next update, the update direction of the model parameters is $-(a \nabla \mathrm{loss\_self} + b \nabla \mathrm{loss\_target})$; since $a$ is now smaller than $b$, $\mathrm{loss\_self}$ has a smaller effect on the other gradient, which increases the probability that the other loss moves towards its current minimum, so that the minimum of the final total loss lies closer to the minimum of each individual loss.
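The weight update described above can be traced with a small worked example. It assumes the total pre-training loss is a * loss_self + b * loss_target and that a and b are updated by plain gradient descent; the learning rate of 0.1, the loss values, and the function names are illustrative assumptions.

```python
def total_pretrain_loss(a, b, loss_self, loss_target):
    """Weighted sum of the two pre-training losses."""
    return a * loss_self + b * loss_target


def update_weights(a, b, loss_self, loss_target, lr=0.1):
    """One gradient-descent step on the weights a and b."""
    # d(total)/da = loss_self and d(total)/db = loss_target,
    # so the task with the larger loss has its weight reduced more.
    return a - lr * loss_self, b - lr * loss_target


# Start with equal weights; loss_self is the larger of the two losses.
a0, b0 = 1.0, 1.0
a1, b1 = update_weights(a0, b0, loss_self=2.0, loss_target=0.5)
```

Here a drops from 1.0 to about 0.8 while b drops only to about 0.95, so the dominant task contributes less to the next parameter update. Unconstrained descent would eventually push both weights towards zero or below, so a practical scheme would need a normalization or regularization term that the text does not describe.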

In step (3), the final objective is to predict health risk from multiple sequential physical examination records. We aggregate all visit embeddings and add a prediction layer for the hypertension risk prediction task. Specifically, the risk assessment at a patient's $t$-th physical examination depends on the previous $t-1$ examination records, their disease outcomes, and the $t$-th examination record, so the current risk state can be predicted more effectively from the patient's multiple examination records:

$$\hat{y}_t = \mathrm{MLP}\left(\left(\frac{1}{t-1}\sum_{\tau=1}^{t-1}\mathrm{admissions}_\tau\right) \oplus \mathrm{admissions}_t\right)$$

That is, the attribute embeddings of the previous $t-1$ examination records are averaged and concatenated with the embedding of the $t$-th record, and the predicted risk probability $\hat{y}_t$ is output. The true label of the $t$-th physical examination is denoted $y_t$.
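The aggregation of visit embeddings can be sketched as follows: the embeddings of the earlier visits are averaged element-wise and concatenated with the embedding of the latest visit. The single logistic output layer stands in for the prediction layer and, like the function names and toy vectors, is an assumption for illustration.

```python
import math


def mean_vectors(vectors):
    """Element-wise mean of equally sized vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def aggregate_visits(visit_embeddings):
    """Average the first t-1 visit embeddings and append the t-th embedding."""
    if len(visit_embeddings) < 2:
        raise ValueError("multi-visit aggregation needs at least two visits")
    *previous, current = visit_embeddings
    return mean_vectors(previous) + list(current)


def predict_risk(features, weights, bias):
    """Toy prediction layer: logistic regression on the aggregated features."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


features = aggregate_visits([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
risk = predict_risk(features, weights=[0.0, 0.0, 0.0, 0.0], bias=0.0)
```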

The total loss function is defined as follows:

In step (4), this study uses multi-task learning to connect the pre-training module and the disease risk prediction module and performs bidirectional adjustment between the two. This prevents the overfitting caused by local optima in small-sample settings, eliminates the gap between the pre-training module and the prediction model, and ensures that, in the overall framework, disease prediction is the main task and the pre-training module is a sub-task that uses the discarded data to assist the main task.

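To bias the model as a whole towards the hypertension prediction task, this study uses a MAML update strategy and incorporates the idea of eliminating the gap between the pre-trained model and the prediction model. For the pre-training task, we adjust the prior parameters $\phi$ of the Transformer model by one or several gradient descent steps. With the learning rate set to $\alpha$ for the dual adaptation of the pre-training task, the new prior parameters $\phi'$ can be expressed as:

$$\phi' = \phi - \alpha \nabla_{\phi}\, \mathrm{loss\_total\_pre}(\phi)$$

The parameters $\omega'$ of the hypertension prediction model are then obtained by training the whole model on the downstream prediction task starting from the updated parameters $\phi'$.

Further, we define the total loss function as follows:

$$\mathrm{loss\_total} = \mathrm{loss\_predict}(\omega') + \lambda_{total\_pre} \cdot \mathrm{loss\_total\_pre}(\phi')$$

where $\mathrm{loss\_predict}$ is the loss of the hypertension prediction task, $\mathrm{loss\_total\_pre}$ is the total loss of the pre-training tasks, and $\lambda_{total\_pre}$ is the weight of the pre-training task loss, typically set to a value less than 1 so that the multi-task framework is biased towards the hypertension prediction module.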
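The bidirectional adjustment can be illustrated on a toy problem with one shared parameter phi (standing in for the Transformer prior) and one head parameter omega (standing in for the prediction model). The quadratic losses, the outer learning rate beta, the first-order gradient approximation, and all function names are assumptions for this sketch; the text names only the learning rate alpha and the weight lambda.

```python
def pre_loss(phi):
    """Toy pre-training loss with its minimum at phi = 1."""
    return (phi - 1.0) ** 2


def pred_loss(phi, omega):
    """Toy prediction loss with its minimum where phi + omega = 3."""
    return (phi + omega - 3.0) ** 2


def bidirectional_step(phi, omega, alpha=0.1, beta=0.1, lam=0.5):
    """One inner adaptation of phi, then a joint update on the weighted total loss."""
    # Inner step: adapt the prior parameter on the pre-training loss.
    phi_adapted = phi - alpha * 2.0 * (phi - 1.0)
    # Outer step: gradients of pred_loss + lam * pre_loss at the adapted parameter
    # (first-order approximation; second-order MAML terms are ignored).
    grad_pred = 2.0 * (phi_adapted + omega - 3.0)
    grad_phi = grad_pred + lam * 2.0 * (phi_adapted - 1.0)
    return phi_adapted - beta * grad_phi, omega - beta * grad_pred


def total_loss(phi, omega, lam=0.5):
    """Prediction loss plus the down-weighted pre-training loss."""
    return pred_loss(phi, omega) + lam * pre_loss(phi)


phi, omega = 0.0, 0.0
initial_total = total_loss(phi, omega)
for _ in range(200):
    phi, omega = bidirectional_step(phi, omega)
final_total = total_loss(phi, omega)
```

With lam below 1, the prediction loss dominates the joint gradient, which mirrors the stated intent of biasing the framework towards the prediction task while still keeping the shared parameter close to its pre-trained optimum.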

Drawings

Fig. 1 is a diagram of the overall architecture of the invention. The architecture is divided into three modules: an input module, a model details module, and an output module. The input comprises single-visit data and multi-visit data. The single-visit data are fed into the pre-training model, for which two pre-training tasks are set, the self-composition prediction task and the target prediction task. The pre-trained model is passed to the downstream prediction task and trained on the multi-visit data. The resulting model is then passed to the multi-task framework that combines pre-training and prediction and is trained again, achieving the bidirectional adjustment, and finally the model's prediction result is output.

Detailed Description

For a better understanding of the technical solution of the present invention, the following detailed description of the embodiments of the present invention refers to the accompanying drawings.

It should be understood that the described embodiments are merely some, but not all, embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.

Example 1

Fig. 1 shows the overall architecture of the invention.

The architecture is divided into three modules: an input module, a model details module, and an output module. The input comprises single-visit data and multi-visit data. The single-visit data are fed into the pre-training model, for which two pre-training tasks are set, the self-composition prediction task and the target prediction task. The pre-trained model is passed to the downstream prediction task and trained on the multi-visit data. The resulting model is then passed to the multi-task framework that combines pre-training with downstream prediction and is trained again, achieving the bidirectional adjustment, and finally the model's prediction result is output.

The embodiments of the invention have been described in detail above with reference to the drawings; the description is provided solely to aid understanding of the invention. Those skilled in the art may, in accordance with the ideas of the invention, make changes to the specific embodiments and the scope of application. In view of the above, this description should not be construed as limiting the invention.

Claims

1. A health risk assessment method based on a pre-training and multi-task bidirectional regulation mechanism is characterized by comprising the following steps of:

step (1) acquiring a personal health record data set, preprocessing the personal health record data, dividing the data into single-visit and multi-visit data, and storing the single-visit and multi-visit data as PKL files;

step (2) building dictionary embedding representations from the PKL files of step (1), and pre-training on the single-visit PKL data with a Transformer to construct a pre-training module that fully mines the single-visit data information;

step (3) using the multi-visit PKL data, aggregating all visit embeddings, constructing a health risk assessment model based on the pre-trained model of step (2), and predicting the current risk state;

step (4) connecting the pre-training module and the health risk assessment module with a multi-task learning framework, and performing bidirectional adjustment between the pre-training module and the risk assessment module to prevent the overfitting caused by local optima in small-sample settings, thereby ensuring that, in the overall framework, risk assessment is the main task and the pre-training module is a sub-task that uses the discarded data to assist the main task.

2. The method for assessing health risk based on a pretraining and multitasking bidirectional adjustment mechanism of claim 1, wherein in step (1), the step of preprocessing the personal health profile data comprises:

(1) Hypertension is selected as the chronic disease for risk assessment, and the risk factors related to the onset of hypertension are determined as: overweight and obesity, excessive drinking, age, lack of physical activity, and dyslipidemia;

(2) Duplicates are removed from the extracted data, and blank values are filled with adjacent values;

(3) Whether a record is physical examination data taken before the onset of illness and within three years of it is determined from the two fields "whether the patient has hypertension" and "date of hypertension diagnosis", combined with the physical examination time of the examination data;

(4) The data set is divided into a single-visit data set and a multi-visit data set, single.pkl and multi.pkl respectively;

(5) A separate dictionary is built for each coded field in the single-visit file and the multi-visit file, such as an age dictionary, a body temperature dictionary, and a pulse rate dictionary, and the dictionaries cover the data of both multiple visits and single visits;

(6) The "patient ID" values in the multi-visit data set multi.pkl are randomly divided into training, validation, and test sets at a ratio of 4:1:1.
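The 4:1:1 split by patient ID can be sketched as follows. Splitting on patient IDs rather than on individual records keeps all visits of one patient in the same subset. The function name, the fixed seed, and the rounding rule (any remainder goes to the test set) are assumptions for illustration.

```python
import random


def split_patient_ids(patient_ids, ratios=(4, 1, 1), seed=0):
    """Shuffle unique patient IDs and split them into train, validation, and test."""
    ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(ids)
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # any rounding remainder lands in the test set
    return train, val, test


# Hypothetical patient IDs p0 .. p11.
train_ids, val_ids, test_ids = split_patient_ids([f"p{i}" for i in range(12)])
```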

3. The method of claim 1, wherein in step (2), pre-training is performed on the single-visit data, making use of data that would otherwise be discarded; a visit embedding is generated for each EHR record from its dictionary embeddings, [CLS] is used as the initial token of each visit embedding sequence, and each visit embedding is padded where necessary so that all input vectors have the same length; based on the dictionary embedding representation, the single-visit data are pre-trained with a Transformer to fully mine the single-visit data information, comprising the following steps:

(1) Converting each electronic health record into a set of multidimensional embeddings as input to the subsequent module; the embedded representation of a record is built from the one-hot encoding over each dictionary, that is, each column has its own dictionary, and the embedding of a value is looked up by its index in that dictionary for subsequent model pre-training;

(2) Pre-training on the single-visit data with a Transformer; two pre-training tasks are designed to strengthen the model's self-prediction and target prediction:

self-composition prediction task: a random masking operation is applied to the patient visit record during pre-training, in which some embeddings are randomly masked: of the masked codes, 80% are replaced by [MASK], 10% are left unchanged, and 10% are replaced with a random code; the self-composition prediction task is designed so that the model has stronger self-prediction capability;

target prediction task: the purpose of pre-training is ultimately to predict whether the patient has hypertension, so this task performs model prediction aimed at the final prediction target.

4. The health risk assessment method based on the pre-training and multi-task bidirectional regulation mechanism according to claim 1, wherein in step (3), the final chronic disease early warning using the MLP module proceeds as follows:

(1) Converting the multi-visit EHR data into health examination feature embeddings; computing, over the earlier visits in the multi-visit data, the mean of the embeddings of the patient ID, physical examination ID, age, body temperature, pulse rate, respiratory rate, left systolic pressure, left diastolic pressure, right systolic pressure, right diastolic pressure, height, weight, waist circumference, BMI, heart rate, total cholesterol, triglycerides, low-density lipoprotein cholesterol, high-density lipoprotein cholesterol, exercise frequency code, smoking status code, and drinking frequency code; concatenating this mean with the embedding of the last health examination's features; and inputting the result into the prediction module;

(2) Obtaining whether the patient has hypertension at prediction time t and using it as the classification label; the Jaccard, F1, and PR-AUC metrics are computed from the predicted and true values to evaluate the model.
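The three evaluation metrics can be computed for binary labels as in the sketch below. It is written in plain Python for self-containment; PR-AUC is approximated here by average precision, and ties between scores are not treated specially. Both choices, and the function names, are assumptions for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Return (true positives, false positives, false negatives)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, fn


def jaccard(y_true, y_pred):
    """Jaccard index of the positive class: TP / (TP + FP + FN)."""
    tp, fp, fn = confusion_counts(y_true, y_pred)
    denom = tp + fp + fn
    return tp / denom if denom else 0.0


def f1_score(y_true, y_pred):
    """F1 of the positive class: 2TP / (2TP + FP + FN)."""
    tp, fp, fn = confusion_counts(y_true, y_pred)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def average_precision(y_true, scores):
    """Mean of precision values at the rank of each positive example."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


labels = [1, 0, 1, 1, 0]
predictions = [1, 0, 0, 1, 1]
risk_scores = [0.9, 0.2, 0.4, 0.8, 0.7]
```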

5. The method for assessing health risk based on a bi-directional adjustment mechanism of claim 1, wherein in step (4), the pre-training module and the health risk assessment module are connected by using multi-task learning, and bi-directional adjustment is performed between the pre-training module and the risk assessment module, as follows:

(1) Adopting a MAML update strategy and incorporating the idea of eliminating the gap between the pre-trained model and the prediction model; for the pre-training task, adjusting the prior parameters φ of the Transformer model through one or more gradient descent steps, with the learning rate set to α for the dual adaptation of the pre-training task;

(2) Obtaining the pre-training prior parameters and the hypertension early warning model parameters;

(3) Defining the overall loss function of the multi-task learning framework, updating the parameters by back-propagating this overall loss function, and adjusting the weights of the pre-training task loss and the final task loss using dynamic weights.