CN114386479A

CN114386479A - Medical data processing method and device, storage medium and electronic equipment

Info

Publication number: CN114386479A
Application number: CN202111498958.4A
Authority: CN
Inventors: 王振常; 郑伟; 任鹏玲; 罗德红; 蔡林坤; 赵二伟; 刘雅文; 张婷婷; 吕晗; 刘冬; 尹红霞; 赵鹏飞; 李静
Original assignee: Beijing Friendship Hospital
Current assignee: Beijing Friendship Hospital
Priority date: 2021-12-09
Filing date: 2021-12-09
Publication date: 2022-04-22
Anticipated expiration: 2041-12-09
Also published as: CN114386479B

Abstract

The specification discloses a medical data processing method, a medical data processing device, a storage medium and electronic equipment, which can at least partially solve the technical problem in the related art that insufficient training samples degrade the data processing performance of a model. The medical data processing method evaluates a plurality of amplification strategies and determines from among them a target strategy that is better suited to amplification. The target sample set is then amplified with the target strategy to obtain a training sample set. Because the target strategy is selected from the plurality of amplification strategies on the basis of the evaluation results, on the one hand the training sample set obtained with the target strategy contains enough samples; on the other hand, the amplified samples it contains carry a low risk of negatively affecting the training of the medical prediction model. A medical prediction model trained on this training sample set can therefore achieve better model performance.

Description

Medical data processing method and device, storage medium and electronic equipment

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a medical data processing method and apparatus, a storage medium, and an electronic device.

Background

AI (Artificial Intelligence) techniques have considerable potential to increase the speed and accuracy of medical data processing. However, before an artificial intelligence model acquires a useful capability for processing medical data, it must be trained extensively, and this training usually requires a large number of samples. If too few samples are available during training, the model may overfit. In the field of medical data processing in particular, the training samples (which are derived from medical data) are often difficult and costly to obtain, and the output of a model that processes medical data may bear on life and health, so abnormal model performance may cause great loss.

Disclosure of Invention

The embodiments of the present specification provide a medical data processing method, apparatus, storage medium, and electronic device, so as to partially solve the above problems in the prior art.

The embodiment of the specification adopts the following technical scheme:

In a first aspect, the present application provides a medical data processing method, comprising: amplifying an original sample set with each of a plurality of amplification strategies to obtain a plurality of amplification sets; evaluating the plurality of amplification sets; selecting a target strategy from the plurality of amplification strategies based on the evaluation; amplifying a target sample set according to the target strategy to obtain a training sample set; training a medical prediction model with the training sample set; acquiring medical data; and taking the medical data as the input of the trained medical prediction model, executing the medical prediction model and outputting a medical prediction result.

In an alternative embodiment of the present description, the evaluating the plurality of amplification sets comprises: training a designated model respectively by using the plurality of amplification sets; and obtaining the evaluation results of the multiple amplification sets according to the model performance of the trained designated model.

In an alternative embodiment of the present description, the method comprises at least one of:

the specified model is obtained by training the original sample set;

the model performance is characterized by at least one of: accuracy, recall, F1-score.

In an alternative embodiment of the present description, the method further comprises: obtaining original sample data; if the original sample data presents periodic characteristics, constructing the original sample set according to a set formed by dividing results obtained by dividing the original sample data according to the data period of the original sample data; and if the original sample data does not present periodic characteristics, taking a set formed by the original sample data as the original sample set.

In an alternative embodiment of the present description, the data cycle includes any one of: the physiological cycle of the user to which the original sample data belongs, and the data acquisition cycle when the sample data is acquired.
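
The construction of the original sample set from periodic or non-periodic data can be sketched as follows. This is a minimal illustration, not the method of the specification: the function name `build_original_sample_set` and the assumption of a fixed, known period expressed as a number of data points are hypothetical. Real physiological cycles (for example cardiac cycles) vary in length and would need a cycle detection step that is not shown here.

```python
from typing import List, Optional, Sequence


def build_original_sample_set(
    data: Sequence[float], period: Optional[int]
) -> List[List[float]]:
    """Split `data` into one sample per complete period, or keep it whole.

    `period` is a hypothetical parameter: the cycle length in data points,
    or None when the data shows no periodic characteristic.
    """
    if period is None:
        # No periodic characteristic: the whole recording is a single sample.
        return [list(data)]
    if period <= 0:
        raise ValueError("period must be a positive number of points")
    # Assumption: only complete periods are kept; a trailing partial period is dropped.
    n_full = len(data) // period
    return [list(data[i * period:(i + 1) * period]) for i in range(n_full)]


signal = [0, 1, 0, -1, 0, 1, 0, -1, 0, 1]
segments = build_original_sample_set(signal, period=4)
whole = build_original_sample_set(signal, period=None)
```

With a period of 4 points, the ten-point signal above yields two complete segments and the last two points are discarded; with no period, the signal is kept as one sample.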

In an alternative embodiment of the present specification, the original sample set comprises positive samples and negative samples, one of the positive samples and the negative samples is a first sample, and the other one of the positive samples and the negative samples is a second sample, and the number of the first samples is smaller than the number of the second samples; wherein the method further comprises: respectively amplifying a first sample set consisting of first samples in the original sample set by adopting the multiple amplification strategies, so that the difference between the number of the first samples in the amplified first sample set and the number of second samples in the original sample set is not greater than a first number threshold; and amplifying a sample set consisting of the amplified first sample and the second sample in the original sample set by adopting the multiple amplification strategies to obtain multiple amplification sets.

In an alternative embodiment of the present description, the method comprises at least one of:

the original sample data in the original sample set is composed of time series;

the plurality of amplification strategies comprises: an amplification strategy based on an average sequence solution and an amplification strategy based on signal decomposition and reconstruction;

the medical data includes at least one of: heart rate data, pulse wave data, electrocardiogram data, myoelectric data, blood vessel ultrasonic frequency spectrum waves, blood flow data, blood pressure data, blood sugar data, blood oxygen data, body temperature data and blood cell data;

the specified model is any one of: a support vector machine, a logistic regression model, a decision tree model, a convolutional neural network model, an LSTM model.
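
The class-balancing embodiment described above (amplifying the first samples until their number is within a first number threshold of the number of second samples) can be sketched as follows. This is an illustration under stated assumptions only: the point-wise mean of two equal-length series is a simple stand-in and is not claimed to be the "average sequence" strategy of the specification, and the names `mean_of_pair` and `balance_first_samples` are hypothetical.

```python
import random
from typing import List

Series = List[float]


def mean_of_pair(a: Series, b: Series) -> Series:
    """Point-wise mean of two equal-length series (stand-in amplification)."""
    if len(a) != len(b):
        raise ValueError("series must have equal length")
    return [(x + y) / 2.0 for x, y in zip(a, b)]


def balance_first_samples(
    first: List[Series],
    second: List[Series],
    first_number_threshold: int,
    rng: random.Random,
) -> List[Series]:
    """Amplify the smaller class until the count gap is within the threshold."""
    if len(first) < 2:
        raise ValueError("need at least two first samples to average")
    amplified = list(first)
    while len(second) - len(amplified) > first_number_threshold:
        # Assumption: synthetic samples are always built from real first samples.
        a, b = rng.sample(first, 2)
        amplified.append(mean_of_pair(a, b))
    return amplified


first_samples = [[1.0, 2.0], [3.0, 4.0]]
second_samples = [[0.0, 0.0] for _ in range(6)]
balanced = balance_first_samples(
    first_samples, second_samples, first_number_threshold=1, rng=random.Random(0)
)
```

Here the first class grows from two to five samples, which leaves a gap of one with respect to the six second samples. The balanced first samples and the second samples would then together be amplified by each candidate strategy.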

In a second aspect, the present specification provides a medical data processing apparatus, the apparatus comprising:

an amplification set generation module configured to: respectively amplifying the original sample sets by adopting a plurality of amplification strategies to obtain a plurality of amplification sets;

an evaluation module configured to: evaluating the plurality of amplification sets;

a target policy determination module configured to: selecting a target strategy from a plurality of amplification strategies based on the evaluation;

a training sample set determination module configured to: amplifying the target sample set according to a target amplification strategy to obtain a training sample set;

a training module configured to: training the medical prediction model by utilizing a training sample set;

a medical data acquisition module configured to: acquiring medical data;

a prediction module configured to: and taking the medical data as the input parameter of the trained medical prediction model, executing the medical prediction model and outputting a medical prediction result.

The present specification provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the medical data processing method described above.

The electronic device provided by the present specification comprises a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the medical data processing method when executing the program.

The embodiment of the specification adopts at least one technical scheme which can achieve the following beneficial effects:

The medical data processing method, the medical data processing device, the storage medium and the electronic equipment in the embodiments of the present specification can at least partially solve the technical problem in the related art that insufficient training samples degrade the data processing performance of a model. The medical data processing method evaluates a plurality of amplification strategies and determines from among them a target strategy that is better suited to amplification. The target sample set is then amplified with the target strategy to obtain a training sample set. Because the target strategy is selected from the plurality of amplification strategies on the basis of the evaluation results, on the one hand the training sample set obtained with the target strategy contains enough samples; on the other hand, the amplified samples it contains carry a low risk of negatively affecting the training of the medical prediction model. A medical prediction model trained on this training sample set can therefore achieve better model performance.

Drawings

The accompanying drawings, which are included to provide a further understanding of the specification and constitute a part of it, illustrate embodiments of the specification and, together with the description, serve to explain the specification without limiting it. In the drawings:

fig. 1 is a schematic flow chart of a medical data processing method provided in an embodiment of the present specification;

fig. 2a is a schematic flow chart illustrating determination of a target strategy in a medical data processing method provided in an embodiment of the present specification;

fig. 2b is a schematic diagram of dividing electrocardiographic data by cardiac cycle in the medical data processing method provided in an embodiment of the present specification;

fig. 3 is a schematic structural diagram of a medical data processing apparatus provided in an embodiment of the present specification;

fig. 4 is a schematic view of an electronic device corresponding to fig. 1 provided in an embodiment of the present specification.

Detailed Description

In order to make the objects, technical solutions and advantages of the present disclosure more clear, the technical solutions of the present disclosure will be clearly and completely described below with reference to the specific embodiments of the present disclosure and the accompanying drawings. It is to be understood that the embodiments described are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments obtained by a person skilled in the art based on the embodiments in the specification without making any creative effort belong to the protection scope of the specification.

The technical solutions provided by the embodiments of the present description are described in detail below with reference to the accompanying drawings.

Data amplification, also called data augmentation, is a data preprocessing technique in deep learning. It increases the number of training samples; training the model on the amplified samples helps prevent overfitting and improves the generalization capability of the model.

In the related art, the data amplification technology has been applied to the technical field of image processing, and also plays a certain positive role in image processing realized by means of artificial intelligence. However, data amplification techniques are rarely applied in the field of processing medical data. The medical data in the present specification comprises at least one of: data collected for a medical event (e.g., an event experienced while treating a patient), data collected for the production and development of a pharmaceutical product (e.g., a protein) that may be used during the medical event.

On the one hand, medical data is difficult to acquire. For example, in a medical diagnosis scenario, acquiring the heart rate of a user requires the user to wear a heart rate acquisition device for a long period of time, which is difficult for most users. As another example, in a scenario of researching protein modification (pharmaceutical product research), the influences of various catalysts and of the environment on the protein need to be collected, and the technical means required for the collection are complex, which also makes medical data collection very difficult.

On the other hand, in both medical diagnosis scenarios and pharmaceutical product research scenarios, the research results are related to the life health of the user to some extent. The amplification sample obtained by the data amplification technique is not a real sample actually collected, and if the amplification means is not appropriate, it may mislead the training of the model, and may cause a large loss.

Furthermore, medical data varies greatly from individual to individual, for example, in a medical diagnosis scenario, different users may exhibit the same signs of illness, but medical data of different users may vary greatly. Under the condition, if the samples adopted in the model training process cannot be ensured to be sufficient, the model performance of the trained model is difficult to ensure.

In view of the above, the present specification provides a medical data processing procedure to at least partially solve the problems of the related art, such as the scarcity of medical data and the difficulty of amplifying it effectively. The execution subject of the medical data processing procedure is a medical data processing apparatus.

Fig. 1 shows a medical data processing procedure provided in an embodiment of the present disclosure, which may specifically include one or more of the following steps:

S100: amplifying the original sample set with each of a plurality of amplification strategies to obtain a plurality of amplification sets.

The process in this specification is directed to a medical data processing scenario, and the samples in the original sample set are derived from historically acquired medical data. The original sample set is used to evaluate each amplification strategy. Schematically, the logic for evaluating the amplification strategies is shown in fig. 2a, and the plurality of amplification strategies referred to in the present specification includes amplification strategy 1 through amplification strategy n.

Illustratively, in the aforementioned medical diagnosis scenario, the medical data may include at least one of: heart rate data, pulse wave data, electrocardiogram data, myoelectric data, blood vessel ultrasonic frequency spectrum waves, blood flow data, blood pressure data, blood sugar data, blood oxygen data, body temperature data and blood cell data. In the aforementioned scenario of studying protein modification, the medical data may include at least one of: the molecular weight of the protein (specifically, the change in molecular weight with the length of modification), and the optical activity of the protein (specifically, the change in optical activity with the length of modification).

The number of samples in the original sample set is not particularly limited in this specification.

The amplification strategies in this specification are used to amplify samples at the stage of selecting the target strategy; at this stage it has not yet been determined which strategy will be adopted for sample amplification when training the medical prediction model. Existing amplification strategies can be used as the amplification strategies in this specification. The amplification strategies at this stage are therefore candidate strategies. The amplification strategies in the present specification may be represented in the form of code.

Specifically, this step first determines one amplification strategy from the plurality of amplification strategies as the first strategy (it may be selected randomly, for example as any one of amplification strategies 1 to n shown in fig. 2a). The original sample set is then amplified with the first strategy to obtain the amplification set corresponding to the first strategy. Thereafter, the first strategy is marked as a second strategy, that is, a strategy that has already been processed. A new first strategy is then determined from the remaining amplification strategies and its amplification set is obtained, and so on until an amplification set has been obtained for every amplification strategy. The amplification strategies and the amplification sets obtained in this step therefore correspond one-to-one, as shown schematically in fig. 2a.
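
The iteration over candidate strategies in step S100 can be sketched as follows. This is a minimal sketch under stated assumptions: `scale_strategy` and `shift_strategy` are placeholder strategies invented for illustration and are not the strategies named in the specification, the strategies are processed in insertion order rather than at random, and an amplification set is assumed to contain the original samples together with the synthetic ones.

```python
from typing import Callable, Dict, List

Series = List[float]
# Hypothetical signature: a strategy takes samples and returns only new synthetic samples.
Strategy = Callable[[List[Series]], List[Series]]


def scale_strategy(samples: List[Series]) -> List[Series]:
    """Placeholder strategy: multiply every point by 2."""
    return [[2.0 * x for x in s] for s in samples]


def shift_strategy(samples: List[Series]) -> List[Series]:
    """Placeholder strategy: add a constant offset to every point."""
    return [[x + 0.5 for x in s] for s in samples]


def build_amplification_sets(
    original: List[Series], strategies: Dict[str, Strategy]
) -> Dict[str, List[Series]]:
    """Return one amplification set per strategy (one-to-one correspondence)."""
    pending = list(strategies)  # strategies that have not been processed yet
    amplification_sets: Dict[str, List[Series]] = {}
    while pending:
        first = pending.pop(0)  # the current "first strategy"
        # Assumption: the set holds the originals plus the synthetic samples.
        amplification_sets[first] = list(original) + strategies[first](original)
    return amplification_sets


original_set = [[1.0, 2.0], [3.0, 4.0]]
amplification_sets = build_amplification_sets(
    original_set, {"scale": scale_strategy, "shift": shift_strategy}
)
```

Keying the result by strategy name keeps the one-to-one correspondence explicit, so that an evaluation score computed for an amplification set can later be attributed to the strategy that produced it.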

The number of samples in each amplification set obtained is not particularly limited in the present specification.

For example, in the foregoing medical diagnosis scenario, the medical data collected for user Zhang San may be used as one sample in the original sample set; the medical data collected for user Li Si may be used as another sample in the original sample set.

In the foregoing scenario of studying protein modification, the medical data collected from the protein sample set numbered A in a certain experiment can be used as one sample in the original sample set; the medical data collected from the protein sample set numbered B in the experiment can be used as another sample in the original sample set; the medical data collected in other experiments may be used as other samples in the original sample set.

That is, in the present specification, the samples in the original sample set may be distinguished according to the acquisition object of the medical data (for example, the aforementioned Zhang San, or protein No. A). In other words, one acquisition object corresponds to one sample in the original sample set.

S102: evaluating the plurality of amplification sets.

The evaluation in this step is intended to determine which amplification strategy is better suited to amplifying samples of medical data. The better an amplification strategy is evaluated, the more suitable it is for amplifying medical data; the worse it is evaluated, the less suitable it is.

In an alternative embodiment of the present disclosure, the quality of the amplification set can be characterized by at least one of the following evaluation indicators: the number of samples in the amplification set, and the model performance of the specified model after training with the amplification set. Schematically, the evaluation results obtained by this step are as shown in fig. 2a as evaluation results 1 to n.

Specifically, the greater the number of samples in an amplification set, the better the amplification set; and/or the better the model performance of the specified model trained with the amplification set, the better the amplification set. For example, the evaluation score of an amplification set can be obtained by a weighted summation of the evaluation indicators of the amplification set. The higher the evaluation score, the better the amplification set.

It should be noted that the specified model in the present specification is used for evaluating the amplification sets. The specified model may be a model with a classification function; for example, it may be any one of: a support vector machine (SVM), a logistic regression model, a decision tree model, a convolutional neural network model, an LSTM model.

In an alternative embodiment of the present disclosure, the weight used in the weighted summation may be a preset value. In another alternative embodiment of the present description, the weights are derived from the number of samples in the original sample set. Specifically, when the number of samples in the original sample set is smaller than a number threshold (preset value), the first weight is adopted as the weight of the number of samples in calculating the evaluation score of the amplification set; and when the number of the samples in the original sample set is not less than the number threshold, adopting a second weight as the weight of the number of the samples in the calculation of the evaluation score of the amplification set, wherein the first weight is greater than the second weight.
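
The weighted summation and the threshold-dependent weight can be sketched as follows. The specification does not state how the sample count is scaled before it is combined with a performance value in [0, 1]; the normalisation by `count_scale` below, like all function and parameter names, is an assumption made only for this illustration.

```python
def sample_count_weight(
    n_original: int, number_threshold: int, first_weight: float, second_weight: float
) -> float:
    """Pick the weight of the sample-count indicator from the original set size."""
    if first_weight <= second_weight:
        raise ValueError("the first weight must be greater than the second weight")
    # A small original set makes the number of amplified samples matter more.
    return first_weight if n_original < number_threshold else second_weight


def evaluation_score(
    n_amplified: int,
    model_performance: float,
    count_weight: float,
    performance_weight: float,
    count_scale: int,
) -> float:
    """Weighted sum of the two evaluation indicators of one amplification set."""
    # Assumption: the count is capped and scaled into [0, 1] before weighting.
    normalized_count = min(n_amplified / count_scale, 1.0)
    return count_weight * normalized_count + performance_weight * model_performance


weight_small = sample_count_weight(50, 100, first_weight=0.4, second_weight=0.2)
weight_large = sample_count_weight(100, 100, first_weight=0.4, second_weight=0.2)
score = evaluation_score(
    n_amplified=500,
    model_performance=0.8,
    count_weight=weight_small,
    performance_weight=0.6,
    count_scale=1000,
)
```

In this example the original set has 50 samples, which is below the threshold of 100, so the larger first weight 0.4 applies and the score is 0.4 × 0.5 + 0.6 × 0.8 = 0.68.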

Optionally, the model performance of the specified model may be characterized by at least one of: accuracy, recall, F1-score. In particular, the model performance of the specified model may be positively correlated with at least one of its accuracy, recall and F1-score.
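
For reference, the three indicators can be computed from true and predicted labels of a binary classifier as sketched below. The function name `classification_metrics` and the convention of returning 0.0 when a ratio is undefined are choices made for this illustration.

```python
from typing import Dict, Sequence


def classification_metrics(
    y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1
) -> Dict[str, float]:
    """Accuracy, recall and F1-score for a binary classification result."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # Undefined ratios (no predicted or no actual positives) are reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "recall": recall, "f1": f1}


metrics = classification_metrics(
    y_true=[1, 1, 1, 0, 0, 0],
    y_pred=[1, 1, 0, 1, 0, 0],
)
```

In the example, two of the three positive samples are found and one negative sample is misclassified as positive, so accuracy, recall and F1-score all equal 2/3.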

S104: based on the evaluation, a target strategy is selected from a plurality of amplification strategies.

The evaluation result obtained for an amplification set indicates the merits of the amplification strategy used to obtain that set, so the evaluation score obtained in the preceding step can be used as the evaluation result.

Based on the one-to-one correspondence between the evaluation scores obtained in the previous steps and the amplification strategies, the amplification strategy with the highest evaluation score can be used as the target strategy. The target strategy is an amplification strategy which is obtained through the evaluation steps and is more suitable for amplifying medical data. Illustratively, in the example shown in fig. 2a, the evaluation result 1 indicates that the evaluation score is the highest, and the amplification strategy 1 can be the target strategy.
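
Selecting the target strategy then reduces to taking the strategy with the highest evaluation score. A minimal sketch follows; the score values and the function name `select_target_strategy` are made up for illustration.

```python
from typing import Dict


def select_target_strategy(scores: Dict[str, float]) -> str:
    """Return the name of the strategy whose amplification set scored highest."""
    if not scores:
        raise ValueError("at least one evaluated strategy is required")
    # On a tie, max() keeps the strategy that was inserted first.
    return max(scores, key=lambda name: scores[name])


evaluation_scores = {"strategy_1": 0.68, "strategy_2": 0.61, "strategy_3": 0.55}
target_strategy = select_target_strategy(evaluation_scores)
```

How ties are broken is not addressed in the specification; keeping the first strategy is only a convenient default of this sketch.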

S106: amplifying the target sample set according to the target strategy to obtain a training sample set.

Compared with other amplification strategies, the target strategy is more suitable for amplifying medical data, so that the medical prediction model can achieve a better training effect by adopting the training sample set obtained by the target strategy.

The target sample set in the present specification is a set of target samples obtained from medical data (not data obtained by amplification) actually collected historically, which is used when training the medical prediction model. The present specification does not specifically limit the number of target samples in the target sample set.

The intersection of the original sample set and the target sample set may be an empty set or a non-empty set.

In an alternative embodiment of the present disclosure, each sample in the target sample set and the sample amplified by the target amplification strategy may be used together as a sample in the training sample set. In this embodiment, the amount of the samples in the training sample set is large, so that the medical prediction model can obtain a good training effect.

In another alternative embodiment of the present specification, the sample amplified by the target amplification strategy may be used as a sample in a training sample set, and the target sample set may be used as a test set for determining whether the model converges. In this embodiment, on the one hand, a training sample set suitable for training the medical prediction model can be obtained, and on the other hand, a convergence condition can also be determined for the medical prediction model.
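
The two ways of composing the training sample set described above can be sketched as follows. The strategy signature (samples in, synthetic samples out), the placeholder `shift_strategy` and the function name `build_training_and_test_sets` are assumptions of this illustration.

```python
from typing import Callable, List, Tuple

Series = List[float]
Strategy = Callable[[List[Series]], List[Series]]


def shift_strategy(samples: List[Series]) -> List[Series]:
    """Placeholder target strategy: add a constant offset to every point."""
    return [[x + 0.5 for x in s] for s in samples]


def build_training_and_test_sets(
    target_set: List[Series], target_strategy: Strategy, include_target_in_training: bool
) -> Tuple[List[Series], List[Series]]:
    """Return (training set, test set) for the two embodiments."""
    synthetic = target_strategy(target_set)
    if include_target_in_training:
        # First embodiment: real and amplified samples are trained on together.
        return list(target_set) + synthetic, []
    # Second embodiment: train on amplified samples, hold out the real ones
    # as a test set used to decide whether the model has converged.
    return synthetic, list(target_set)


target_set = [[1.0, 2.0]]
train_all, test_all = build_training_and_test_sets(target_set, shift_strategy, True)
train_syn, test_real = build_training_and_test_sets(target_set, shift_strategy, False)
```

The first variant maximises the number of training samples; the second keeps the actually collected samples out of training so that convergence is judged on real data.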

S108: training the medical prediction model with the training sample set.

The medical prediction model in this specification is the model actually employed when prediction is performed online. Where conditions allow, any existing model that can be used for prediction may serve as the medical prediction model in this specification. For example, the medical prediction model may be any one of: a support vector machine (SVM), a logistic regression model, a decision tree model, a convolutional neural network model, an LSTM model.

In an alternative embodiment of the present disclosure, the medical prediction model may be trained in a supervised manner. Illustratively, the training sample set includes several training samples and labels in one-to-one correspondence with the training samples. The process of training the medical prediction model may be as follows. At least some of the samples in the training sample set are determined as target samples. The target samples are input into the medical prediction model to obtain a pending result output by the model. The difference between the pending result and the labels corresponding to the target samples is taken as the loss. With loss minimization as the training objective, the parameters of the medical prediction model are adjusted and the model is updated with the adjusted parameters. Target samples are then re-determined from the training sample set and training continues until the obtained loss is less than a loss threshold.

In another alternative embodiment of the present specification, the process of training the medical prediction model may be as follows. At least some of the samples in the training sample set are determined as target samples. The target samples are input into the medical prediction model to obtain a pending result output by the model. The difference between the pending result and the labels corresponding to the target samples is taken as the loss. With loss minimization as the training objective, the parameters of the medical prediction model are adjusted and the model is updated with the adjusted parameters. Then, the samples in the target sample set are input into the parameter-updated medical prediction model, and whether the model performance (e.g., accuracy, recall, F1-score) of the parameter-updated model is greater than a performance threshold is determined from its output. If not, target samples are determined again and training continues; if so, the model is considered to have converged.
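
The two stopping rules described above (loss below a loss threshold, or model performance above a performance threshold) can be sketched with a small logistic regression trained by mini-batch gradient descent. Everything concrete here is an assumption made for illustration: the model type, the log loss, accuracy as the performance measure, the hyperparameters, the toy data and all function names. The specification does not prescribe any of them.

```python
import math
import random
from typing import List, Optional, Tuple

Sample = Tuple[List[float], int]  # (features, label in {0, 1})


def predict_proba(weights: List[float], bias: float, x: List[float]) -> float:
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    z = max(-60.0, min(60.0, z))  # clamp to keep exp() in range
    return 1.0 / (1.0 + math.exp(-z))


def mean_log_loss(weights: List[float], bias: float, batch: List[Sample]) -> float:
    total = 0.0
    for x, y in batch:
        p = min(max(predict_proba(weights, bias, x), 1e-12), 1.0 - 1e-12)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(batch)


def accuracy(weights: List[float], bias: float, samples: List[Sample]) -> float:
    hits = sum(1 for x, y in samples if (predict_proba(weights, bias, x) >= 0.5) == (y == 1))
    return hits / len(samples)


def train(training_set: List[Sample], rng: random.Random, batch_size: int = 4,
          learning_rate: float = 0.5, loss_threshold: Optional[float] = None,
          test_set: Optional[List[Sample]] = None,
          performance_threshold: Optional[float] = None,
          max_rounds: int = 5000) -> Tuple[List[float], float, int]:
    """Return (weights, bias, rounds used); max_rounds is a safety cap."""
    weights, bias = [0.0] * len(training_set[0][0]), 0.0
    for round_index in range(1, max_rounds + 1):
        # Re-determine the batch of samples used in this round.
        batch = rng.sample(training_set, min(batch_size, len(training_set)))
        # Rule 1: stop once the loss on the current batch is below the threshold.
        if loss_threshold is not None and mean_log_loss(weights, bias, batch) < loss_threshold:
            return weights, bias, round_index
        # Adjust the parameters in the direction that reduces the loss.
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in batch:
            error = predict_proba(weights, bias, x) - y
            grad_b += error / len(batch)
            for i, xi in enumerate(x):
                grad_w[i] += error * xi / len(batch)
        weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b
        # Rule 2: stop once performance on held-out samples exceeds the threshold.
        if (test_set is not None and performance_threshold is not None
                and accuracy(weights, bias, test_set) > performance_threshold):
            return weights, bias, round_index
    return weights, bias, max_rounds


training_set = [([-2.0], 0), ([-1.5], 0), ([-1.0], 0), ([-0.5], 0),
                ([0.5], 1), ([1.0], 1), ([1.5], 1), ([2.0], 1)]
held_out = [([-1.2], 0), ([1.2], 1)]
w_loss, b_loss, rounds_loss = train(training_set, random.Random(0), loss_threshold=0.3)
w_perf, b_perf, rounds_perf = train(training_set, random.Random(0),
                                    test_set=held_out, performance_threshold=0.9)
```

The `max_rounds` cap is not part of the described procedure; it is added so that the sketch terminates even if neither threshold is ever reached.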

S110: medical data is acquired.

After the trained medical prediction model with prediction capability is obtained through the above steps, the medical prediction model can be deployed online. The medical data in this step is the data input into the model when prediction is performed online.

It should be noted that there is no fixed execution order between this step and the aforementioned steps S100 to S108.

S112: taking the medical data as the input of the trained medical prediction model, executing the medical prediction model and outputting a medical prediction result.

It should be noted that in an alternative embodiment of the present disclosure, the medical prediction model may be a model that predicts based on a single type of medical data. Taking the above medical diagnosis scenario as an example, the medical prediction model is used for predicting the probability that the user will suffer from heart disease, and the medical data only includes electrocardiographic data.

In yet another alternative embodiment of the present description, the medical prediction model may be a model that predicts based on several types of medical data. Taking the above medical diagnosis scenario as an example, the medical prediction model is used for predicting the probability that the user will suffer from heart diseases, and the medical data includes heart rate data, pulse wave data, electrocardiographic data, and myoelectric data.

In addition, in an alternative embodiment of the present specification, the medical prediction model may predict a single prediction item based on the input medical data. Taking the foregoing medical diagnosis scenario as an example, the medical prediction model is only used for predicting the probability that the user will suffer from heart disease ("predicting that the user will suffer from heart disease" is a prediction item).

In yet another alternative embodiment of the present description, the medical prediction model may predict several prediction items based on the input medical data. Taking the above medical diagnosis scenario as an example, the medical prediction model is used for predicting the probability that the user will suffer from heart disease, the probability that the user suffers from diabetes, and the probability that the user suffers from hypertension.

The medical data processing procedure in the embodiment of the present specification can at least partially solve the technical problem in the related art that insufficient training samples degrade the data processing performance of a model. The medical data processing method evaluates a plurality of amplification strategies and determines from among them a target strategy that is better suited to amplification. The target sample set is then amplified with the target strategy to obtain a training sample set. Because the target strategy is selected from the plurality of amplification strategies on the basis of the evaluation results, on the one hand the training sample set obtained with the target strategy contains enough samples; on the other hand, the amplified samples it contains carry a low risk of negatively affecting the training of the medical prediction model. A medical prediction model trained on this training sample set can therefore achieve better model performance.

In the data processing process, the involved models comprise a specified model and a medical prediction model. The relationship between the two models will now be described.

The medical prediction model in the present specification may be a model of the same type as the aforementioned specified model, or may be a model of a different type from the aforementioned specified model.

For example, in one alternative embodiment, the specified model and the medical prediction model are both support vector machines. In this embodiment, since the specified model and the medical prediction model are models of the same type, the training effect of an amplification set on the specified model reflects the quality of the amplification set more accurately.

In yet another alternative embodiment, the specified model is a support vector machine and the medical prediction model is an LSTM model. In this embodiment, since the structural complexity of the support vector machine is lower than that of the LSTM, it is faster to evaluate the amplification set by the specified model, and the efficiency of the data processing process implemented by the technical solution in this specification can be effectively improved.

How the specified model is obtained will now be described.

The specified model in this specification is a model for evaluating an amplification set. In an optional embodiment of the present specification, each parameter of the to-be-determined model may be randomly initialized to obtain the specified model. The implementation of the process of obtaining the specified model is simple and convenient.

However, the shortage of training samples in the related art stems, to some extent, from a shortage of medical data. The same shortage affects this specification: the original sample set used to evaluate the amplification sets may itself contain too few samples. The specified model makes it possible to evaluate the strengths and weaknesses of different amplification strategies efficiently even when the original sample set contains a limited number of samples. In another optional embodiment of this specification, the to-be-determined model is trained with the original sample set to obtain the specified model. A specified model obtained in this way has already learned, to some extent, the knowledge for prediction contained in the original sample set.

If an amplification strategy is well suited to amplifying the original sample set, the knowledge exhibited by its amplification set matches the knowledge exhibited by the original sample set. Because the specified model has already learned some of that knowledge, training the specified model with this amplification set converges faster.

If an amplification strategy is not suited to amplifying the original sample set, the erroneous knowledge exhibited by its amplification set does not match the knowledge exhibited by the original sample set. Training the specified model, which has already learned some knowledge, with this amplification set then converges more slowly.

Therefore, a specified model obtained by training on the original sample set improves the efficiency of evaluating the amplification sets.

As can be seen from the foregoing, the processes in this specification operate on medical data, which may come from more than one source. On the one hand, if the medical prediction model must learn the knowledge for prediction from medical data of different sources, a large number of samples is required; on the other hand, medical data from different sources characterize that knowledge along different dimensions and in somewhat different ways, and this property of multi-source medical data can be exploited.

Medical data usually exhibits time-series characteristics according to the order in which it is acquired (for example, if electrocardiographic data is acquired every 5 seconds, the electrocardiographic data acquired at different times and arranged by acquisition time forms time-series data). Illustratively, in the foregoing medical diagnosis scenario, the prediction result output by the medical prediction model indicates the probability that the user to whom the input medical data belongs has a certain disease. The electrocardiographic data included in the medical data is periodic, whereas the blood pressure data shows almost no periodicity, so the periodicity exhibited by the data can be used to improve the amplification of the medical data.

In an optional embodiment of the present specification, after the original sample data used for constructing the original sample set is obtained, the original sample data exhibiting periodic characteristics is determined therefrom as the specified data. Then, for each piece of specified data, the data period of the specified data is determined, and the specified data is divided using the time length of the data period as the step length to obtain a plurality of division results. Taking electrocardiographic data as an example, the data is first checked for abnormal waveforms; abnormal waveforms are classified as abnormal and normal waveforms as normal. If a normal waveform covers 5 seconds and the data period of the electrocardiographic data is 1 second, 5 division results are obtained for that normal waveform. Exemplarily, the division results obtained from the original sample data are shown in Fig. 2a.
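The division step can be illustrated with a minimal sketch. The function name `split_by_period`, the 250 Hz sampling rate and the choice to drop a trailing partial period are assumptions made for illustration; they are not taken from the specification.

```python
def split_by_period(values, sampling_rate_hz, period_seconds):
    """Split a periodic recording into consecutive one-period segments."""
    # Number of data points covered by one data period (the step length).
    step = int(round(sampling_rate_hz * period_seconds))
    if step <= 0:
        raise ValueError("the data period must cover at least one data point")
    # Assumption: only complete periods are kept; a trailing partial period is dropped.
    return [values[i:i + step] for i in range(0, len(values) - step + 1, step)]


# A 5-second normal waveform (hypothetical 250 Hz sampling) with a 1-second data period.
waveform = [0.0] * (5 * 250)
segments = split_by_period(waveform, sampling_rate_hz=250, period_seconds=1.0)
print(len(segments))  # 5
```

Each returned segment would then serve as one sample of the original sample set.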

Then, each division result is used as a sample in an original sample set, and a piece of non-specified data (original sample data which does not present periodic characteristics in the original sample set) is used as a sample in the original sample set, so as to obtain the original sample set.

The data period in this description may be determined by a data acquisition period of an acquisition device acquiring medical data. Further, the process in this specification is directed to processing medical data, and in a medical diagnosis scenario, a physiological cycle of a user to which original sample data belongs may be used as a data cycle. For example, where the medical data is heart rate data, the data period may be a heart rate period (which is a physiological period).

The medical prediction model trained by the process in this specification is used for prediction based on medical data, for example, predicting the probability that a certain disease will occur. Equivalently, the medical prediction model can also predict the probability that the disease will not occur. As can be seen from the foregoing analysis, medical data often varies from person to person. If the medical prediction model is trained only with positive samples, it may not fully learn the differences between the sources of the samples, which affects the prediction result it outputs.

To enable the medical prediction model to learn the predictive ability comprehensively, in an alternative embodiment of the present description, the original sample set contains positive samples and negative samples. For example, in the foregoing medical diagnosis scenario, if the medical prediction model is used to predict whether a user will suffer from disease C, the historically collected medical data of user Wang Wu, who suffers from disease C, may be used as a positive sample for predicting disease C, and the medical data of user Zhao Liu, who does not suffer from disease C, may be used as a negative sample for predicting disease C.

In practice, the number of positive samples and the number of negative samples in the original sample set may differ greatly. In the present specification, of the positive samples and the negative samples, the class with fewer samples is taken as the first samples and the class with more samples is taken as the second samples.

Thereafter, when performing the amplification of step S100 for the ith (any one) amplification strategy, the first sample set composed of the first samples in the original sample set may be amplified by the ith amplification strategy to obtain a first intermediate sample set. It is determined whether the difference between the number of samples in the first intermediate sample set and the number of second samples in the original sample set is greater than a first number threshold.

If so, the first intermediate sample set is taken as a new first sample set, and the new first sample set is amplified again with the ith amplification strategy, until the difference between the number of samples in the newly obtained first intermediate sample set and the number of second samples in the original sample set is not greater than the first number threshold.

If not, each sample in the first intermediate sample set and each second sample in the original sample set are taken as the samples of the amplification set corresponding to the ith amplification strategy.

It can be seen that, by the "positive and negative sample equalization" strategy in this embodiment, the difference in the number of positive and negative samples is not too large in a certain amplification set.
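The equalization loop can be sketched as follows. This is only an illustration under stated assumptions: `amplify` stands in for the ith amplification strategy, the "difference" is taken as the number of second samples minus the number of samples in the first intermediate sample set, and the doubling strategy in the usage lines is hypothetical.

```python
def build_balanced_amplification_set(first_samples, second_samples, amplify,
                                     first_number_threshold):
    """Amplify the minority ("first") samples until the class sizes are close."""
    first_set = list(first_samples)
    while True:
        # Amplify the current first sample set to get the first intermediate sample set.
        intermediate = amplify(first_set)
        if len(intermediate) <= len(first_set):
            raise ValueError("the amplification strategy must add samples")
        # Assumption: difference = (number of second samples) - (intermediate size).
        if len(second_samples) - len(intermediate) > first_number_threshold:
            first_set = intermediate  # still too few first samples: amplify again
            continue
        # Amplification set = amplified first samples + original second samples.
        return list(intermediate) + list(second_samples)


# Hypothetical strategy: keep every sample and add a slightly shifted copy of it.
def double_with_shift(samples):
    return samples + [value + 0.01 for value in samples]


positives = [1.0, 2.0, 3.0]   # first samples (fewer)
negatives = [0.0] * 20        # second samples (more)
amplification_set = build_balanced_amplification_set(
    positives, negatives, double_with_shift, first_number_threshold=2)
print(len(amplification_set))  # 44
```

With these inputs the first sample set grows from 3 to 6, 12 and then 24 samples before the loop stops.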

Continuing with the above medical diagnosis scenario, suppose that disease C and disease D are related diseases (for example, disease C is hypertension and disease D is hypotension). If Wang Wu has disease C but does not have disease D, the medical data of Wang Wu is a positive sample for predicting disease C and a negative sample for predicting disease D. That is, one sample in this specification may be a positive sample for one prediction item and a negative sample for another prediction item.

Then, a new target item is determined from among the prediction items, and this is repeated until the above "positive and negative sample equalization" strategy has been performed for every prediction item.

The processes in this specification make it possible to determine, from among multiple amplification strategies, the strategy that is better suited to amplifying medical data. Since medical data generally has the characteristics of time-series data, any strategy from the related art for amplifying time-series data can be used as an amplification strategy in this specification.

In addition, the present specification also provides the following two amplification strategies for selection.

Amplification strategy 1: amplification strategies based on the mean sequence method.

Illustratively, the medical data collected by the protein sample set numbered A is recorded as sample X, and the medical data collected by the protein sample set numbered B is recorded as sample Y. The time-series data of sample X has length m and the time-series data of sample Y has length n, so that

$$X = \{x_1, x_2, \ldots, x_m\}, \qquad Y = \{y_1, y_2, \ldots, y_n\}.$$

The DTW (dynamic time warping) distance between any two time points on sample X and sample Y is:

$$\gamma(i,j) = \left\{ \left[ d(x_i, y_j) \right]^2 + \left\{ \min\left[ \gamma(i-1,j-1),\ \gamma(i-1,j),\ \gamma(i,j-1) \right] \right\}^2 \right\}^{1/2}$$

where $d(x_i, y_j)$ is the distance between the two data sequence points $x_i$ and $y_j$, and $\gamma(i,j)$ is the minimum cumulative distance from element $(x_1, y_1)$ to element $(x_i, y_j)$. The DTW distance warps the time axis: by locally shifting, stretching and compressing it, the local features of two sequences can be compared, similarity can be measured between data sequences of different lengths, and the measure is robust to disturbances along the time axis.
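The recurrence can be evaluated with a small dynamic-programming table. The sketch below assumes scalar samples and takes d as the absolute difference between a point of X and a point of Y; the function name `dtw_distance` is illustrative.

```python
import math


def dtw_distance(x, y):
    """DTW distance with gamma(i,j) = sqrt(d^2 + min(three predecessors)^2)."""
    m, n = len(x), len(y)
    # gamma[i][j]: minimum cumulative distance from (x_1, y_1) to (x_i, y_j).
    # Row 0 and column 0 are padding; only gamma[0][0] is reachable.
    gamma = [[math.inf] * (n + 1) for _ in range(m + 1)]
    gamma[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = abs(x[i - 1] - y[j - 1])  # assumed point-wise distance
            best_previous = min(gamma[i - 1][j - 1],
                                gamma[i - 1][j],
                                gamma[i][j - 1])
            gamma[i][j] = math.sqrt(d * d + best_previous * best_previous)
    return gamma[m][n]


# Sequences of different lengths can still be compared.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
```

Because the square root is monotonic, this recurrence selects the same alignment as accumulating squared point-wise distances and taking one square root at the end.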

In many application scenarios, an average time series is used to represent a set of multiple time series. However, the sequences in the set are not always of equal length and their features are shifted along the time axis, so computing the average sequence point by point rarely gives an ideal result. This specification therefore performs data enhancement based on the DTW Barycenter Averaging (DBA) algorithm. DBA is an iterative algorithm that uses the dynamic time warping distance as the optimization index for solving the average sequence. The DBA algorithm comprises the following two steps: (1) randomly select a time series as the initial sequence and calculate the DTW distance between this sequence and every sequence in the target set, thereby finding the association between each time point on the initial sequence and the time points on the other sequences; (2) group each time point on the initial sequence with its associated time points, calculate the average value of each group, and update the initial sequence with the result. This process is repeated until the updated average time series no longer changes.
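The two DBA steps can be sketched in plain Python for scalar time series. This is an illustrative sketch, not the specification's implementation: the helper names, the squared point-wise cost, the iteration cap and the tolerance are assumptions, and the initial sequence is passed in explicitly so that the example is reproducible (the text above selects it at random).

```python
import math


def dtw_path(a, b):
    """Return the DTW alignment between a and b as a list of (i, j) index pairs."""
    m, n = len(a), len(b)
    cost = [[math.inf] * (n + 1) for _ in range(m + 1)]
    cost[0][0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end to recover which time points were associated.
    path, i, j = [], m, n
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(moves, key=lambda move: move[0])
    return path[::-1]


def dba(sequences, initial, max_iterations=30, tolerance=1e-9):
    """Iteratively refine an average sequence with DTW Barycenter Averaging."""
    average = list(initial)
    for _ in range(max_iterations):
        # Step (1): associate each point of the average with points of every sequence.
        groups = [[] for _ in average]
        for sequence in sequences:
            for i, j in dtw_path(average, sequence):
                groups[i].append(sequence[j])
        # Step (2): replace each point of the average by the mean of its group.
        updated = [sum(group) / len(group) for group in groups]
        if max(abs(u - v) for u, v in zip(updated, average)) <= tolerance:
            return updated  # the average sequence no longer changes
        average = updated
    return average


target_set = [[0.0, 0.0, 0.0], [2.0, 2.0, 2.0]]
print(dba(target_set, initial=target_set[0]))  # [1.0, 1.0, 1.0]
```

The resulting average sequence has the length of the initial sequence and could be added to the amplification set as a new sample.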

Amplification strategy 2: amplification strategies based on signal decomposition and reconstruction.

Since the samples in the original sample set have the properties of time-series data, the original samples may be regarded as signals. A signal is decomposed into different components, and new time series are generated by changing the combination and the weights of the components. For example, the signal is decomposed and reconstructed using a method based on the wavelet transform. In essence, wavelet-based signal decomposition is the cross-correlation of the signal with a filter bank, and reconstruction is the convolution of the decomposed signal with a mirror filter bank. First, the original signal is low-pass and high-pass filtered; down-sampling the low-pass output gives the average part of the signal, and down-sampling the high-pass output gives the detail part of the signal. Then, the coefficients of the average part and the detail part are each up-sampled, passed through low-pass and high-pass filtering respectively, and added to obtain the reconstructed signal of the original signal.
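A one-level decomposition and reconstruction can be sketched with the Haar wavelet, the simplest filter pair. The choice of the Haar wavelet, the single decomposition level and the `detail_weight` parameter are assumptions for illustration; the specification does not name a particular wavelet.

```python
import math

SQRT2 = math.sqrt(2.0)


def haar_decompose(signal):
    """One-level Haar analysis: low-pass/high-pass filtering plus down-sampling."""
    if len(signal) % 2:
        raise ValueError("signal length must be even")
    pairs = range(0, len(signal), 2)
    average_part = [(signal[k] + signal[k + 1]) / SQRT2 for k in pairs]
    detail_part = [(signal[k] - signal[k + 1]) / SQRT2 for k in pairs]
    return average_part, detail_part


def haar_reconstruct(average_part, detail_part):
    """One-level Haar synthesis: up-sample both parts, filter, and add."""
    signal = []
    for a, d in zip(average_part, detail_part):
        signal.append((a + d) / SQRT2)
        signal.append((a - d) / SQRT2)
    return signal


def amplify_by_reweighting(signal, detail_weight):
    """Generate a new series by rescaling the detail component before reconstruction."""
    average_part, detail_part = haar_decompose(signal)
    return haar_reconstruct(average_part, [detail_weight * d for d in detail_part])


original = [1.0, 3.0, 2.0, 6.0]
new_sample = amplify_by_reweighting(original, detail_weight=0.5)
```

With `detail_weight=1.0` the original signal is recovered up to rounding; other weights give variants that keep the average part and change only the fine detail.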

In addition, other amplification strategies can be applied to the present specification, and are not described herein.

In an alternative embodiment of the present description, in order for the medical prediction model to learn the knowledge in the training samples more efficiently, the training samples in the training sample set are sorted in advance (for example, in the chronological order in which the data was generated) to obtain a pending sample sequence. Then, a first sequence step length is determined according to the number of training samples in the training sample set; the first sequence step length is positively correlated with the number of training samples. The pending sample sequence is divided according to the first sequence step length to obtain a plurality of subsequences arranged in a specified order, such that the length of each subsequence equals the first sequence step length, the similarity between training samples within each subsequence is greater than a first similarity threshold, and the similarity between two training samples belonging respectively to any two adjacent subsequences is less than a second similarity threshold. When the medical prediction model is trained, the samples in the subsequences are input into the medical prediction model for training sequentially in the specified order.
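The division into fixed-length subsequences can be sketched as below. The square-root rule for the first sequence step length is purely hypothetical (the text only requires a positive correlation with the number of training samples), the similarity thresholds are not checked, and returning the leftover samples separately is an assumption.

```python
import math


def first_sequence_step(num_training_samples):
    """Hypothetical rule: the step length grows with the number of training samples."""
    return max(1, math.isqrt(num_training_samples))


def split_pending_sequence(pending_sequence):
    """Cut the sorted sample sequence into subsequences of equal length."""
    step = first_sequence_step(len(pending_sequence))
    usable = len(pending_sequence) - len(pending_sequence) % step
    subsequences = [pending_sequence[i:i + step] for i in range(0, usable, step)]
    # Assumption: samples that do not fill a whole subsequence are returned separately.
    remainder = pending_sequence[usable:]
    return subsequences, remainder


# Ten training samples, already sorted by generation time (represented by indices).
subsequences, remainder = split_pending_sequence(list(range(10)))
print(subsequences)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
print(remainder)     # [9]
```

A fuller implementation would also verify the two similarity thresholds before accepting a division.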

As can be seen from the foregoing, medical data differs by data source. Processing the training samples as in this embodiment lets the medical prediction model learn, in a targeted way, the knowledge contained in the samples of a given subsequence. Because the samples in two adjacent subsequences differ considerably, the contrast between the knowledge they embody is sharper, which improves the learning efficiency of the model.

In an alternative embodiment of the present specification, the first sequence step length is also positively correlated with the ratio, in the training sample set, of samples derived from periodic medical data to samples derived from aperiodic medical data. That is, the larger the proportion of training samples derived from periodic medical data, the longer the first sequence step length, so that the medical prediction model learns the knowledge in those training samples more fully.

As can be seen from the foregoing, whether the medical data is periodic data affects how the data is processed in the subsequent steps. It is therefore necessary to determine whether the medical data is periodic data before certain processing is performed on it.

In an alternative embodiment of the present description, in order to distinguish periodic data from non-periodic data, the medical data is transformed into frequency domain data using a Fourier transform, either before or after the medical data is pre-processed. This transformation can be implemented using the following equation (1).

where $F(\omega)$ represents the transformed frequency domain data, $\omega$ represents the frequency, $t$ represents the time, and $e^{-i\omega t}$ represents the complex exponential function.
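Equation (1) itself does not appear in this text. Assuming it is the standard continuous Fourier transform, and writing $f(t)$ for the time-domain medical data (a symbol introduced here for illustration), the transform consistent with the symbols defined above would be:

$$F(\omega) = \int_{-\infty}^{\infty} f(t)\, e^{-i\omega t}\, dt$$

For sampled medical data, the integral would in practice be replaced by a discrete Fourier transform over the acquired data points.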

Then, three indexes, namely a form factor, a crest factor and a pulse factor, are calculated from the transformed frequency domain data as the features for judging how strongly periodic the data is. The form factor $C_s$ is the ratio of the root mean square to the rectified mean. The root mean square $X_{rms}$, also called the effective value, is obtained by summing the squares of all the values, calculating the mean, and then taking the square root. The form factor $C_s$ can be calculated with the following equation (2). The crest factor $C_p$ is the ratio of the signal peak value to the root mean square and represents how extreme the peak in the waveform is. The crest factor $C_p$ can be calculated with the following equation (3). The pulse factor $C_{if}$ is the ratio of the signal peak value to the rectified mean (the mean of the absolute values). The pulse factor $C_{if}$ can be calculated with the following equation (4).
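The three ratios follow directly from the verbal definitions above and can be sketched as below. Taking the peak as the largest absolute value, and applying the factors to the magnitude spectrum of a naive discrete Fourier transform, are assumptions made for this example; equations (2) to (4) are not reproduced in this text.

```python
import cmath
import math


def magnitude_spectrum(samples):
    """Naive discrete Fourier transform; returns the magnitude of each component."""
    n = len(samples)
    return [abs(sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(n)]


def periodicity_factors(values):
    """Form, crest and pulse factors of a sequence of real values."""
    n = len(values)
    rms = math.sqrt(sum(v * v for v in values) / n)   # root mean square
    rectified_mean = sum(abs(v) for v in values) / n  # mean of absolute values
    if rectified_mean == 0.0:
        raise ValueError("factors are undefined for an all-zero sequence")
    peak = max(abs(v) for v in values)                # assumed definition of the peak
    return {
        "form": rms / rectified_mean,
        "crest": peak / rms,
        "pulse": peak / rectified_mean,
    }


# Hypothetical periodic signal: four full sine cycles over 32 samples.
signal = [math.sin(2 * math.pi * t / 8) for t in range(32)]
factors = periodicity_factors(magnitude_spectrum(signal))
```

By construction the pulse factor equals the product of the form factor and the crest factor, which gives a quick consistency check on any implementation.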

After each factor is obtained, it is normalized using the transformed frequency domain data (specifically, each factor is divided by the first component of the transformed frequency domain data) to obtain the processed factors, namely the processed form factor $C_s'$, the processed crest factor $C_p'$ and the processed pulse factor $C_{if}'$.

For a piece of medical data, if its processed form factor $C_s'$, processed crest factor $C_p'$ and processed pulse factor $C_{if}'$ satisfy the periodic data condition, the medical data is periodic data; if the periodic data condition is not satisfied, the medical data is not periodic data. The periodic data condition can be characterized by equation (5).

In the formula, a₁ is the first coefficient, a₂ is the second coefficient, and a₃ is the third coefficient. The first coefficient is less than or equal to the second coefficient, the third coefficient is greater than the second coefficient, and the sum of the first, second and third coefficients is less than 1. Illustratively, the first coefficient is equal to 0.2, the second coefficient is equal to 0.2, and the third coefficient is equal to 0.5.

Based on the same idea, the embodiment of the present specification further provides a medical data processing apparatus corresponding to the process shown in fig. 1, and the medical data processing apparatus is shown in fig. 3.

Fig. 3 is a schematic structural diagram of a medical data processing apparatus provided in an embodiment of the present specification, where the medical data processing apparatus may include one or more of the following modules:

an amplification set generation module 300 configured to: respectively amplifying the original sample sets by adopting a plurality of amplification strategies to obtain a plurality of amplification sets;

an evaluation module 302 configured to: evaluating the plurality of amplification sets;

a target policy determination module 304 configured to: selecting a target strategy from a plurality of amplification strategies based on the evaluation;

a training sample set determination module 306 configured to: amplifying the target sample set according to a target amplification strategy to obtain a training sample set;

a training module 308 configured to: training the medical prediction model by utilizing a training sample set;

a medical data acquisition module 310 configured to: acquiring medical data;

a prediction module 312 configured to: take the medical data as the input of the trained medical prediction model, execute the medical prediction model, and output a medical prediction result.

In an alternative embodiment of the present disclosure, the evaluation module 302 is specifically configured to: training a designated model respectively by using the plurality of amplification sets; and obtaining the evaluation results of the multiple amplification sets according to the model performance of the trained designated model.

In an alternative embodiment of the present disclosure, the specified model is trained by using the original sample set.

In an alternative embodiment of the present description, the model performance is characterized by at least one of: accuracy, recall, F1-score.
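As an illustration of how these performance measures could drive the choice of the target strategy, the sketch below computes accuracy, recall and F1-score for binary labels and picks the strategy whose trained specified model scores highest. The function names, the strategy names and the prediction lists are hypothetical, and selecting by F1-score alone is an assumption.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, recall and F1-score for binary labels (1 = positive, 0 = negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "recall": recall, "f1": f1}


def select_target_strategy(results, metric="f1"):
    """Return the name of the amplification strategy with the best metric value."""
    return max(results, key=lambda name: results[name][metric])


# Hypothetical predictions of the specified model after training on each amplification set.
labels = [1, 1, 1, 0, 0]
predictions_by_strategy = {
    "average_sequence": [1, 1, 0, 0, 1],
    "signal_decomposition": [1, 1, 1, 0, 0],
}
results = {name: classification_metrics(labels, predicted)
           for name, predicted in predictions_by_strategy.items()}
print(select_target_strategy(results))  # signal_decomposition
```

In practice the predictions would come from a held-out part of the original sample set rather than from hand-written lists.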

In an optional embodiment of the present specification, the apparatus further includes an original sample processing module configured to obtain original sample data; if the original sample data presents periodic characteristics, constructing the original sample set according to a set formed by dividing results obtained by dividing the original sample data according to the data period of the original sample data; and if the original sample data does not present periodic characteristics, taking a set formed by the original sample data as the original sample set.

In an alternative embodiment of the present specification, the original sample set comprises positive samples and negative samples, one of the positive samples and the negative samples is a first sample, and the other one of the positive samples and the negative samples is a second sample, and the number of the first samples is smaller than the number of the second samples. The amplification set generation module 300 is specifically configured to: respectively amplifying a first sample set consisting of first samples in the original sample set by adopting the multiple amplification strategies, so that the difference between the number of the first samples in the amplified first sample set and the number of second samples in the original sample set is not greater than a first number threshold; and amplifying a sample set consisting of the amplified first sample and the second sample in the original sample set by adopting the multiple amplification strategies to obtain multiple amplification sets.

In an alternative embodiment of the present specification, the original sample data in the original sample set is constituted by a time series.

In an alternative embodiment of the present description, the plurality of amplification strategies comprises: an amplification strategy based on an average sequence solution and an amplification strategy based on signal decomposition and reconstruction.

In an alternative embodiment of the present description, the medical data comprises at least one of: heart rate data, pulse wave data, electrocardiogram data, myoelectric data, blood vessel ultrasonic frequency spectrum waves, blood flow data, blood pressure data, blood sugar data, blood oxygen data, body temperature data and blood cell data.

In an alternative embodiment of the present description, the specified model is any one of: a support vector machine, a logistic regression model, a decision tree model, a convolutional neural network model, and an LSTM model.

Embodiments of the present description also provide a computer-readable storage medium storing a computer program, which can be used to execute the process of model training provided in fig. 1.

The embodiment of the present specification also provides a schematic structural diagram of the electronic device shown in fig. 4. As shown in fig. 4, at the hardware level, the electronic device may include a processor, an internal bus, a network interface, a memory, and a non-volatile memory, and may also include hardware required for other services. The processor reads the corresponding computer program from the nonvolatile memory into the memory and then runs the computer program to realize the process of training any model.

Of course, besides the software implementation, the present specification does not exclude other implementations, such as a combination of logic devices or software and hardware, and the like, that is, the execution subject of the following processing flow is not limited to each logic unit, and may also be hardware or a logic device.

In the 1990s, an improvement to a technology could be clearly distinguished as either an improvement in hardware (for example, an improvement to a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement to a method flow). However, as technology has advanced, many of today's method-flow improvements can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. It therefore cannot be said that an improvement to a method flow cannot be realized with hardware physical modules. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user programming the device. A designer programs a digital system onto a single PLD without asking a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, this programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must likewise be written in a specific programming language, called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are the most commonly used at present. It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can be readily obtained merely by lightly programming the method flow in one of the above hardware description languages and programming it into an integrated circuit.

The controller may be implemented in any suitable manner. For example, the controller may take the form of a microprocessor or processor together with a computer-readable medium storing computer-readable program code (e.g., software or firmware) executable by the (micro)processor, logic gates, switches, an Application Specific Integrated Circuit (ASIC), a programmable logic controller, or an embedded microcontroller. Examples of controllers include, but are not limited to, the following microcontrollers: ARC 625D, Atmel AT91SAM, Microchip PIC18F26K20, and Silicon Labs C8051F320. A memory controller may also be implemented as part of the control logic of a memory. Those skilled in the art will also appreciate that, in addition to implementing the controller as pure computer-readable program code, the same functionality can be implemented by logically programming the method steps so that the controller takes the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers and the like. Such a controller may thus be considered a hardware component, and the means included therein for performing the various functions may also be considered structures within the hardware component. The means for performing the functions may even be regarded both as software modules for performing the method and as structures within the hardware component.

The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

For convenience of description, the above devices are described as being divided into various units by function, and are described separately. Of course, the functions of the various elements may be implemented in the same one or more software and/or hardware implementations of the present description.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

As will be appreciated by one skilled in the art, embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, the description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

This description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

The embodiments in the present specification are described in a progressive manner. For parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiment is substantially similar to the method embodiment, it is described only briefly; for the relevant points, reference may be made to the corresponding parts of the method embodiment.

The above description is only an example of the present specification, and is not intended to limit the present specification. Various modifications and alterations to this description will become apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present specification should be included in the scope of the claims of the present specification.

Claims

1. A method of medical data processing, comprising:

amplifying an original sample set with each of a plurality of amplification strategies to obtain a plurality of amplification sets;

evaluating the plurality of amplification sets;

selecting a target strategy from the plurality of amplification strategies based on the evaluation results;

amplifying a target sample set according to the target strategy to obtain a training sample set;

training a medical prediction model by using the training sample set;

acquiring medical data;

and taking the medical data as the input parameter of the trained medical prediction model, executing the medical prediction model and outputting a medical prediction result.
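The selection steps of claim 1 can be illustrated with a minimal Python sketch. It is not the claimed implementation: the two strategies (`add_jitter`, `reverse_time`), the scoring function `count_distinct`, and the helper `select_target_strategy` are hypothetical names introduced here, and the original sample set is assumed to double as the target sample set.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Sample = List[float]
Strategy = Callable[[Sequence[Sample], random.Random], List[Sample]]


def add_jitter(samples: Sequence[Sample], rng: random.Random) -> List[Sample]:
    """Hypothetical strategy: append a slightly noisy copy of every sample."""
    noisy = [[x + rng.gauss(0.0, 0.01) for x in s] for s in samples]
    return [list(s) for s in samples] + noisy


def reverse_time(samples: Sequence[Sample], rng: random.Random) -> List[Sample]:
    """Hypothetical strategy: append a time-reversed copy of every sample."""
    return [list(s) for s in samples] + [list(reversed(s)) for s in samples]


def select_target_strategy(
    original: Sequence[Sample],
    strategies: Dict[str, Strategy],
    evaluate: Callable[[List[Sample]], float],
    seed: int = 0,
) -> Tuple[str, Dict[str, float]]:
    """Amplify `original` with each strategy, score each amplification set,
    and return the best-scoring strategy name together with all scores."""
    scores = {}
    for name, strategy in strategies.items():
        amplified = strategy(original, random.Random(seed))
        scores[name] = evaluate(amplified)
    best = max(scores, key=scores.get)
    return best, scores


def count_distinct(samples: List[Sample]) -> float:
    """Toy evaluation (an assumption): number of distinct samples in the set."""
    return float(len({tuple(s) for s in samples}))


original = [[0.0, 1.0, 2.0], [1.0, 1.0, 1.0]]
strategies = {"jitter": add_jitter, "reverse": reverse_time}
best, scores = select_target_strategy(original, strategies, count_distinct)
# Amplify the target sample set (here assumed equal to `original`) with the winner.
training_set = strategies[best](original, random.Random(1))
```

In practice the toy score would be replaced by an evaluation tied to model performance, and the resulting `training_set` would be passed to whatever training routine the medical prediction model uses.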

2. The method of claim 1, wherein evaluating the plurality of amplification sets comprises:

training a designated model respectively by using the plurality of amplification sets;

and obtaining the evaluation results of the multiple amplification sets according to the model performance of the trained designated model.
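One way to read claim 2 is sketched below in Python: a designated model is trained on each amplification set, and its accuracy on a held-out validation set serves as that set's evaluation result. The nearest-centroid classifier is a deliberately simple stand-in chosen for this sketch, not one of the model types named in the claims, and the held-out validation set, the accuracy metric, and all function names are assumptions.

```python
from typing import Dict, List, Tuple

Labeled = Tuple[List[float], int]  # (feature vector, class label)


def train_centroids(samples: List[Labeled]) -> Dict[int, List[float]]:
    """Fit the stand-in designated model: one mean vector per class label."""
    sums: Dict[int, List[float]] = {}
    counts: Dict[int, int] = {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(model: Dict[int, List[float]], features: List[float]) -> int:
    """Return the label whose centroid is nearest in squared Euclidean distance."""
    def dist(label: int) -> float:
        return sum((a - b) ** 2 for a, b in zip(model[label], features))
    return min(model, key=dist)


def accuracy(model: Dict[int, List[float]], validation: List[Labeled]) -> float:
    """Fraction of validation samples the model labels correctly."""
    correct = sum(1 for f, y in validation if predict(model, f) == y)
    return correct / len(validation)


def evaluate_amplification_sets(
    amplification_sets: Dict[str, List[Labeled]], validation: List[Labeled]
) -> Dict[str, float]:
    """Train one model per amplification set and report its validation accuracy."""
    return {
        name: accuracy(train_centroids(samples), validation)
        for name, samples in amplification_sets.items()
    }


validation = [([0.1], 0), ([0.9], 1), ([0.2], 0), ([0.8], 1)]
amplification_sets = {
    "strategy_a": [([0.0], 0), ([0.2], 0), ([0.8], 1), ([1.0], 1)],
    "strategy_b": [([0.0], 0), ([1.4], 0), ([0.9], 1), ([1.1], 1)],
}
results = evaluate_amplification_sets(amplification_sets, validation)
```

With this toy data, the amplification set that distorts the class-0 samples yields a lower accuracy, so the strategy that produced it would not be selected.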

3. The method of claim 2, wherein the method comprises at least one of:

the designated model is obtained by training on the original sample set;

4. The method of claim 1, wherein the method further comprises:

obtaining original sample data;

if the original sample data presents periodic characteristics, dividing the original sample data according to the data period of the original sample data, and constructing the original sample set from the set of resulting segments;

and if the original sample data does not present periodic characteristics, taking a set formed by the original sample data as the original sample set.
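The branching in claim 4 can be sketched in Python as follows. The claim does not say how periodicity is detected, so this sketch assumes the caller supplies the period in samples (or `None` when the data is not periodic); dropping an incomplete trailing cycle is likewise an assumption, and `build_original_sample_set` is a hypothetical name.

```python
from typing import List, Optional, Sequence


def build_original_sample_set(
    data: Sequence[float], period: Optional[int]
) -> List[List[float]]:
    """Split `data` into period-length segments when a period is known;
    otherwise return the whole recording as a single sample."""
    if period is None or period <= 0:
        return [list(data)]
    # Assumption: an incomplete trailing cycle is discarded.
    full = len(data) - len(data) % period
    return [list(data[i:i + period]) for i in range(0, full, period)]


periodic = build_original_sample_set([1, 2, 3, 4, 5, 6, 7], period=3)
non_periodic = build_original_sample_set([1, 2, 3, 4, 5, 6, 7], period=None)
```

Here `periodic` holds two three-point segments and `non_periodic` holds the full seven-point recording as one sample.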

5. The method of claim 4, wherein the data period comprises any one of: a physiological cycle of the user to whom the original sample data belongs, and a data acquisition cycle used when the original sample data is acquired.

6. The method of claim 1, wherein the original set of samples contains positive and negative samples, one of the positive and negative samples being a first sample and the other being a second sample, the number of first samples being less than the number of second samples; wherein the method further comprises:

respectively amplifying a first sample set consisting of first samples in the original sample set by adopting the multiple amplification strategies, so that the difference between the number of the first samples in the amplified first sample set and the number of second samples in the original sample set is not greater than a first number threshold;

and amplifying a sample set consisting of the amplified first sample and the second sample in the original sample set by adopting the multiple amplification strategies to obtain multiple amplification sets.
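The balancing step of claim 6 can be sketched in Python under stated assumptions: the minority ("first") samples are amplified one at a time until their count is within the first number threshold of the majority ("second") count. The per-sample strategy `jitter_one` and the helper `balance_first_samples` are hypothetical, and the claim does not prescribe this one-at-a-time loop.

```python
import random
from typing import Callable, List, Sequence

Sample = List[float]


def jitter_one(sample: Sample, rng: random.Random) -> Sample:
    """Hypothetical per-sample strategy: add small Gaussian noise."""
    return [x + rng.gauss(0.0, 0.01) for x in sample]


def balance_first_samples(
    first: Sequence[Sample],
    second: Sequence[Sample],
    strategy: Callable[[Sample, random.Random], Sample],
    threshold: int,
    rng: random.Random,
) -> List[Sample]:
    """Amplify the minority class until
    len(second) - len(amplified first) <= threshold."""
    if not first:
        raise ValueError("cannot amplify an empty first sample set")
    amplified = [list(s) for s in first]
    while len(second) - len(amplified) > threshold:
        source = rng.choice(first)  # pick an original minority sample
        amplified.append(strategy(source, rng))
    return amplified


first = [[0.0, 1.0]]
second = [[1.0, 0.0]] * 5
balanced_first = balance_first_samples(
    first, second, jitter_one, threshold=1, rng=random.Random(0)
)
# The combined set below is what each strategy would then amplify further.
combined = balanced_first + [list(s) for s in second]
```

Because samples are added one at a time, the loop stops as soon as the gap reaches the threshold and never overshoots the majority count.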

7. The method of claim 1, wherein the method comprises at least one of:

the original sample data in the original sample set consists of time series;

the designated model is any one of: a support vector machine, a logistic regression model, a decision tree model, a convolutional neural network model, and an LSTM model.

8. A model training apparatus, the apparatus comprising:

a medical data acquisition module configured to: acquiring medical data;

9. A computer-readable storage medium, characterized in that the storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1-7.

10. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of any of claims 1-7 when executing the program.