Regression analysis method for medical big data
Technical Field
The invention relates to the field of data processing, in particular to a regression analysis method for medical big data.
Background
Global medical and health data currently amounts to hundreds of exabytes (EB) and is growing at an accelerating rate. Big data is changing medical research and practice, from the rapid identification and establishment of large-scale research cohorts to artificial-intelligence-assisted clinical decision support systems. From a technology perspective, the rapid development of the medical big data industry is driven mainly by technical progress and increasing commercialization in three areas: medical informatization, gene sequencing, and intelligent health devices. First, the level of medical informatization is continuously improving, and systems such as HIS, CIS, and PACS are widely deployed. CHIMA statistics show that 70-80% of hospitals in China have implemented information management systems, concentrated in tertiary medical institutions, and the large volume of medical data accumulated in this way provides a foundation for building algorithms. Second, second-generation gene sequencing technology has rapidly reduced sequencing costs to the order of $1,000, with throughput far higher than that of first-generation sequencing; its growing use accelerates the accumulation of biological data and brings value to clinical operations and to basic research and development. Third, intelligent health-management hardware such as smart bracelets, watches, and body fat scales is spreading rapidly; these devices track patients' health signs continuously and in real time and allow useful value to be mined from the data, which supports the development of medical big data.
In addition, technologies such as data fusion, data visualization, image recognition and processing, machine learning, and artificial intelligence are continuously improving, providing the underlying technical support for the development of medical big data.
Regression analysis is a statistical method for determining the quantitative relationships among two or more interdependent variables, and it is very widely applied. According to the number of variables involved, regression analysis is divided into univariate and multivariate regression analysis; according to the number of independent variables, into simple and multiple regression analysis; and according to the type of relationship between the independent and dependent variables, into linear and nonlinear regression analysis. If a regression analysis includes only one independent variable and one dependent variable, and their relationship can be approximated by a straight line, it is called univariate linear regression analysis. If it includes two or more independent variables and the dependent variable is linearly related to them, it is called multiple linear regression analysis.
The relevant existing work is as follows. Patent CN201710612240.0 proposes a method for analyzing multimodal big data for medical institutions, aimed mainly at the multimodal patient data held in hospital databases. The method considers data from multiple modalities together, avoids the transmission-network limitations that arise in traditional data analysis, and ensures real-time feedback of user information. By combining the established multidimensional partial least squares model with a convolutional neural network, it reduces information loss, yields a stable prediction model, and provides hospitals with more detailed and accurate analysis reports.
Patent CN201811570429.9 discloses a feature extraction and intelligent analysis and prediction method for medical big data, which comprises the following steps: data cleaning; data vectorization; case mining and feature mining; deep neural network model training; prediction of disease diagnosis, treatment, and cure rate; and model analysis and verification. Patent CN201910030377.4 discloses a big-data-based system for the accurate analysis of medical insurance business data, comprising a data source module, a data analysis module, and an analysis result output module. By collecting and analyzing users' living habits and medical records, the system improves the accuracy of risk prediction, assists medical insurance sales personnel in their work, and raises the success rate of sales promotion. These patents only present a system framework or simply use the most recent regression algorithm.
Disclosure of Invention
The invention provides a regression analysis method for medical big data, which solves the problems of overly complex models and large deviations in the prior art, and specifically comprises the following steps:
step 1: preprocessing the medical big data, determining the variable to be predicted and the input variables according to the target task, and establishing a training sample set $\{(x_i, y_i) \mid i = 1, 2, \ldots, m\}$, where $m$ represents the total number of training data, $x_i \in \mathbb{R}^d$ is a $d$-dimensional vector, $y_i \in \mathbb{R}$ is a scalar, $(x_i, y_i)$ is the $i$-th training sample, and $\mathbb{R}$ represents the real number domain;
step 2: initializing the regression model by setting the weight of the $i$-th sample in step $g$ [formula missing in source], where $G$ is the maximum number of training rounds, $\delta$ is the maximum allowable error, $g$ is set to 1, and $m$ is the total number of samples;
step 3: learning a support vector regression model $f_g(x)$ in step $g$ and calculating the error rate $r_g$ of $f_g(x)$ [formula missing in source]; if $r_g > 50\%$, performing the assignment [formula missing in source] and jumping to step 6;
step 4: setting the model weights [formula missing in source] and setting the sample weights [formula missing in source];
step 5: if $g < G$, increasing $g$ by 1 and jumping to step 2; otherwise performing the assignment [formula missing in source] and jumping to step 6;
step 6: obtaining the integrated regression model $F(x)$ [formula missing in source], and predicting the data lacking labels by using $F(x)$, where $x$ is a sample.
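The formulas belonging to steps 2 to 6 are not legible in this text. For orientation only, one common way to instantiate a procedure of this shape follows the AdaBoost.RT family of boosting schemes, here with an absolute-error threshold. The expressions below are an assumption borrowed from that family and are not confirmed to be the formulas of the invention; the symbols $w_i^g$ (sample weight), $\beta_g$ and $Z_g$ (normalizing constant) are introduced purely for illustration.

```latex
\begin{aligned}
w_i^{1} &= \frac{1}{m}, \qquad i = 1, \ldots, m \\
r_g &= \sum_{i \,:\, |f_g(x_i) - y_i| > \delta} w_i^{g} \\
\beta_g &= r_g \\
w_i^{g+1} &= \frac{w_i^{g}}{Z_g} \times
  \begin{cases}
    \beta_g, & |f_g(x_i) - y_i| \le \delta \\
    1, & \text{otherwise}
  \end{cases} \\
F(x) &= \frac{\sum_{g} \ln(1/\beta_g)\, f_g(x)}{\sum_{g} \ln(1/\beta_g)}
\end{aligned}
```

Under this assumed reading, samples that a model already predicts within $\delta$ lose weight, so later models concentrate on the harder samples, and models with a lower error rate contribute more to $F(x)$.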
Wherein, the support vector regression model involved in step 3 has the form $f(x) = W^{T}\phi(x) + b$, where $W$ represents the normal vector, $b$ is the displacement term, and $\phi(x)$ is the mapping associated with the kernel function, which maps $x$ into another space; the optimal $W$ and $b$ are obtained by optimizing the following function: [formula missing in source]
wherein $C > 0$ is the penalty coefficient, $\xi_i$ and $\xi_i^{*}$ are slack variables, $\epsilon > 0$ is the maximum allowed deviation, the superscript $T$ denotes transposition, and $P$ is the objective function.
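The objective function itself did not survive in this text. The symbols that are defined ($C$, the two slack variables, $\epsilon$, $P$) match the standard $\epsilon$-insensitive support vector regression problem, so a plausible reading, stated here as an assumption rather than as the formula of the invention, is the usual primal form below. The notation $\xi_i^{*}$ for the second slack variable is likewise assumed.

```latex
\begin{aligned}
\min_{W,\, b,\, \xi,\, \xi^{*}} \quad
  & P = \frac{1}{2} W^{T} W + C \sum_{i=1}^{m} \left( \xi_i + \xi_i^{*} \right) \\
\text{s.t.} \quad
  & y_i - W^{T}\phi(x_i) - b \le \epsilon + \xi_i, \\
  & W^{T}\phi(x_i) + b - y_i \le \epsilon + \xi_i^{*}, \\
  & \xi_i \ge 0, \quad \xi_i^{*} \ge 0, \qquad i = 1, \ldots, m
\end{aligned}
```

In this form, deviations smaller than $\epsilon$ cost nothing, and $C$ sets the trade-off between a small $W^{T}W$ and the total slack.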
Compared with the prior art, the invention has the following advantages: the complexity of the model can be obviously reduced, and the generalization performance of regression can be greatly improved.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
A regression analysis method for medical big data, as shown in FIG. 1, specifically comprises the following steps:
step 1: preprocessing the medical big data, determining the variable to be predicted and the input variables according to the target task, and establishing a training sample set $\{(x_i, y_i) \mid i = 1, 2, \ldots, m\}$, where $m$ represents the total number of training data, $x_i \in \mathbb{R}^d$ is a $d$-dimensional vector, $y_i \in \mathbb{R}$ is a scalar, $(x_i, y_i)$ is the $i$-th training sample, and $\mathbb{R}$ represents the real number domain;
step 2: initializing the regression model by setting the weight of the $i$-th sample in step $g$ [formula missing in source], where $G$ is the maximum number of training rounds, $\delta$ is the maximum allowable error, $g$ is set to 1, and $m$ is the total number of samples;
step 3: learning a support vector regression model $f_g(x)$ in step $g$ and calculating the error rate $r_g$ of $f_g(x)$ [formula missing in source]; if $r_g > 50\%$, performing the assignment [formula missing in source] and jumping to step 6;
step 4: setting the model weights [formula missing in source] and setting the sample weights [formula missing in source];
step 5: if $g < G$, increasing $g$ by 1 and jumping to step 2; otherwise performing the assignment [formula missing in source] and jumping to step 6;
step 6: obtaining the integrated regression model $F(x)$ [formula missing in source], and predicting the data lacking labels by using $F(x)$, where $x$ is a sample.
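The control flow of steps 2 to 6 can be illustrated with a minimal Python sketch. Several parts are assumptions, because the corresponding formulas are not legible in this text: uniform initial weights, an error rate equal to the total weight of samples whose absolute error exceeds delta, and AdaBoost.RT-style weight updates. To stay dependency-free, a weighted least-squares line stands in for the support vector regression learner of step 3, and after each round the loop returns to training rather than to re-initialization. All function names are hypothetical.

```python
import math

def fit_weighted_line(xs, ys, ws):
    """Weighted least-squares line; a stand-in for the SVR learner of step 3."""
    sw = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / sw
    my = sum(w * y for w, y in zip(ws, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

def boost_regressors(xs, ys, G=10, delta=1.0, fit=fit_weighted_line):
    """Return (F, alphas): the combined predictor and the model weights."""
    m = len(xs)
    w = [1.0 / m] * m                        # step 2: assumed uniform weights
    models, alphas = [], []
    for _ in range(G):                       # steps 3-5: at most G rounds
        f = fit(xs, ys, w)                   # step 3: train on weighted data
        ok = [abs(f(x) - y) <= delta for x, y in zip(xs, ys)]
        r = sum(wi for wi, good in zip(w, ok) if not good)  # error rate r_g
        if r > 0.5:                          # step 3: learner too weak, stop
            break
        beta = max(r, 1e-10)                 # assumed beta_g = r_g, clipped
        models.append(f)
        alphas.append(math.log(1.0 / beta))  # step 4: assumed model weight
        # step 4: shrink the weight of well-predicted samples, renormalize
        w = [wi * beta if good else wi for wi, good in zip(w, ok)]
        total = sum(w)
        w = [wi / total for wi in w]
    if not models:
        raise ValueError("no base model reached an error rate of 50% or less")
    norm = sum(alphas)

    def F(x):                                # step 6: weighted combination
        return sum(a * f(x) for a, f in zip(alphas, models)) / norm

    return F, alphas

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0, 14.0]         # the last target is an outlier
F, alphas = boost_regressors(xs, ys, G=5, delta=1.0)
print(len(alphas), F(2.0))
```

Any learner that accepts per-sample weights can be passed through the `fit` parameter, so a support vector regressor could replace the stand-in without changing the loop.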
Preferably, the support vector regression model involved in step 3 has the form $f(x) = W^{T}\phi(x) + b$, where $W$ represents the normal vector, $b$ is the displacement term, and $\phi(x)$ is the mapping associated with the kernel function, which maps $x$ into another space; the optimal $W$ and $b$ are obtained by optimizing the following function: [formula missing in source]
wherein $C > 0$ is the penalty coefficient, $\xi_i$ and $\xi_i^{*}$ are slack variables, $\epsilon > 0$ is the maximum allowed deviation, the superscript $T$ denotes transposition, and $P$ is the objective function.
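To show how the penalty coefficient, the slack variables and the maximum allowed deviation interact, the sketch below evaluates an objective of the standard epsilon-insensitive form, P = 0.5 * W^T W + C * (sum of slacks), for a given W and b. This form is an assumption, since the objective is not legible in this text, and the identity mapping is used in place of the kernel-induced mapping to keep the example short. The function name is hypothetical.

```python
def svr_objective(W, b, X, y, C=1.0, eps=0.1):
    """Evaluate P for fixed W and b, using the smallest feasible slacks.

    Assumes the standard epsilon-insensitive SVR objective and an
    identity feature map, i.e. f(x) = W . x + b.
    """
    total_slack = 0.0
    for x_i, y_i in zip(X, y):
        pred = sum(w * v for w, v in zip(W, x_i)) + b
        residual = y_i - pred
        xi = max(0.0, residual - eps)        # target above the eps-tube
        xi_star = max(0.0, -residual - eps)  # target below the eps-tube
        total_slack += xi + xi_star
    return 0.5 * sum(w * w for w in W) + C * total_slack

X = [[0.0], [1.0], [2.0]]
y = [1.0, 3.05, 6.0]
# The first two targets lie inside the tube; only the third incurs slack.
print(svr_objective([2.0], 1.0, X, y, C=10.0, eps=0.1))
```

Deviations smaller than eps contribute nothing to P, which is why the second sample, off by 0.05, adds no slack.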
The above examples are provided only for the purpose of describing the present invention, and are not intended to limit the scope of the present invention. The scope of the invention is defined by the appended claims. Various equivalent substitutions and modifications can be made without departing from the spirit and principles of the invention, and are intended to be within the scope of the invention.