CN110634565A - Regression analysis method for medical big data

Info

Publication number: CN110634565A
Application number: CN201910878524.3A
Authority: CN
Inventors: 梅明亮
Original assignee: Anhui Wei Aumann Robot Co Ltd
Current assignee: Shenzhen Weike Technology Co ltd
Priority date: 2019-09-18
Filing date: 2019-09-18
Publication date: 2019-12-31
Anticipated expiration: 2039-09-18
Also published as: CN110634565B

Abstract

The invention discloses a regression analysis method for medical big data, comprising the following steps: 1) preprocessing the medical big data, determining the variable to be predicted and the input variables according to the target task, and establishing a training set; 2) initializing the regression model; 3) learning a support vector regression model at step g; 4) calculating the model weight and the sample weights; 5) and 6) integrating and outputting the model. The invention addresses the problems of excessive model complexity and large deviation in the prior art. Its advantages are that model complexity can be markedly reduced and the generalization performance of the regression can be greatly improved.

Description

Regression analysis method for medical big data

Technical Field

The invention relates to the field of data processing, in particular to a regression analysis method for medical big data.

Background

Global medical and health data currently amounts to hundreds of exabytes (EB) and is growing at an accelerating rate. Big data is changing medical research and practice, from the rapid identification and establishment of large-scale research cohorts to artificial-intelligence-assisted clinical decision support systems. From a technology perspective, the rapid development of the medical big data industry benefits mainly from technical progress and increasing commercialization in three areas: medical informatization, gene sequencing, and intelligent health devices. First, the level of medical informatization keeps improving, and systems such as HIS, CIS and PACS are widely deployed. CHIMA statistics show that 70-80% of hospitals in China have implemented information management systems, concentrated in tertiary medical institutions, and the large volume of medical data accumulated in this way provides a foundation for building algorithms. Second, second-generation gene sequencing has rapidly reduced sequencing costs to the order of $1,000, with throughput far higher than first-generation sequencing; its growing range of applications accelerates the accumulation of biological data and brings value to clinical practice and to basic research and development. Third, health-management hardware such as smart bracelets, watches and body fat scales is spreading rapidly; these devices track patients' health signs continuously and in real time and allow useful value to be mined from the data, which supports the development of medical big data.

In addition, continuing improvements in data fusion, data visualization, image recognition and processing, machine learning, artificial intelligence and related technologies provide underlying technical support for the development of medical big data.

Regression analysis is a statistical method for determining the quantitative relationship between two or more interdependent variables, and it is very widely used. According to the number of independent variables involved, it is divided into simple (one-variable) regression analysis and multiple regression analysis; according to the type of relationship between the independent and dependent variables, it is divided into linear and nonlinear regression analysis. If a regression analysis includes only one independent variable and one dependent variable, and their relationship can be approximated by a straight line, it is called univariate linear regression analysis. If it includes two or more independent variables and the relationship between the dependent variable and the independent variables is linear, it is called multiple linear regression analysis.

Current related work is as follows. Patent CN201710612240.0 proposes a method for analyzing multimodal big data for medical institutions, aimed mainly at the multimodal data of patients in hospital databases. The method considers data from several modalities together, avoids the transmission-network limitations encountered in traditional data analysis, and ensures real-time feedback of user information. It combines a multidimensional partial least squares model with a convolutional neural network, which reduces information loss, yields a stable prediction model, and gives the hospital a more detailed and accurate analysis report. Patent CN201811570429.9 discloses a feature extraction and intelligent analysis and prediction method for big medical data, comprising the following steps: data cleaning, data vectorization, case mining and feature mining, deep neural network model training, prediction of disease diagnosis, treatment and cure rate, and analysis and verification of the model. Patent CN201910030377.4 discloses a big-data-based system for accurate analysis of medical insurance business data, comprising a data source module, a data analysis module and an analysis result output module; by collecting and analyzing users' living habits and medical records, the system improves the accuracy of risk prediction, assists medical insurance sales staff, and raises the sales success rate. These patents, however, only present a system framework or simply use the most recent regression algorithms.
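As a small illustration of the univariate linear regression described above, the following sketch fits a straight line by ordinary least squares. The function name `fit_univariate` and the toy dose/response data are hypothetical and chosen only for this example; they do not come from the patent.

```python
def fit_univariate(xs, ys):
    """Ordinary least squares for y ~ a * x + b with one independent variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the intercept follows from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    a = sxy / sxx
    b = mean_y - a * mean_x
    return a, b

# Hypothetical toy data: one independent variable and one dependent variable.
doses = [1.0, 2.0, 3.0, 4.0]
responses = [3.0, 5.0, 7.0, 9.0]  # lies exactly on y = 2x + 1
slope, intercept = fit_univariate(doses, responses)
```

With more than one independent variable the same idea generalizes to multiple linear regression, where the slope becomes a coefficient vector.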

Disclosure of Invention

The invention provides a regression analysis method for medical big data that addresses the problems of excessive model complexity and large deviation in the prior art. The method comprises the following steps:

Step 1: preprocess the medical big data, determine the variable to be predicted and the input variables according to the target task, and establish a training sample set $\{(x_i, y_i) \mid i = 1, 2, \ldots, m\}$, where $m$ is the total number of training samples, $x_i \in \mathbb{R}^d$ is a d-dimensional vector, $y_i \in \mathbb{R}$ is a scalar, $(x_i, y_i)$ is the i-th training sample, and $\mathbb{R}$ denotes the real number field;
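Step 1 does not prescribe a particular preprocessing procedure. The sketch below shows one possible way to turn raw records into the training set of input vectors and scalar targets, assuming records are dictionaries, incomplete records are dropped, and inputs are standardized. The function `build_training_set` and the field names (`age`, `bmi`, `systolic_bp`) are hypothetical.

```python
def build_training_set(records, input_fields, target_field):
    """Return (X, y): d-dimensional input vectors x_i and scalar targets y_i."""
    # Assumed cleaning rule: keep only records where every needed field is present.
    needed = input_fields + [target_field]
    clean = [r for r in records if all(r.get(f) is not None for f in needed)]
    X = [[float(r[f]) for f in input_fields] for r in clean]
    y = [float(r[target_field]) for r in clean]
    # Standardize each input column to zero mean and unit variance.
    m, d = len(X), len(input_fields)
    for j in range(d):
        col = [row[j] for row in X]
        mean = sum(col) / m
        std = (sum((v - mean) ** 2 for v in col) / m) ** 0.5 or 1.0
        for row in X:
            row[j] = (row[j] - mean) / std
    return X, y

# Hypothetical records; the second one is dropped because an input is missing.
records = [
    {"age": 50, "bmi": 24.0, "systolic_bp": 130},
    {"age": 60, "bmi": None, "systolic_bp": 140},
    {"age": 70, "bmi": 30.0, "systolic_bp": 150},
]
X, y = build_training_set(records, ["age", "bmi"], "systolic_bp")
```

Here the target task is assumed to be predicting `systolic_bp` from `age` and `bmi`, so m = 2 and d = 2 after cleaning.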

Step 2: initialize the regression model. [Symbol missing in source] denotes the weight of the i-th sample at step g, G is the maximum number of training rounds, and δ is the maximum allowable error. Let g = 1 and [formula missing in source], where m is the total number of samples;

Step 3: learn the support vector regression model $f^g(x)$ of step g and calculate the error rate $r^g$ of $f^g(x)$:

[formula missing in source]

where [formula missing in source]. If $r^g > 50\%$, [formula missing in source] and jump to step 6;

Step 4: set the model weight:

[formula missing in source]

and set the sample weights:

[formula missing in source]

where [formula missing in source];

Step 5: if g < G, increase g by 1 and jump to step 2; otherwise, [formula missing in source] and jump to step 6;

Step 6: obtain the integrated regression model $F(x)$ as follows:

[formula missing in source]

and use $F(x)$ to predict the data lacking labels, where $x$ is a sample.
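The formulas for the error rate, the model weight, the sample weights and the integrated model are not reproduced in this text, so the loop in steps 2 to 6 can only be sketched under assumptions. The sketch below uses a generic AdaBoost-style scheme with an error threshold δ: uniform initial weights 1/m, an error rate equal to the total weight of samples whose absolute error exceeds δ, a model weight of 0.5·ln((1 − r)/r), exponential re-weighting of samples, and a weight-averaged ensemble. All of these are assumptions rather than the patent's formulas. The base learner `fit_weighted_line` is a simple weighted least-squares stand-in, not a support vector regression model, and the loop returns to the training step without re-initializing the weights, which is one interpretation of step 5.

```python
import math

def fit_weighted_line(X, y, w):
    """Stand-in base learner (NOT an SVR): weighted least squares on feature 0."""
    sw = sum(w)
    mx = sum(wi * xi[0] for wi, xi in zip(w, X)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi[0] - mx) ** 2 for wi, xi in zip(w, X))
    sxy = sum(wi * (xi[0] - mx) * (yi - my) for wi, xi, yi in zip(w, X, y))
    a = sxy / sxx if sxx > 0 else 0.0
    return lambda x: a * (x[0] - mx) + my

def boost(X, y, G, delta, fit=fit_weighted_line):
    m = len(X)
    w = [1.0 / m] * m                        # step 2 (assumed): uniform weights
    models, alphas = [], []
    for g in range(1, G + 1):                # step 5: at most G rounds
        f = fit(X, y, w)                     # step 3: train the g-th base model
        miss = [abs(f(xi) - yi) > delta for xi, yi in zip(X, y)]
        r = sum(wi for wi, bad in zip(w, miss) if bad)  # assumed error rate r^g
        if r > 0.5:                          # step 3: stop and go to step 6
            if not models:                   # keep one model so F is defined
                models.append(f)
                alphas.append(1.0)
            break
        r = min(max(r, 1e-10), 1.0 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1.0 - r) / r)    # step 4 (assumed): model weight
        models.append(f)
        alphas.append(alpha)
        # Step 4 (assumed): up-weight badly predicted samples, then renormalize.
        w = [wi * math.exp(alpha if bad else -alpha) for wi, bad in zip(w, miss)]
        total = sum(w)
        w = [wi / total for wi in w]
    total_alpha = sum(alphas)

    def F(x):                                # step 6 (assumed): weighted average
        return sum(a * f(x) for a, f in zip(alphas, models)) / total_alpha
    return F

X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]                     # toy data on the line y = 2x + 1
F = boost(X, y, G=3, delta=0.1)
prediction = F([4.0])
```

Any learner that accepts sample weights can be passed as `fit`, which is where a weighted support vector regression model would be plugged in.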

The support vector regression model involved in step 3 has the form $f(x) = W^T \phi(x) + b$, where $W$ is the normal vector, $b$ is the displacement (bias) term, and $\phi(x)$ is the mapping associated with the kernel function, which maps $x$ into another space. The optimal $W$ and $b$ are obtained by optimizing the following function:

[formula missing in source]

where $C > 0$ is the penalty coefficient, $\xi_i$ and [symbol missing in source] are slack variables, $\epsilon > 0$ is the maximum allowed deviation, the superscript $T$ denotes transposition, and $P$ is the objective function.
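The symbols defined above match the textbook primal problem of ε-insensitive support vector regression. That standard formulation is given below for reference only, as an assumption about the general shape of the objective. The notation $\xi_i^*$ for the second slack variable is chosen here, and the patent's own formula may differ, for instance by scaling each slack term with the corresponding sample weight.

```latex
\min_{W,\,b,\,\xi,\,\xi^*} \; P = \frac{1}{2} W^T W + C \sum_{i=1}^{m} \left( \xi_i + \xi_i^* \right)
\quad \text{s.t.} \quad
\begin{cases}
y_i - W^T \phi(x_i) - b \le \epsilon + \xi_i, \\
W^T \phi(x_i) + b - y_i \le \epsilon + \xi_i^*, \\
\xi_i \ge 0,\; \xi_i^* \ge 0, \quad i = 1, \ldots, m.
\end{cases}
```

In this form, deviations of at most $\epsilon$ incur no penalty, and larger deviations are penalized linearly through the slack variables, with $C$ controlling the trade-off against the size of $W$.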

Compared with the prior art, the invention has the following advantages: the complexity of the model can be obviously reduced, and the generalization performance of regression can be greatly improved.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.

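A regression analysis method for medical big data, as shown in FIG. 1, specifically comprises the following steps:

Step 1: preprocess the medical big data, determine the variable to be predicted and the input variables according to the target task, and establish a training sample set $\{(x_i, y_i) \mid i = 1, 2, \ldots, m\}$, where $m$ is the total number of training samples, $x_i \in \mathbb{R}^d$ is a d-dimensional vector, $y_i \in \mathbb{R}$ is a scalar, $(x_i, y_i)$ is the i-th training sample, and $\mathbb{R}$ denotes the real number field;

Step 2: initialize the regression model. [Symbol missing in source] denotes the weight of the i-th sample at step g, G is the maximum number of training rounds, and δ is the maximum allowable error. Let g = 1 and [formula missing in source], where m is the total number of samples;

Step 3: learn the support vector regression model $f^g(x)$ of step g and calculate the error rate $r^g$ of $f^g(x)$:

[formula missing in source]

where [formula missing in source]. If $r^g > 50\%$, [formula missing in source] and jump to step 6;

Step 4: set the model weight:

[formula missing in source]

and set the sample weights:

[formula missing in source]

where [formula missing in source];

Step 5: if g < G, increase g by 1 and jump to step 2; otherwise, [formula missing in source] and jump to step 6;

Step 6: obtain the integrated regression model as follows:

[formula missing in source]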

Preferably, the support vector regression model involved in step 3 has the form $f(x) = W^T \phi(x) + b$, where $W$ is the normal vector, $b$ is the displacement (bias) term, and $\phi(x)$ is the mapping associated with the kernel function, which maps $x$ into another space. The optimal $W$ and $b$ are obtained by optimizing the following function:

[formula missing in source]

where $C > 0$ is the penalty coefficient, $\xi_i$ and [symbol missing in source] are slack variables, $\epsilon > 0$ is the maximum allowed deviation, the superscript $T$ denotes transposition, and $P$ is the objective function.
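To show how a model of the form $f(x) = W^T x + b$ can be fitted with an ε-insensitive penalty, here is a minimal sketch assuming a linear kernel (φ(x) = x) and assuming the sample weights scale each sample's penalty. It minimizes 0.5·‖W‖² + C·Σ wᵢ·max(0, |f(xᵢ) − yᵢ| − ε) by plain subgradient descent, which is a simple substitute for the quadratic-programming solvers normally used for support vector regression. The function name `fit_linear_svr` and its default parameters are hypothetical.

```python
def fit_linear_svr(X, y, sample_w, C=10.0, eps=0.1, lr=0.001, epochs=5000):
    """Subgradient descent on 0.5*||W||^2 + C*sum_i w_i*max(0, |f(x_i)-y_i| - eps)."""
    d = len(X[0])
    W, b = [0.0] * d, 0.0
    for _ in range(epochs):
        gW, gb = list(W), 0.0               # gradient of 0.5*||W||^2 is W itself
        for xi, yi, wi in zip(X, y, sample_w):
            resid = sum(Wj * xj for Wj, xj in zip(W, xi)) + b - yi
            if abs(resid) > eps:            # outside the eps-tube: penalty is active
                s = C * wi * (1.0 if resid > 0 else -1.0)
                for j in range(d):
                    gW[j] += s * xi[j]
                gb += s
        W = [Wj - lr * gj for Wj, gj in zip(W, gW)]
        b -= lr * gb
    return W, b

X = [[0.0], [1.0], [2.0], [3.0]]
y = [1.0, 3.0, 5.0, 7.0]                    # toy data on the line y = 2x + 1
W, b = fit_linear_svr(X, y, sample_w=[0.25] * 4)
predictions = [W[0] * xi[0] + b for xi in X]
```

Because errors inside the ε-tube cost nothing, the regularization term pulls $W$ slightly below the least-squares slope, so the fitted line tracks the data only to within roughly ε.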

The above examples are provided only for the purpose of describing the present invention, and are not intended to limit the scope of the present invention. The scope of the invention is defined by the appended claims. Various equivalent substitutions and modifications can be made without departing from the spirit and principles of the invention, and are intended to be within the scope of the invention.

Claims

1. A regression analysis method for medical big data is characterized by comprising the following steps:

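Step 1: preprocess the medical big data, determine the variable to be predicted and the input variables according to the target task, and establish a training sample set $\{(x_i, y_i) \mid i = 1, 2, \ldots, m\}$, where $m$ is the total number of training samples, $x_i \in \mathbb{R}^d$ is a d-dimensional vector, $y_i \in \mathbb{R}$ is a scalar, $(x_i, y_i)$ is the i-th training sample, and $\mathbb{R}$ denotes the real number field;

Step 2: initialize the regression model. [Symbol missing in source] denotes the weight of the i-th sample at step g, G is the maximum number of training rounds, and δ is the maximum allowable error. Let g = 1 and [formula missing in source], where m is the total number of samples;

Step 3: learn the support vector regression model $f^g(x)$ of step g and calculate the error rate $r^g$ of $f^g(x)$:

[formula missing in source]

where [formula missing in source]. If $r^g > 50\%$, [formula missing in source] and jump to step 6;

Step 4: set the model weight:

[formula missing in source]

and set the sample weights:

[formula missing in source]

where [formula missing in source];

Step 5: if g < G, increase g by 1 and jump to step 2; otherwise, [formula missing in source] and jump to step 6;

Step 6: obtain the integrated regression model as follows:

[formula missing in source]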

2. The regression analysis method for medical big data according to claim 1, wherein the support vector regression model involved in step 3 has the form $f(x) = W^T \phi(x) + b$, where $W$ is the normal vector, $b$ is the displacement (bias) term, and $\phi(x)$ is the mapping associated with the kernel function, which maps $x$ into another space, and the optimal $W$ and $b$ are obtained by optimizing the following function:

[formula missing in source]

where $C > 0$ is the penalty coefficient, $\xi_i$ and