CN107169284A

CN107169284A - A biomedical key attribute selection method

Info

Publication number: CN107169284A
Application number: CN201710332543.7A
Authority: CN
Inventors: 罗森林; 潘丽敏; 张岳峰; 胡雅娴
Original assignee: Beijing Institute of Technology BIT
Current assignee: Beijing Institute of Technology BIT
Priority date: 2017-05-12
Filing date: 2017-05-12
Publication date: 2017-09-15

Abstract

The present invention relates to a biomedical key attribute selection method and belongs to the technical field of biomedicine. The method first analyzes the importance of the candidate attributes with the Boruta algorithm and extracts the attributes that are important to the research target. It then builds a logistic regression model from the candidate attributes and performs stepwise regression under the AIC criterion to obtain the attributes that significantly affect the research target. Finally, the attributes screened by the two methods are fused by taking their intersection, with reference to expert opinion, to obtain the final key attributes. Because the two selection methods differ markedly from each other, the invention avoids the limitations of any single method and improves the generalizability of the key attributes.

Description

A biomedical key attribute selection method

Technical field

The present invention relates to a biomedical key attribute selection method. In terms of application, it belongs to the technical field of biomedicine; in terms of technical implementation, it also belongs to the fields of computer science and bioinformatics.

Background art

In recent years biomedical research has developed rapidly. In particular, improvements in measuring instruments and the spread of hospital information systems allow large amounts of medical information to be recorded accurately, which has led to explosive growth in medical data. While this rich and complex data gives researchers ample material, it also makes analysis and processing more challenging. The data mining process comprises data acquisition, data preprocessing, knowledge mining, model evaluation, and knowledge application. It handles massive data well and can extract potentially useful knowledge from it. Because of these characteristics, data mining has been applied to biomedical research ever since it was proposed and has achieved considerable success.

In biomedical research, information is often collected without a specific research target, so the raw data set contains a large number of attributes. Attribute selection must therefore be performed on the raw data before analysis to obtain a representative attribute subset. Its main purposes are: to remove irrelevant and redundant attributes and improve storage efficiency; to remove collinear and noisy attributes and reduce their interference with data analysis; to improve the generalization performance and running efficiency of the model; and to obtain a simpler, more understandable learning model, improving interpretability.

Although many attribute selection methods now exist for biomedical research, no single method suits every problem. According to their evaluation criteria, attribute selection algorithms fall into two main classes:

1. Filter attribute selection (Filter)

Filter attribute selection is computationally efficient. It derives its evaluation criteria from intrinsic properties of the data set itself, independently of any specific learning algorithm, and is therefore widely applicable. Its evaluation criteria fall into four classes: distance measures, information measures, correlation measures, and consistency measures.

(1) Distance measures include geometric distance measures and probabilistic distance measures. The evaluation criteria for geometric distance are usually the within-class and between-class scatter matrices. The within-class scatter matrix describes how the sample points of each class are distributed around the class mean, and the between-class scatter matrix describes how the classes are distributed in space. Attribute selection should make the trace of the within-class scatter matrix as small as possible and the trace of the between-class scatter matrix as large as possible. A criterion based on probabilistic distance is the Kullback-Leibler distance, also known as relative entropy, which measures the difference between two probability distributions over the same event space. Because it requires the probability density function of every class to be known, it has significant limitations.

(2) Information measures use entropy-based evaluation criteria from information theory, such as minimum description length (Minimum Description Length), mutual information (Mutual Information), and information gain (Information Gain). These criteria describe the complexity of an attribute, that is, the amount of information it carries, and attribute selection usually favors attributes of higher complexity.

(3) Correlation measures examine the degree of association between attributes, that is, relevance and redundancy. Linear association is measured with linear correlation coefficients (the Pearson coefficient, the Spearman correlation coefficient, and so on), while nonlinear association is measured with entropy-based quantities such as mutual information and symmetric uncertainty.

(4) Consistency measures try to find the smallest attribute subset that has the same discriminating ability as the full attribute set. An inconsistency is defined as two samples that take the same values on the selected attribute subset but belong to different classes.
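The filter criteria listed above can be illustrated with a small standard-library sketch. It is not part of the patent: the function names, the toy data, and the majority-class definition of the inconsistency rate are assumptions chosen for illustration. It computes an information measure (information gain), a correlation measure (Pearson coefficient), and a consistency measure.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of the labels minus their entropy after splitting on a discrete attribute."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def pearson(x, y):
    """Pearson linear correlation coefficient of two numeric sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def inconsistency_rate(rows, labels, subset):
    """Share of samples that disagree with the majority class among samples
    having identical values on the attributes listed in `subset`."""
    groups = defaultdict(Counter)
    for row, y in zip(rows, labels):
        groups[tuple(row[i] for i in subset)][y] += 1
    clashes = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return clashes / len(rows)

# Hypothetical toy data: three discrete attributes and a binary label.
rows = [("M", "high", "yes"), ("M", "high", "no"),
        ("F", "low", "yes"), ("F", "low", "no")]
labels = [1, 0, 1, 0]

print(information_gain([r[2] for r in rows], labels))  # attribute 2 separates the classes
print(information_gain([r[0] for r in rows], labels))  # attribute 0 says nothing about the label
print(inconsistency_rate(rows, labels, [0, 1]))        # attributes 0 and 1 leave clashes
print(inconsistency_rate(rows, labels, [2]))           # attribute 2 alone is consistent
print(pearson([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]))
```

None of these scores requires training a classifier, which is why filter criteria are cheap to evaluate on large data sets.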

2. Wrapper attribute selection (Wrapper)

Wrapper attribute selection evaluates an attribute subset by the performance of a learning algorithm: the wrapper method trains a classifier on the subset under evaluation and then scores the subset according to the classifier's performance.

A wide variety of learning algorithms can be used to evaluate attribute subsets in the wrapper method, which places few requirements on the learner. Most classification algorithms are suitable, for example decision trees, neural networks, Bayesian classifiers, support vector machines, and nearest-neighbour methods.
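A minimal sketch of the wrapper idea follows. It assumes a 1-nearest-neighbour classifier scored by leave-one-out accuracy and an exhaustive search over subsets; the function names and the toy data are hypothetical, and the patent does not prescribe any particular learner or search strategy.

```python
from itertools import combinations

def loo_1nn_accuracy(rows, labels, subset):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier that
    only sees the attributes listed in `subset`."""
    correct = 0
    for i, row in enumerate(rows):
        best_dist, best_label = None, None
        for j, other in enumerate(rows):
            if i == j:
                continue  # the held-out sample may not vote for itself
            dist = sum((row[k] - other[k]) ** 2 for k in subset)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, labels[j]
        correct += best_label == labels[i]
    return correct / len(rows)

def best_subset(rows, labels, n_attributes):
    """Score every non-empty attribute subset and return the best one.
    max() keeps the first best candidate, so smaller subsets win ties."""
    candidates = [
        subset
        for size in range(1, n_attributes + 1)
        for subset in combinations(range(n_attributes), size)
    ]
    return max(candidates, key=lambda s: loo_1nn_accuracy(rows, labels, s))

# Hypothetical data: attribute 0 tracks the class, attribute 1 is noise.
rows = [(0.1, 0.9), (0.2, 0.1), (0.3, 0.5), (0.8, 0.2), (0.9, 0.8), (1.0, 0.4)]
labels = [0, 0, 0, 1, 1, 1]

print(loo_1nn_accuracy(rows, labels, (0,)))
print(loo_1nn_accuracy(rows, labels, (1,)))
print(best_subset(rows, labels, 2))
```

Every candidate subset costs one full evaluation of the classifier, and the number of subsets grows exponentially with the number of attributes, which is the source of the high computational cost of wrapper methods.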

The filter method is highly general, needs no model training step, and has low algorithmic complexity, so it suits large data sets and can quickly remove many irrelevant attributes. Because it is independent of any specific learning algorithm, however, its classification accuracy is relatively low. The wrapper method yields attribute subsets with better classification performance, but the selected attributes generalize poorly and the computational complexity is high, so execution time on large data sets is very long.

In summary, existing attribute selection algorithms choose attribute subsets against one specific evaluation index only. They cannot balance generality against algorithmic complexity and handle large data sets inefficiently, and their results on biomedical data are unsatisfactory and need further improvement.

Summary of the invention

The purpose of the present invention is to solve the problem of attribute selection in biomedical data by proposing an attribute selection method based on the Boruta algorithm and logistic regression.

The design principle of the invention is as follows. First, the Boruta algorithm analyzes the importance of the candidate attributes and extracts the attributes that are important to the research target. Then a logistic regression model is built from the candidate attributes, and stepwise regression under the AIC criterion yields the attributes that significantly affect the research target. Finally, the attributes screened by the two methods are fused by taking their intersection, with reference to expert opinion, to obtain the final key attributes. Because the two methods differ markedly from each other, the invention avoids the limitations of a single method and improves the generalizability of the key attributes.

The technical solution of the present invention is achieved by the following steps:

Step 1: the data set S contains N samples and M candidate attributes that may affect the intervention effect for type 2 diabetes. The Boruta algorithm is used to fit the intervention effect of the population, yielding the key attributes that affect the intervention effect and an importance ranking of the candidate attributes. The specific implementation is:

Step 1.1: create copy attributes of data set S and shuffle them to obtain the recombined data set S', which increases the randomness of the given data set;

Step 1.2: let the number of trees to build be n (0<n≤N). From data set S', randomly draw n new sample sets with replacement using the bootstrap method. Each sample set contains about 2N/3 of the data and is recorded as D_i (0<i≤n); the data not drawn are recorded as the out-of-bag data O_i (0<i≤n). Build a random forest model containing n trees;

Step 1.3: build classification and regression trees and compute the out-of-bag mean square error of each tree, denoted MSE_i; the original out-of-bag mean square error vector of the n trees is then [MSE₁,MSE₂,…,MSE_n];

Step 1.4: based on the mean square error vector [MSE₁,MSE₂,…,MSE_n] obtained in step 1.3, compute the Z value of each attribute. Determine the copy attribute with the largest Z value; attributes whose Z value is larger than this are selected as important attributes, and attributes whose Z value is smaller are marked as unimportant and deleted from the data set;

Step 1.5: repeat steps 1.1 to 1.4 until the preset termination condition is reached;

Step 1.6: the mean square error of an attribute represents its importance; the attributes ranked highest in importance are those with an important influence on the type 2 diabetes intervention effect.
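The data preparation in steps 1.1 and 1.2 can be sketched with the Python standard library as follows. This is a simplified illustration under stated assumptions, not the patented implementation: the function names, the fixed random seed, and the placeholder data are hypothetical, and the tree building itself is omitted.

```python
import random

def add_shadow_attributes(rows, rng):
    """Append a shuffled copy (a copy attribute) of every column to each row.
    Shuffling keeps each column's value distribution but breaks any
    relationship between the copy and the target."""
    n_attrs = len(rows[0])
    shadows = []
    for col in range(n_attrs):
        values = [row[col] for row in rows]
        rng.shuffle(values)
        shadows.append(values)
    return [
        tuple(row) + tuple(shadows[col][i] for col in range(n_attrs))
        for i, row in enumerate(rows)
    ]

def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; the indices that were
    never drawn form the out-of-bag set."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    drawn = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in drawn]
    return in_bag, out_of_bag

rng = random.Random(0)  # fixed seed so the sketch is repeatable
rows = [(float(i), float(i % 3)) for i in range(452)]  # placeholder values, not real data

extended = add_shadow_attributes(rows, rng)
in_bag, out_of_bag = bootstrap_split(len(rows), rng)

print(len(extended[0]))               # original attributes plus their copies
print(len(set(in_bag)) / len(rows))   # share of distinct samples in the bag
```

For a bootstrap draw of size N, the expected share of distinct samples is 1 - (1 - 1/N)^N, which approaches 1 - 1/e (about 0.632) for large N and is close to the 2N/3 stated in step 1.2.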

Step 2: based on the original data set S, build a logistic regression discrimination model for the intervention effect of the population and fit the intervention effect with the logistic regression algorithm, obtaining the attributes that significantly affect the intervention effect. The specific implementation is:

Step 2.1: normalize the M candidate attributes, computed as:

$$\tilde{x}_i^l = \frac{x_i^l - x_{\min}^l}{x_{\max}^l - x_{\min}^l}$$

where $x_i^l$ is the original value of the l-th attribute of the i-th sample, $\tilde{x}_i^l$ is the normalized value of the l-th attribute of the i-th sample, and $x_{\max}^l$ and $x_{\min}^l$ are the maximum and minimum of the l-th attribute over the samples;

Step 2.2: build the logistic regression model and compute the coefficient of each attribute by maximum likelihood estimation. The regression equation is:

$$F(x) = b_0 + b_1 x^1 + b_2 x^2 + \dots + b_M x^M$$

where $x^l$ (0<l≤M) denotes the l-th attribute (the superscript is an attribute index, not a power), M is the number of attributes, and $b_l$ (0<l≤M) is the weight of each attribute in the logistic regression model;

Step 2.3: perform stepwise regression under the AIC criterion, take the attribute combination that minimizes the AIC, and build the logistic regression model from it;

Step 2.4: perform a significance test on the weights of the M attributes in the logistic regression model and, at a significance level of 0.05, select the attributes that significantly affect the intervention effect.
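The quantities used in steps 2.1 to 2.3 can be sketched as follows. The sketch assumes min-max normalization, the standard logistic link applied to the linear predictor F(x), and the usual definition AIC = 2k - 2 ln L; the function names and the toy values are hypothetical, and the maximum likelihood fitting and the stepwise search themselves are omitted.

```python
import math

def min_max_normalize(rows):
    """Rescale every attribute (column) to the range [0, 1]."""
    n_attrs = len(rows[0])
    lows = [min(r[l] for r in rows) for l in range(n_attrs)]
    highs = [max(r[l] for r in rows) for l in range(n_attrs)]
    return [
        tuple(
            (r[l] - lows[l]) / (highs[l] - lows[l]) if highs[l] > lows[l] else 0.0
            for l in range(n_attrs)
        )
        for r in rows
    ]

def predict_probability(x, b):
    """P(y = 1 | x) = 1 / (1 + exp(-F(x))), where F(x) is the linear
    predictor b[0] + b[1]*x[0] + ... + b[M]*x[M-1]."""
    f = b[0] + sum(b_l * x_l for b_l, x_l in zip(b[1:], x))
    return 1.0 / (1.0 + math.exp(-f))

def log_likelihood(rows, labels, b):
    """Log-likelihood of binary labels under the logistic model."""
    total = 0.0
    for x, y in zip(rows, labels):
        p = predict_probability(x, b)
        total += math.log(p) if y == 1 else math.log(1.0 - p)
    return total

def aic(log_lik, n_parameters):
    """Akaike information criterion; lower values are preferred."""
    return 2 * n_parameters - 2 * log_lik

# Hypothetical values for two attributes and three samples.
rows = [(50.0, 18.0), (70.0, 24.0), (90.0, 30.0)]
labels = [0, 1, 1]
norm = min_max_normalize(rows)
print(norm)
print(predict_probability(norm[0], [0.0, 1.0, 1.0]))
print(aic(log_likelihood(norm, labels, [0.0, 0.0, 0.0]), 3))
```

A stepwise search would refit the model for each candidate attribute combination, compute its AIC in this way, and keep the combination with the smallest value.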

Step 3: from the attributes with an important influence on the intervention effect obtained by the Boruta algorithm in step 1 and the attributes with a significant effect on the intervention effect obtained by the logistic regression algorithm in step 2, obtain the key attributes affecting the intervention effect by taking the intersection, with reference to expert opinion. The specific implementation is:

Step 3.1: with reference to expert knowledge and opinion, and considering how difficult each attribute is to collect, further screen the attributes selected in step 1 and step 2;

Step 3.2: from the attributes with an important influence on the intervention effect obtained by the Boruta algorithm and the attributes with a significant effect on the intervention effect obtained by the logistic regression algorithm, select the attributes that appear in both as the key attributes affecting the intervention effect.

Beneficial effects

The biomedical key attribute selection method proposed by the invention is based on the Boruta algorithm and the logistic regression algorithm. Logistic regression gives the degree to which each attribute affects the intervention effect, and the attributes with a significant effect are selected; the Boruta algorithm quantifies how sensitive the intervention effect is to each attribute. Combining the key attributes obtained by the two methods ensures both that the key attributes are credible and that they have an important influence on the intervention effect. Because the two algorithms differ markedly from each other, the limitations of a single method are avoided, and the result provides guidance for targeted adjustment of intervention measures.

Brief description of the drawings

Fig. 1 is a schematic diagram of the biomedical key attribute selection method proposed by the invention;

Fig. 2 shows the attribute importance ranking based on the Boruta algorithm in the embodiment.

Embodiment

To better illustrate the objects and advantages of the invention, embodiments of the method are described in further detail below with reference to the drawings and examples.

All tests below were run on the same computer, configured as follows: Intel dual-core CPU (2.53 GHz clock), 4 GB memory, Windows 7 operating system.

The test data come from the intensive lifestyle intervention management data for individuals at high risk of type 2 diabetes, collected in cooperation with Beijing Hospital's Gerontological Research Center. An RSD judgment was made on the enrollment status of the participants in the intervention group, and the data of those at high risk at enrollment were selected, 452 samples in total.

After half a year of intervention, the RSD risk status was judged again. Individuals who remained at high risk were assigned to class 0, and those whose risk status declined were assigned to class 1 (that is, the intervention was effective); this served as the label of the binary classification training data. There are 20 attributes in total: 7 intervention-measure attributes (total exercise time, effective exercise amount, effective exercise time, number of effective exercise sessions, effective amount to body weight ratio, actual intake, balance amount), 10 body indicators at enrollment (body weight, BMI, waist circumference, systolic blood pressure, diastolic blood pressure, blood glucose, cholesterol, triglycerides, high-density lipoprotein (HDL), low-density lipoprotein), and 3 basic indicators (age, sex, family history of diabetes).

1. Important attributes based on the Boruta algorithm

Over the half-year intervention period the high-risk population changes in different ways. The Boruta algorithm is used to select the attributes that are important to the intervention effect according to the importance ranking; the schematic diagram is shown in Fig. 1. The specific steps are as follows:

Step 1: copy the data of the variables and shuffle the copies to build the copy attributes, obtaining the extended data set;

Step 2: based on the extended data set, draw sample sets by the bootstrap method, treat the samples not drawn as out-of-bag data, and build a random forest classifier;

Step 3: train the classification and regression trees and compute the mean square error of the out-of-bag data of each tree, MSE_i (0<i≤n); the original out-of-bag mean square error vector of the n trees can then be expressed as [MSE₁,MSE₂,…,MSE_n];

Step 4: based on the original out-of-bag mean square error vector [MSE₁,MSE₂,…,MSE_n] obtained in step 3, compute the Z value of each attribute. Find the copy attribute with the largest Z value; the attributes with a larger Z value are merged into the set of important attributes, the attributes with a smaller Z value are classed as unimportant, and the unimportant attributes and the copy attributes are deleted;

Step 5: repeat steps 1 to 4 until the preset termination condition is reached;

Step 6: according to the results of the Boruta algorithm, the mean square error of an attribute represents its importance; the attributes ranked highest in importance are those with an important influence on the type 2 diabetes intervention effect.

After half a year of intervention, the 20 attributes are ranked by importance according to the out-of-bag mean square error; the result is shown in Fig. 2.

By comparing the Z values of the attributes over the iterations, 5 important attributes are finally determined: effective exercise amount, effective amount to body weight ratio, waist circumference, effective exercise time, and BMI. One attribute, low-density lipoprotein, remains tentative, and the remaining 14 attributes are unimportant.
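The comparison against the best copy attribute can be sketched as below. The sketch assumes that the Z value of an attribute is the mean of its per-tree importance scores divided by their standard deviation, which is the usual Boruta convention; the patent does not spell out the formula. All names and numbers are hypothetical, and the tentative category reported above is not modeled.

```python
import statistics

def z_value(per_tree_importance):
    """Mean importance across trees divided by its standard deviation."""
    return statistics.mean(per_tree_importance) / statistics.stdev(per_tree_importance)

def split_by_best_shadow(z_values, shadow_z_values):
    """Attributes whose Z value exceeds the largest Z value among the copy
    attributes are kept as important; the rest are marked unimportant."""
    threshold = max(shadow_z_values.values())
    important = sorted(name for name, z in z_values.items() if z > threshold)
    unimportant = sorted(name for name, z in z_values.items() if z <= threshold)
    return important, unimportant, threshold

# Hypothetical per-tree importance scores (not results from the patent).
real = {"attr_a": [4.0, 5.0, 6.0], "attr_b": [1.0, 2.0, 3.0]}
shadow = {"shadow_a": [2.0, 3.0, 4.0], "shadow_b": [0.5, 1.0, 1.5]}

z_real = {name: z_value(scores) for name, scores in real.items()}
z_shadow = {name: z_value(scores) for name, scores in shadow.items()}
important, unimportant, threshold = split_by_best_shadow(z_real, z_shadow)
print(important, unimportant, threshold)
```

Because the copy attributes are shuffled, their scores estimate how much apparent importance arises by chance, so only attributes that beat the best copy are retained.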

2. Significant attributes based on logistic regression

For the changes in the population at each intervention time, the logistic regression algorithm is applied: the weight represents the degree to which each attribute influences the intervention effect, and sig denotes the result of the significance test. When sig≤0.05, the attribute has a significant effect on the intervention effect. The specific steps are as follows:

Step 1: normalize the candidate attributes to eliminate the influence of their different units of measurement on the result;

Step 2: build the logistic regression model and compute the coefficient of each attribute by maximum likelihood estimation;

Step 3: perform stepwise regression under the AIC criterion, take the attribute combination that minimizes the AIC, and build the logistic regression model from it;

Step 4: perform a significance test on the attribute weights in the logistic regression model and, at a significance level of 0.05, select the attributes that significantly affect the intervention effect.
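The screening in step 4 can be sketched as follows, assuming the significance test is a two-sided Wald test on each coefficient (the patent does not name the test). The coefficient values, standard errors, and attribute names are hypothetical.

```python
import math

def wald_p_value(coefficient, standard_error):
    """Two-sided p-value for the hypothesis that the coefficient is zero,
    using the normal approximation z = coefficient / standard_error."""
    z = coefficient / standard_error
    return math.erfc(abs(z) / math.sqrt(2.0))

def significant_attributes(estimates, alpha=0.05):
    """Names of the attributes whose p-value (sig) is at most alpha."""
    return sorted(
        name
        for name, (coefficient, standard_error) in estimates.items()
        if wald_p_value(coefficient, standard_error) <= alpha
    )

# Hypothetical (coefficient, standard error) pairs, not values from Table 1.
estimates = {"attr_a": (2.5, 1.0), "attr_b": (0.5, 1.0)}
print(wald_p_value(1.96, 1.0))
print(significant_attributes(estimates))
```

The threshold comparison uses sig≤0.05, matching the rule stated for the embodiment.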

The experimental results are shown in Table 1:

Table 1. Attribute importance ranking results of logistic regression

According to the experimental results, at a significance level of 0.05, the effective amount to body weight ratio, effective exercise time, and BMI are selected as the significant attributes.

3. Attribute fusion

From the attributes with an important influence on the intervention effect obtained by the Boruta algorithm and the attributes with a significant effect on the intervention effect obtained by the logistic regression algorithm, the attributes that appear in both are selected as the key attributes affecting the intervention effect. The effective amount to body weight ratio, effective exercise time, and BMI are finally determined to be the key attributes affecting the intervention effect.
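The fusion step amounts to a set intersection, sketched below with the attribute sets reported in the embodiment. The function name and the optional expert exclusion list are illustrative assumptions; the embodiment does not report any attribute being removed on expert advice.

```python
def fuse_attributes(boruta_important, regression_significant, expert_excluded=()):
    """Drop the attributes rejected on expert advice (for example because they
    are hard to collect), then keep the attributes selected by both methods."""
    excluded = set(expert_excluded)
    return (set(boruta_important) - excluded) & (set(regression_significant) - excluded)

# Attribute sets as reported in the embodiment.
boruta_important = {
    "effective exercise amount",
    "effective amount to body weight ratio",
    "waist circumference",
    "effective exercise time",
    "BMI",
}
regression_significant = {
    "effective amount to body weight ratio",
    "effective exercise time",
    "BMI",
}

key_attributes = fuse_attributes(boruta_important, regression_significant)
print(sorted(key_attributes))
```

Taking the intersection is deliberately conservative: an attribute is kept only if both the random-forest-based ranking and the regression-based test support it.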

Claims

1. A biomedical key attribute selection method, characterized in that the method comprises the following steps:

Step 1: the data set S contains N samples and M candidate attributes that may affect the intervention effect for type 2 diabetes; the Boruta algorithm is used to fit the intervention effect of the population, yielding the key attributes that affect the intervention effect and an importance ranking of the candidate attributes;

Step 2: based on the original data set S, a logistic regression discrimination model is built for the intervention effect of the population, and the logistic regression algorithm is used to fit the intervention effect of the population, yielding the attributes that significantly affect the intervention effect;

Step 3: from the attributes with an important influence on the intervention effect obtained by the Boruta algorithm in step 1 and the attributes with a significant effect on the intervention effect obtained by the logistic regression algorithm in step 2, the key attributes affecting the intervention effect are obtained by taking the intersection, with reference to expert opinion.

2. The method according to claim 1, characterized in that the step of selecting important attributes with the Boruta algorithm specifically comprises:

Step 2.1: create copy attributes of data set S and shuffle them to obtain the recombined data set S', which increases the randomness of the given data set;

Step 2.2: let the number of trees to build be n (0<n≤N); from data set S', randomly draw n new sample sets with replacement using the bootstrap method, each sample set containing about 2N/3 of the data and being recorded as D_i (0<i≤n), and the data not drawn being recorded as the out-of-bag data O_i (0<i≤n); build a random forest model containing n trees;

Step 2.3: build classification and regression trees and compute the out-of-bag mean square error of each tree, denoted MSE_i; the original out-of-bag mean square error vector of the n trees is then [MSE_1, MSE_2, ..., MSE_n];

Step 2.4: based on the mean square error vector [MSE_1, MSE_2, ..., MSE_n] obtained in step 2.3, compute the Z value of each attribute; determine the copy attribute with the largest Z value, select the attributes whose Z value is larger than this as important attributes, and mark the attributes whose Z value is smaller as unimportant and delete them from the data set;

Step 2.5: repeat steps 2.1 to 2.4 until the preset termination condition is reached;

Step 2.6: the mean square error of an attribute represents its importance; the attributes ranked highest in importance are those with an important influence on the type 2 diabetes intervention effect.

3. The method according to claim 1, characterized in that the step of selecting significant attributes with the logistic regression algorithm specifically comprises:

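Step 3.1: normalize the M candidate attributes, computed as:

$$\tilde{x}_i^l = \frac{x_i^l - x_{\min}^l}{x_{\max}^l - x_{\min}^l}$$

where $x_i^l$ is the original value of the l-th attribute of the i-th sample, $\tilde{x}_i^l$ is its normalized value, and $x_{\max}^l$ and $x_{\min}^l$ are the maximum and minimum of the l-th attribute over the samples;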

Step 3.2: build the logistic regression model and compute the coefficient of each attribute by maximum likelihood estimation, the regression equation being:

$$F(x) = b_0 + b_1 x^1 + b_2 x^2 + \dots + b_M x^M$$

where $x^l$ (0<l≤M) denotes the l-th attribute (the superscript is an attribute index, not a power), M is the number of attributes, and $b_l$ (0<l≤M) is the weight of each attribute in the logistic regression model;

Step 3.3: perform stepwise regression under the AIC criterion, take the attribute combination that minimizes the AIC, and build the logistic regression model from it;

Step 3.4: perform a significance test on the weights of the M attributes in the logistic regression model and, at a significance level of 0.05, select the attributes that significantly affect the intervention effect.

4. The method according to claim 1, characterized in that the step of obtaining the key attributes by attribute fusion specifically comprises:

Step 4.1: with reference to expert knowledge and opinion, and considering how difficult each attribute is to collect, further screen the attributes selected in step 1 and step 2;

Step 4.2: from the attributes with an important influence on the intervention effect obtained by the Boruta algorithm and the attributes with a significant effect on the intervention effect obtained by the logistic regression algorithm, select the attributes that appear in both as the key attributes affecting the intervention effect.