CN111863266A

CN111863266A - Method for screening risk factors of sporadic colorectal adenomas based on directional weighted association rule model

Info

Publication number: CN111863266A
Application number: CN202010057865.7A
Authority: CN
Inventors: 余盖青; 高俊波; 程陈; 费若岚; 王长静
Original assignee: Shanghai Maritime University
Current assignee: Shanghai Maritime University
Priority date: 2020-01-16
Filing date: 2020-01-16
Publication date: 2020-10-30
Anticipated expiration: 2040-01-16
Also published as: CN111863266B

Abstract

The invention discloses a method for screening risk factors of sporadic colorectal adenomas based on a directional weighted association rule model, and belongs to the field of data mining. The method first preprocesses the data. It then selects features using the random forest mean decrease in impurity method, determining the optimal split nodes by information gain to obtain a preferred feature set. The preferred feature set is then input into the directional weighted association rule model to generate strong association rules. Finally, the risk factors contained in the strong association rules are incorporated into a risk factor set and discussed with experts. Compared with the prior art, the invention provides a directional weighted association rule model for screening the risk factors of colorectal adenomas, confirms the importance of lifestyle and dietary habit factors in the etiology of colorectal adenomas, identifies high-risk factors not detected in previous research, and provides a reference method for finding the risk factors of colorectal adenomas.

Description

Method for screening risk factors of sporadic colorectal adenomas based on directional weighted association rule model

Technical Field

The invention relates to medical data analysis, in particular to a method for screening risk factors of sporadic colorectal adenomas based on a directional weighted association rule model.

Background

Sporadic colorectal adenomas (CRA) are benign glandular tumors of the colon and rectum and are precursor lesions of colorectal cancer. Early detection and timely treatment can effectively reduce the probability of malignant transformation and are important for prolonging patient survival. Studies have found that CRA is closely related to dietary habits, and that 66% to 78% of colorectal adenomas can be avoided through healthy lifestyle habits. However, some important risk factors are still ignored or undiscovered, so patients cannot be effectively guided toward a healthier lifestyle and the current situation cannot be improved.

In recent years, more and more researchers have recognized the importance of lifestyle factors in the etiology of colorectal adenomas and have devoted themselves to studying their risk factors. However, the methods used for risk factor analysis lack variety: traditional methods are reasonably effective for single-factor analysis but are not comprehensive, and they easily miss important risk factors that occur with low probability. To overcome these problems, a directional weighted association rule model is proposed. It is an efficient association rule mining model built by combining a probability-based weighted support with a fixed consequent. The risk factors for colorectal adenomas are analyzed by generating rule patterns of colorectal adenoma onset.

Disclosure of Invention

The purpose of the invention is to solve the technical problems described in the background by providing a method for screening risk factors of sporadic colorectal adenomas based on a directional weighted association rule model. The technical scheme adopted by the invention is as follows:

A method for screening risk factors of sporadic colorectal adenomas based on a directional weighted association rule model comprises the following specific steps:

S1, preprocessing the data;

S2, selecting features by the random forest mean decrease in impurity method;

S3, analyzing with the directional weighted association rule model;

S4, incorporating the risk factors contained in the strong association rules generated in S3 into a risk factor set, and discussing them with experts.

As a further technical scheme of the invention, the data preprocessing of step S1 comprises the following steps:

S101, deleting irrelevant data;

S102, deleting redundant information, deleting feature columns in which more than 50% of the values are missing, and deleting obviously abnormal dirty data;

S103, data conversion.
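Steps S101 to S103 can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the column names (`phone`, `smoking`, `rare_col`, `bq`) are hypothetical, missing values are assumed to be encoded as `None`, and the `<column>_<value>` item encoding in `to_items` is inferred from the consequent name 'bq_1' used in the embodiment rather than specified by the text.

```python
from typing import Any, Optional

Row = dict[str, Optional[Any]]


def drop_columns(rows: list[Row], names: set[str]) -> list[Row]:
    """S101: remove columns judged irrelevant to the analysis."""
    return [{k: v for k, v in r.items() if k not in names} for r in rows]


def drop_sparse_columns(rows: list[Row], max_missing: float = 0.5) -> list[Row]:
    """S102: remove columns whose share of missing (None) values exceeds max_missing."""
    if not rows:
        return []
    n = len(rows)
    keep = [c for c in rows[0] if sum(r.get(c) is None for r in rows) / n <= max_missing]
    return [{c: r.get(c) for c in keep} for r in rows]


def to_items(row: Row) -> set[str]:
    """S103 (assumed encoding): turn each non-missing cell into a '<column>_<value>' item."""
    return {f"{k}_{v}" for k, v in row.items() if v is not None}


# Hypothetical toy records; 'rare_col' is missing in 2 of 3 rows (> 50%).
raw = [
    {"phone": "n/a", "smoking": 1, "rare_col": None, "bq": 1},
    {"phone": "n/a", "smoking": 0, "rare_col": None, "bq": 0},
    {"phone": "n/a", "smoking": 1, "rare_col": 2, "bq": 0},
]
clean = drop_sparse_columns(drop_columns(raw, {"phone"}))
transactions = [to_items(r) for r in clean]
print(transactions[0])
```

Removal of abnormal records is omitted here because the text does not state how abnormality is detected.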

As a further technical scheme of the invention, step S2, selecting features by the random forest mean decrease in impurity method, comprises the following steps:

S201, calculating the information entropy H1 of the original data: $H_1 = -\sum_{c} p_c \log_2 p_c$, where $p_c$ is the proportion of records belonging to class $c$;

S202, selecting a feature, dividing the data according to the values of that feature, calculating the information entropy of each class, and summing these entropies in proportion to class size to obtain the information entropy H2 of this division;

S203, calculating the information gain: info_gain = H1 - H2;

S204, calculating the information gain of every feature according to S202 and S203, and retaining the feature attributes with larger gain;

S205, according to the feature index corresponding to the maximum information gain, placing the top-ranked features into a set, which is taken as the preferred feature set.
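The information-gain ranking of S201 to S205 can be illustrated with the short Python sketch below. It assumes base-2 Shannon entropy and categorical feature values, and it shows only the gain-based ranking, not a full random forest; the function names and the toy feature table are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels (S201)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(values, labels):
    """H1 - H2 for one feature; H2 is the size-weighted entropy of the groups (S202, S203)."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)
    h2 = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - h2


def top_features(table, labels, k):
    """Rank features by information gain and keep the k best (S204, S205)."""
    gains = {name: info_gain(col, labels) for name, col in table.items()}
    return sorted(gains, key=gains.get, reverse=True)[:k]


# Hypothetical toy data: one feature separates the classes, the other does not.
labels = [1, 1, 0, 0]
table = {"f_perfect": [1, 1, 0, 0], "f_useless": [1, 0, 1, 0]}
print(top_features(table, labels, k=1))  # ['f_perfect']
```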

As a further technical scheme of the invention, step S3, analyzing with the directional weighted association rule model, comprises the following steps:

Definition: let $I = \{i_1, i_2, \ldots, i_m\}$ be the set of item attributes. Let D be a set of transactions T, where each T is a set of item attributes with $T \subseteq I$. Each transaction T has a unique identifier, denoted TID. Let X be a set of items in I; if $X \subseteq T$, then transaction T is said to contain X.

Definition: the weight of an item attribute $i_j$ is a value associated with that item attribute, denoted $w(i_j)$. If $P(i_j)$ is the probability that item attribute $i_j$ occurs in the transaction set D, then $w(i_j)$ is the reciprocal of $P(i_j)$. The weight of a patient transaction is the weight of one record in the patient data set, denoted $w(T_k)$; it is the average of the weights of all item attributes belonging to $T_k$, where $T_k$ is the k-th record in transaction set D.

Formula (1): $w(T_k) = \frac{1}{|T_k|} \sum_{i_j \in T_k} w(i_j)$, with $w(i_j) = \frac{1}{P(i_j)}$
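The two weights defined above can be computed as in the following Python sketch. It assumes that each transaction is a non-empty set of items and that P(i_j) is estimated as the fraction of transactions containing i_j; the toy transaction set `D` is hypothetical.

```python
from collections import Counter


def item_weights(transactions):
    """w(i_j) = 1 / P(i_j), with P(i_j) the share of transactions containing i_j."""
    n = len(transactions)
    counts = Counter(item for t in transactions for item in t)
    return {item: n / c for item, c in counts.items()}


def transaction_weight(t, w):
    """w(T_k): mean weight of the item attributes in transaction T_k."""
    return sum(w[i] for i in t) / len(t)


# Hypothetical toy transaction set.
D = [{"a", "b"}, {"a"}, {"a", "c"}, {"b"}]
w = item_weights(D)
print(w["b"], w["c"])               # 2.0 4.0
print(transaction_weight(D[2], w))  # mean of w["a"] = 4/3 and w["c"] = 4
```

Rare items receive large weights, so records that contain them count for more in the weighted support; this matches the stated goal of not missing low-probability risk factors.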

Formula (2): the weighted support of the association rule A → B, denoted wsp(A, B):

Formula (3): the confidence of the association rule A → B, denoted conf(A, B):

Formula (4): the lift of the association rule A → B, denoted lift(A, B), where lift(A, B) > 1 indicates that A and B are positively correlated, lift(A, B) < 1 indicates that A and B are negatively correlated, and lift(A, B) = 1 indicates that A and B are independent:
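The bodies of formulas (2) to (4) do not appear in this text, so the Python sketch below relies on explicitly assumed definitions, chosen only for illustration: the weighted support of an item set is taken as the total weight of the transactions containing it divided by the total weight of all transactions, the confidence as wsp(A ∪ B) / wsp(A), and the lift as conf(A, B) / wsp(B). These may differ from the formulas of the invention.

```python
from collections import Counter


def transaction_weights(D):
    """w(T_k) for every transaction, following formula (1)."""
    n = len(D)
    counts = Counter(i for t in D for i in t)
    return [sum(n / counts[i] for i in t) / len(t) for t in D]


def wsp(itemset, D, tw):
    """ASSUMED weighted support: weight of transactions containing itemset / total weight."""
    return sum(w for t, w in zip(D, tw) if itemset <= t) / sum(tw)


def conf(A, B, D, tw):
    """ASSUMED confidence of A -> B: wsp(A u B) / wsp(A)."""
    return wsp(A | B, D, tw) / wsp(A, D, tw)


def lift(A, B, D, tw):
    """ASSUMED lift of A -> B: conf(A, B) / wsp(B)."""
    return conf(A, B, D, tw) / wsp(B, D, tw)


# Hypothetical toy transaction set.
D = [{"x", "y"}, {"x", "y"}, {"x"}, {"z"}]
tw = transaction_weights(D)
A, B = {"x"}, {"y"}
print(round(conf(A, B, D, tw), 3), round(lift(A, B, D, tw), 3))
```

In this toy set the lift of x → y is above 1, which under the interpretation given with formula (4) indicates a positive correlation.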

S301, scanning the database D to obtain the occurrence probability P(i_j) of each item attribute i_j, and calculating the transaction weight w(T_k) (see formula (1));

S302, scanning the database D, setting the pathology as the consequent after_item of the association rules, and putting all other features into a set Q; setting a minimum support threshold min_sup, a minimum confidence threshold min_conf, and a maximum number of iterations max_rule_length;

S303, initializing the frequent 1-item sets: every item in Q is joined with the consequent after_item, and the items whose weighted support is greater than min_sup are selected and put into L0 (the weighted support is calculated by formula (2));

S304, generating the frequent (k+1)-item sets from the frequent k-item sets. The core method is a recursive method based on frequent itemset theory: the frequent 1-item sets L1 are generated first, then the frequent 2-item sets L2, and so on until Lr is generated for the maximum rule length r, at which point the algorithm stops. In the k-th iteration, the process first generates a set Ck of candidate k-item sets, each of which is produced by self-joining Lk-1. The item sets in Ck are the candidates for the frequent item sets, and the final frequent item set Lk must be a subset of Ck. Lk is generated from Ck as follows: for each item set in Ck, calculate its weighted support sup1 and the weighted support sup2 of the same item set with after_item removed, and put the item sets whose sup1 is greater than min_sup into L(k+1);

S305, L = [L2, …, Lr]. For each frequent item set in L, calculate conf, the ratio of the weighted support of the item set to the weighted support of the item set with after_item removed (see formula (3)), and the lift (see formula (4)); if conf is greater than min_conf, output a strong association rule.
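Steps S301 to S305 can be sketched as one small level-wise miner in Python. This is an illustration under stated assumptions, not the implementation of the invention: the weighted support is assumed to be the weight share of the transactions containing an item set, conf is taken as sup1 / sup2, lift as conf divided by the weighted support of the consequent, and max_rule_length is interpreted as the maximum antecedent length. The function name and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def mine_directed_rules(D, after_item, min_sup, min_conf, max_rule_length):
    """Return (antecedent, consequent, weighted support, confidence, lift) tuples."""
    n = len(D)
    counts = Counter(item for t in D for item in t)
    # S301: transaction weight = mean of 1 / P(i_j) over the items of T_k
    tw = [sum(n / counts[i] for i in t) / len(t) for t in D]
    total = sum(tw)

    def wsp(items):
        # ASSUMED weighted support: weight share of transactions containing items
        return sum(w for t, w in zip(D, tw) if items <= t) / total

    consequent = frozenset([after_item])
    # S302: the consequent is fixed; every other item is a candidate antecedent
    Q = sorted(set(counts) - consequent)
    # S303: keep single items that are frequent together with the consequent
    level = [frozenset([q]) for q in Q if wsp(frozenset([q]) | consequent) > min_sup]

    rules = []
    for length in range(1, max_rule_length + 1):
        if not level:
            break
        # S305: confidence = sup1 / sup2; a rule is kept when it exceeds min_conf
        for antecedent in level:
            sup1 = wsp(antecedent | consequent)
            sup2 = wsp(antecedent)
            confidence = sup1 / sup2
            if confidence > min_conf:
                lift = confidence / wsp(consequent)
                rules.append((antecedent, after_item, sup1, confidence, lift))
        # S304: self-join the current level, then prune by weighted support
        joined = {a | b for a, b in combinations(level, 2) if len(a | b) == length + 1}
        level = [c for c in joined if wsp(c | consequent) > min_sup]
    return rules


# Hypothetical toy transactions using the '<column>_<value>' item encoding.
D = [
    {"smoke_1", "bq_1"},
    {"smoke_1", "drink_1", "bq_1"},
    {"smoke_1", "bq_0"},
    {"drink_1", "bq_0"},
    {"tea_1", "bq_0"},
]
rules = mine_directed_rules(D, "bq_1", min_sup=0.3, min_conf=0.5, max_rule_length=5)
for antecedent, consequent, sup, confidence, lift in rules:
    print(sorted(antecedent), "->", consequent, round(sup, 3), round(confidence, 3), round(lift, 3))
```

Fixing the consequent means that only item sets co-occurring with after_item are ever extended, which keeps the candidate space much smaller than in undirected association rule mining.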

Compared with the prior art, the invention has the following advantages and positive effects:

1. The invention provides a method for screening risk factors of sporadic colorectal adenomas based on a directional weighted association rule model.

2. The invention first constructs a preferred feature set, which helps to improve the accuracy of the analysis results and shorten the computation.

3. Working from lifestyle and dietary habit data, the invention analyzes the high-risk factors of colorectal adenomas by mining the associations between those data and colorectal adenomas, and provides a reference method for screening the risk factors of colorectal adenomas.

Drawings

FIG. 1 is a schematic flow diagram of the present invention;

FIG. 2 is a flow chart of feature selection in accordance with the present invention;

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.

In a first embodiment, referring to FIG. 1, a method for screening risk factors of sporadic colorectal adenomas based on a directional weighted association rule model comprises the following specific steps:

S1, preprocessing the colorectal adenoma data: deleting irrelevant data, deleting redundant information, deleting feature columns in which more than 50% of the values are missing, and deleting obviously abnormal dirty data. A total of 234 cases were included in the standard dataset, of which 62 were diagnosed with colorectal adenomas.

S2, referring to FIG. 2, performing feature selection by the random forest mean decrease in impurity method.

(1) Calculating the information entropy of the original data gives the initial information entropy H1 = 0.8341351937.

(2) Calculating the information entropy H2, taking classification by features 7 and 24 as examples:

H2 (classified by feature 7) = 0.8283984298779227;

H2 (classified by feature 24) = 0.7903757392936914.

(3) Calculating the information gain info_gain, taking classification by features 7 and 24 as examples:

info_gain (classified by feature 7) = H1 - H2 (classified by feature 7) = 0.8341351937 - 0.8283984298 = 0.0057367638;

info_gain (classified by feature 24) = H1 - H2 (classified by feature 24) = 0.8341351937 - 0.7903757392 = 0.0437594544.

(4) Retaining the feature attributes with larger gain; the feature index corresponding to the maximum information gain is 24. The preferred feature set consists of the top 24 features ranked by feature importance.
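The figures of this worked example can be checked with a few lines of Python. The sketch assumes a binary pathology label with the 62 positive and 172 negative cases reported in S1 and base-2 logarithms; the H2 values are copied from the embodiment, and the variable names are hypothetical.

```python
import math

total, positive = 234, 62
p = positive / total
# Initial entropy H1 of the binary pathology label
h1 = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

# H2 values as reported for features 7 and 24
h2 = {7: 0.8283984298779227, 24: 0.7903757392936914}
gains = {feature: h1 - value for feature, value in h2.items()}
best = max(gains, key=gains.get)

print(round(h1, 4))  # 0.8341
print(best)          # 24
```

Under these assumptions the computed entropy agrees with the reported H1, and feature 24 has the larger gain of the two features shown.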

S3, analyzing with the directional weighted association rule model.

After repeated experiments, the experimental parameters were selected as follows: the maximum number of mined items is 5, the consequent is 'bq_1' (pathology = 1, i.e. colorectal adenoma present), the minimum weighted support is 0.3, and the minimum confidence is 0.5. Inputting the preferred feature set into the directional weighted association rule model generated 44 rule patterns of colorectal adenoma onset.

S4, incorporating the risk factors contained in the strong association rules generated in S3 into a risk factor set, and discussing them with experts. The 44 rules contain 7 important features, which include both traditional and non-traditional risk factors, supporting the effectiveness and correctness of the method.
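Step S4 amounts to taking the union of the features that appear in the antecedents of the strong rules. A minimal Python sketch follows; it assumes items are encoded as `<feature>_<value>` strings (inferred from the consequent name 'bq_1'), and the feature names in the example rules are hypothetical, not results of the invention.

```python
def risk_factor_set(rules):
    """Collect the feature names that occur in the antecedents of the strong rules."""
    factors = set()
    for antecedent, _consequent in rules:
        # Strip the trailing '_<value>' to recover the feature name
        factors.update(item.rsplit("_", 1)[0] for item in antecedent)
    return factors


# Hypothetical strong rules, each as (antecedent item set, consequent).
rules = [
    (frozenset({"smoking_1"}), "bq_1"),
    (frozenset({"smoking_1", "cured_meat_2"}), "bq_1"),
]
print(sorted(risk_factor_set(rules)))  # ['cured_meat', 'smoking']
```

The resulting set is what would be handed to the experts for review.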

The above description is only exemplary of the present invention and should not be taken as limiting the invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A method for screening risk factors of sporadic colorectal adenomas based on a directional weighted association rule model, comprising the following specific steps:

S1, preprocessing the data;

S2, selecting features by the random forest mean decrease in impurity method to obtain a preferred feature set;

S3, analyzing with the directional weighted association rule model;

S4, incorporating the risk factors contained in the strong association rules generated in S3 into a risk factor set, and discussing them with experts;

the data in step S1 comprise the following data fields:

the data columns comprise 79 risk factor features, preliminarily screened by experts and related to lifestyle and dietary habits, covering five aspects: basic information, disease conditions, living habits, eating habits and colonoscopy results;

wherein the basic information includes: name, gender, age, ethnicity, phone number, education level, local resident status, height, weight, occupation, marital status and household income;

(1) the disease conditions include: ① past medical history: history of diabetes, history of hypertension, history of coronary heart disease, history of chronic liver disease, history of chronic kidney disease, history of chronic bronchitis, history of cerebrovascular disease, history of hyperlipidemia, history of fatty liver disease, history of cholecystectomy, history of intestinal surgery, history of gastric surgery, history of esophageal surgery, history of other diseases or operations; ② current medical history: abdominal pain, abdominal distension, diarrhea, constipation, bloody stools, mucous stools, other symptoms; ③ antibiotic use;

(2) the living habits include: smoking, staying up late, exercising and going out;

(3) the eating habits include: ① frequency and cooking or processing method of seawater products: cooked fresh seawater fish, uncooked fresh or frozen fish fillets, pickled seawater fish and dried fish, spicy seawater fish and dried fish, cooked fresh seawater shrimp/crab/shellfish/snails, uncooked fresh or frozen shrimp/crab/shellfish/snails, pickled seawater shrimp/crab/shellfish/snails, liquor-preserved seawater shrimp/crab/shellfish/snails, seawater plants, and processed seawater plants such as pickled seawater plants; ② frequency and cooking or processing method of livestock and poultry meat: freshly slaughtered pork/beef/mutton/chicken/duck, freshly slaughtered animal offal, cured processed meat products, barbecued processed meat products, smoked processed meat products, and spicy processed meat products; ③ frequency and cooking method of freshwater products: fresh freshwater fish, pickled freshwater fish, spicy freshwater fish, fresh freshwater shrimp/crab/shellfish/snails, pickled freshwater shrimp/crab/shellfish/snails, and liquor-preserved freshwater shrimp/crab/shellfish/snails; ④ poultry eggs, milk and dairy products: plain milk, low-fat/skim milk, yogurt, milk powder, chicken/duck/quail eggs, and processed poultry eggs such as marinated eggs; ⑤ snacks: processed carbohydrates, processed meat, processed preserved fruit; ⑥ vegetables, melons and fruits and their cooking or processing method: fresh vegetables, pickled processed vegetables, mushrooms, melons and fresh fruits; ⑦ drinking water and beverages: tap water, mineral water, purified water, carbonated beverages, fruit juice beverages; ⑧ alcoholic drinks: low-alcohol Chinese liquor, high-alcohol Chinese liquor, red wine, yellow wine, beer, fruit wine, alcoholic beverages, and mixed drinking of several kinds of alcohol;

(4) the colonoscopy results include: examination results, examination sites and pathological results, the pathological result being used to determine whether the subject is a colorectal adenoma patient;

the data preprocessing in step S1 comprises the following steps:

S101, deleting irrelevant data;

S102, deleting redundant information, deleting feature columns in which more than 50% of the values are missing, and deleting obviously abnormal dirty data;

S103, data conversion;

step S2, selecting features by the random forest mean decrease in impurity method, comprises the following steps:

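S201, calculating the information entropy H1 of the original data: $H_1 = -\sum_{c} p_c \log_2 p_c$, where $p_c$ is the proportion of records belonging to class $c$;

S202, selecting a feature, dividing the data according to the values of that feature, calculating the information entropy of each class, and summing these entropies in proportion to class size to obtain the information entropy H2 of this division;

S203, calculating the information gain: info_gain = H1 - H2;

S204, calculating the information gain of every feature according to S202 and S203, and retaining the feature attributes with larger gain;

S205, according to the feature index corresponding to the maximum information gain, placing the top-ranked features into a set, which is taken as the preferred feature set;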

step S3 comprises the following steps:

definition: let $I = \{i_1, i_2, \ldots, i_m\}$ be the set of item attributes, and let D be a set of transactions T, where each T is a set of item attributes with $T \subseteq I$; each transaction T has a unique identifier, denoted TID; let X be a set of items in I; if $X \subseteq T$, then transaction T is said to contain X;

the weight of an item attribute $i_j$ is a value associated with that item attribute, denoted $w(i_j)$; if $P(i_j)$ is the probability that item attribute $i_j$ occurs in the transaction set D, then $w(i_j)$ is the reciprocal of $P(i_j)$; the weight of a patient transaction is the weight of one record in the patient data set, denoted $w(T_k)$, and is the average of the weights of all item attributes belonging to $T_k$, where $T_k$ is the k-th record in transaction set D;

formula (1): $w(T_k) = \frac{1}{|T_k|} \sum_{i_j \in T_k} w(i_j)$, with $w(i_j) = \frac{1}{P(i_j)}$;

S301, scanning the data table D to obtain the occurrence probability P(i_j) of each item attribute i_j, and calculating the transaction weight w(T_k) by formula (1);

S302, scanning the data table D, setting the pathology as the consequent after_item of the association rules, and putting all other features into a set Q; setting a minimum support threshold min_sup, a minimum confidence threshold min_conf, and a maximum number of iterations max_rule_length;

S303, initializing the frequent 1-item sets: every item in Q is joined with the consequent after_item, and the items whose weighted support is greater than min_sup are selected and put into L0, wherein the weighted support is calculated by formula (2);

S304, generating the frequent (k+1)-item sets from the frequent k-item sets: the frequent 1-item sets L1 are generated first, then the frequent 2-item sets L2, and so on until Lr is generated for the maximum rule length r, at which point the algorithm stops; Lk is generated from Ck as follows: for each item set in Ck, calculating its weighted support sup1 and the weighted support sup2 of the same item set with after_item removed, and putting the item sets whose sup1 is greater than min_sup into L(k+1);

S305, L = [L2, …, Lr]; for each frequent item set in L, calculating by formula (3) the ratio conf of the weighted support of the item set to the weighted support of the item set with after_item removed; calculating the lift by formula (4); if conf is greater than min_conf, outputting a strong association rule.