CN111863266B

CN111863266B - Risk factor screening method for sporadic colorectal adenoma based on a directional weighted association rule model

Info

Publication number: CN111863266B
Application number: CN202010057865.7A
Authority: CN
Inventors: 余盖青; 高俊波; 程陈; 费若岚; 王长静
Original assignee: Shanghai Maritime University
Current assignee: Shanghai Maritime University
Priority date: 2020-01-16
Filing date: 2020-01-16
Publication date: 2023-09-19
Anticipated expiration: 2040-01-16
Also published as: CN111863266A

Abstract

The invention discloses a risk factor screening method for sporadic colorectal adenoma based on a directional weighted association rule model, and belongs to the field of data mining. First, the data are preprocessed. Then, features are selected using the random forest mean-decrease-in-impurity method, with information gain used to determine the optimal split nodes, yielding a preferred feature set. Next, the preferred feature set is input into the directional weighted association rule model to generate strong association rules. Finally, the risk factors contained in the strong association rules are added to the risk factor set and discussed with experts. Compared with the prior art, the method proposes a directional weighted association rule model for screening the risk factors of colorectal adenoma, confirms the significance of lifestyle and dietary habits in the etiology of colorectal adenoma, discovers high-risk factors not identified in previous research, and offers a useful reference method for identifying risk factors of colorectal adenoma.

Description

Risk factor screening method for sporadic colorectal adenoma based on a directional weighted association rule model

Technical Field

The invention relates to medical data analysis, in particular to a risk factor screening method for sporadic colorectal adenoma based on a directional weighted association rule model.

Background

Sporadic colorectal adenomas (CRAs) are benign glandular tumors of the colon and rectum and are precancerous lesions of colorectal cancer. Early detection and timely treatment can effectively reduce the probability of malignant transformation and are important for prolonging patient survival. Studies show that CRA is closely related to lifestyle and dietary habits, and that 66%-78% of colorectal adenomas could be avoided through healthy living habits. However, some important risk factors have been overlooked or not yet discovered, so patients cannot be given effective guidance on healthy living, and this situation needs to be improved.

In recent years, more and more researchers have come to appreciate the importance of dietary habits in the etiology of colorectal adenoma and have devoted themselves to studying its risk factors. However, existing work relies on too narrow a range of analysis methods: traditional methods are reasonably effective for single-factor analysis but are imperfect, and they easily miss risk factors that occur with low probability yet are important. To overcome this problem, we propose a directional weighted association rule model, an efficient association rule mining model built by combining weighted support computed from item probabilities with a fixed consequent. Risk factors for colorectal adenoma are analyzed by generating rule patterns of colorectal adenoma onset.

Disclosure of Invention

The object of the invention is to solve the technical problems described in the Background by providing a risk factor screening method for sporadic colorectal adenoma based on a directional weighted association rule model. The technical scheme adopted by the invention is as follows:

A risk factor screening method for sporadic colorectal adenoma based on a directional weighted association rule model comprises the following specific steps:

S1, preprocessing the data;

S2, selecting features using the random forest mean-decrease-in-impurity method;

S3, analyzing the data using the directional weighted association rule model;

S4, adding the risk factors contained in the strong association rules generated in step S3 to the risk factor set and discussing them with experts.

As a further technical scheme of the invention, the data preprocessing of step S1 comprises the following steps:

S101, deleting irrelevant data;

S102, deleting redundant information, deleting feature columns in which more than 50% of the values are missing, and deleting dirty data with obvious anomalies;

S103, data conversion.
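The missing-value rule in S102 and the conversion in S103 can be illustrated with a minimal sketch. It assumes records are held as dictionaries with `None` marking a missing value, and that data conversion means encoding each field as a `column_value` item (an assumption suggested by the consequent name 'bq_1' used in the embodiment). The function names and sample rows are hypothetical.

```python
from typing import Dict, List, Optional, Set

Record = Dict[str, Optional[str]]

def drop_sparse_columns(records: List[Record], max_missing: float = 0.5) -> List[Record]:
    """Drop every column whose share of missing (None) values exceeds max_missing."""
    if not records:
        return []
    n = len(records)
    keep = [
        col for col in records[0]
        if sum(1 for r in records if r.get(col) is None) / n <= max_missing
    ]
    return [{col: r.get(col) for col in keep} for r in records]

def to_items(record: Record) -> Set[str]:
    """Assumed conversion: encode each non-missing field as a 'column_value' item."""
    return {f"{col}_{val}" for col, val in record.items() if val is not None}

# Hypothetical sample rows; 'income' is missing in 2 of 3 records (> 50%).
rows = [
    {"smoking": "1", "income": None, "bq": "1"},
    {"smoking": "0", "income": None, "bq": "0"},
    {"smoking": None, "income": "3", "bq": "0"},
]
cleaned = drop_sparse_columns(rows)
transactions = [to_items(r) for r in cleaned]
```

Representing each record as a set of items gives the transaction form that the association rule step operates on.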

As a further technical scheme of the invention, the feature selection of step S2 using the random forest mean-decrease-in-impurity method comprises the following steps:

S201, calculating the information entropy H1 of the original data;

S202, selecting a feature, partitioning the data by the values of that feature, calculating the information entropy of each partition, and summing these entropies weighted by partition proportion to obtain the information entropy H2 of this split;

S203, calculating the information gain: info_gain = H1 - H2;

S204, calculating the information gain of every feature according to S202 and S203, and retaining the feature attributes with larger gains;

S205, according to the feature index corresponding to the maximum information gain, putting the top-ranked features into a set that serves as the preferred feature set.
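Steps S201 to S205 can be sketched as a plain information gain ranking. This is a minimal illustration of the entropy arithmetic only, not a random forest implementation; the function names, the toy columns, and the cut-off `k` are assumptions introduced for the example.

```python
from collections import Counter
from math import log2
from typing import Dict, Hashable, List, Sequence

def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (base 2) of a label sequence, as in S201."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """S202-S203: H1 minus the proportion-weighted entropy H2 of the split."""
    h1 = entropy(labels)
    groups: Dict[Hashable, List[Hashable]] = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    h2 = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return h1 - h2

def top_features(columns: Dict[str, Sequence[Hashable]],
                 labels: Sequence[Hashable], k: int) -> List[str]:
    """S204-S205: rank features by gain and keep the k best (k is an assumed cut-off)."""
    gains = {name: info_gain(values, labels) for name, values in columns.items()}
    return sorted(gains, key=gains.get, reverse=True)[:k]

# Toy data: feature 'a' separates the labels perfectly, feature 'b' not at all.
labels = [1, 1, 0, 0]
columns = {"a": [1, 1, 0, 0], "b": [1, 0, 1, 0]}
gain_a = info_gain(columns["a"], labels)
gain_b = info_gain(columns["b"], labels)
best = top_features(columns, labels, k=1)
```

A feature that splits the records into pure groups drives H2 to zero and therefore recovers the full entropy H1 as gain, which is why it ranks first.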

As a further technical scheme of the invention, step S3, analysis using the directional weighted association rule model, comprises the following definitions and steps:

Definition: let $I = \{i_1, i_2, \ldots, i_m\}$ be the set of item attributes. Let $D$ denote a set of transactions $T$, where each $T$ is a set of item attributes with $T \subseteq I$, and each transaction $T$ has a unique identifier, denoted TID. Let $X$ be a set of items in $I$; if $X \subseteq T$, then transaction $T$ is said to contain $X$.

Definition: the weight of item attribute $i_j$ is a value associated with that attribute, denoted $w(i_j)$. Let $P(i_j)$ be the probability that $i_j$ occurs in transaction set $D$; $w(i_j)$ is the reciprocal of $P(i_j)$. The weight of a patient transaction is the weight of one record in the patient dataset, denoted $w(T_k)$, and equals the mean of the weights of all item attributes in $T_k$, where $T_k$ is the $k$-th record in transaction set $D$;

Formula (1):

$$w(T_k) = \frac{1}{|T_k|} \sum_{i_j \in T_k} w(i_j), \qquad w(i_j) = \frac{1}{P(i_j)}$$

Formula (2): the weighted support of the association rule A → B is denoted wsp(A, B).

Formula (3): the confidence of the association rule A → B is denoted conf(A, B).

Formula (4): the lift of the association rule A → B is denoted lift(A, B). If lift(A, B) > 1, A and B are positively correlated; if lift(A, B) < 1, A and B are negatively correlated; if lift(A, B) = 1, A and B are uncorrelated.
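The weight definitions above can be sketched directly: an item weight is the reciprocal of its occurrence probability, and a transaction weight is the mean weight of its items. The expressions for formulas (2) to (4) are not reproduced in this text, so the sketch below assumes the conventional forms: weighted support as the weight of the transactions containing an item set divided by the total transaction weight, confidence as wsp(A ∪ B) / wsp(A), and lift as conf(A, B) / wsp(B). These forms, the function names, and the toy data are assumptions, not the patent's stated formulas.

```python
from typing import Dict, FrozenSet, List, Set

def item_weights(transactions: List[Set[str]]) -> Dict[str, float]:
    """w(i_j) = 1 / P(i_j), with P(i_j) the share of transactions containing i_j."""
    n = len(transactions)
    counts: Dict[str, int] = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    return {item: n / c for item, c in counts.items()}

def transaction_weights(transactions: List[Set[str]], w: Dict[str, float]) -> List[float]:
    """w(T_k): mean weight of the items in T_k (transactions assumed non-empty)."""
    return [sum(w[i] for i in t) / len(t) for t in transactions]

def wsp(itemset: FrozenSet[str], transactions: List[Set[str]], tw: List[float]) -> float:
    """ASSUMED weighted support: weight of transactions containing itemset / total weight."""
    return sum(x for t, x in zip(transactions, tw) if itemset <= t) / sum(tw)

def conf(a: FrozenSet[str], b: FrozenSet[str], transactions, tw) -> float:
    """ASSUMED confidence of A -> B: wsp(A and B together) / wsp(A)."""
    base = wsp(a, transactions, tw)
    return wsp(a | b, transactions, tw) / base if base else 0.0

def lift(a: FrozenSet[str], b: FrozenSet[str], transactions, tw) -> float:
    """ASSUMED lift of A -> B: conf(A, B) / wsp(B)."""
    return conf(a, b, transactions, tw) / wsp(b, transactions, tw)

# Hypothetical toy transactions; 'bq_1' marks a record diagnosed with adenoma.
data = [{"x", "bq_1"}, {"x", "bq_1"}, {"y", "bq_0"}, {"x", "bq_0"}]
w = item_weights(data)
tw = transaction_weights(data, w)
A, B = frozenset({"x"}), frozenset({"bq_1"})
c = conf(A, B, data, tw)
l = lift(A, B, data, tw)
```

Because rare items receive large weights, a record containing a rare attribute contributes more to weighted support than a record made up of common attributes, which is how low-probability factors avoid being pruned.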

S301, scanning database D to obtain the occurrence probability of each item attribute $i_j$ and calculating the weight $w(T_k)$ (see formula (1));

S302, scanning database D, setting the pathology result as the consequent after_item of the association rule, and putting all other features into a set Q; setting the minimum support threshold min_sup, the minimum confidence threshold min_conf, and the maximum number of iterations max_rule_length;

S303, initializing the frequent 1-item sets: every item in Q is joined with the consequent after_item, and the items whose weighted support is greater than min_sup are put into L0 (for the weighted support calculation, see formula (2));

S304, generating frequent (k+1)-item sets from the frequent k-item sets. The core method is recursive and based on frequent-set theory: the frequent 1-item set L1 is generated first, then the frequent 2-item set L2, and so on until Lr is generated for the maximum rule length r, at which point the algorithm stops. In the k-th iteration, the process first generates the set Ck of candidate k-item sets, each of which is produced by a self-join of L(k-1). The item sets in Ck are candidates for frequent item sets, and the resulting frequent item set Lk must be a subset of Ck. Lk is generated from Ck as follows: calculate the weighted support sup1 of each item set in Ck and the weighted support sup2 of that item set with after_item removed, and put the item sets whose weighted support sup1 is greater than min_sup into L(k+1).

S305, L = [L2, …, Lr]. For each frequent item set in L, calculate the confidence conf (see formula (3)) and the lift (see formula (4)) of the rule (L - after_item) → after_item, where conf is the ratio of the corresponding weighted supports; if conf is greater than min_conf, output the strong association rule.

Compared with the prior art, the invention has the following advantages and positive effects:

1. The invention provides a risk factor screening method for sporadic colorectal adenoma based on a directional weighted association rule model. It improves both the way support is calculated and the way the consequent is generated, which helps reduce wasted computation, increases the number of useful rules generated, and improves the mining results.

2. The invention constructs a preferred feature set, which helps improve the accuracy of the analysis results and shortens the computation.

3. For lifestyle and dietary habit data, the invention analyzes the high-risk factors of colorectal adenoma by mining the associations between these data and the occurrence of colorectal adenoma, and provides a reference method for screening the risk factors of colorectal adenoma.

Drawings

FIG. 1 is a schematic flow diagram of the present invention;

FIG. 2 is a feature selection flow chart of the present invention;

Detailed Description

The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

Embodiment 1: referring to FIG. 1, a risk factor screening method for sporadic colorectal adenoma based on a directional weighted association rule model comprises the following specific steps:

S1, preprocessing the colorectal adenoma data: deleting irrelevant data, deleting redundant information, deleting feature columns in which more than 50% of the values are missing, and deleting dirty data with obvious anomalies. A total of 234 records were included in the standardized dataset, 62 of which were diagnosed with colorectal adenoma.

S2, referring to FIG. 2, feature selection is performed using the random forest mean-decrease-in-impurity method.

(1) Calculating the information entropy of the original data gives the initial information entropy H1 = 0.8341351937.

(2) The information entropy H2 is calculated, taking the splits by features 7 and 24 as examples:

H2 (classified by feature 7) = 0.8283984298779227;

H2 (classified by feature 24) = 0.7903757392936914.

(3) The information gain info_gain is calculated, taking the splits by features 7 and 24 as examples:

info_gain (classified by feature 7) = H1 - H2 (classified by feature 7) = 0.8341351937 - 0.8283984298 = 0.0057367638;

info_gain (classified by feature 24) = H1 - H2 (classified by feature 24) = 0.8341351937 - 0.7903757392 = 0.0437594544.

(4) The feature attributes with larger gains are retained, and the feature index corresponding to the optimal information gain is found to be 24. The preferred feature set consists of the top 24 features in the feature importance ranking.
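The arithmetic in steps (1) to (3) can be checked with a few lines. The sketch assumes the pathology label is binary, with the 62 adenoma cases among the 234 records reported in S1; under that assumption the entropy evaluates to roughly 0.8341, in line with the H1 value used in step (3). The H2 values are the ones stated in step (2); the variable names are hypothetical.

```python
from math import log2

def binary_entropy(positives: int, total: int) -> float:
    """Entropy (base 2) of a two-class label with the given positive count."""
    p = positives / total
    q = 1.0 - p
    return -(p * log2(p) + q * log2(q))

# Assumption: 62 adenoma cases out of 234 records, as reported in S1.
h1 = binary_entropy(62, 234)

# H2 values as stated in step (2) for the splits by features 7 and 24.
h2 = {7: 0.8283984298779227, 24: 0.7903757392936914}

# Information gain per feature: H1 - H2.
gains = {feature: h1 - value for feature, value in h2.items()}
best_feature = max(gains, key=gains.get)
```

Of the two example features, feature 24 has the smaller H2 and therefore the larger gain.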

S3, analyzing the data using the directional weighted association rule model.

After repeated experiments, the following parameters were selected: a maximum of 5 mined items per rule, the consequent 'bq_1' (pathology equal to 1, that is, the subject has colorectal adenoma), a minimum weighted support of 0.3, and a minimum confidence of 0.5. The preferred feature set is input into the directional weighted association rule model, which generates 44 rule patterns of colorectal adenoma onset.

S4, the risk factors contained in the strong association rules generated in step S3 are added to the risk factor set and discussed with experts. The 44 rules contain 7 important features, including some traditional risk factors and some non-traditional ones, which supports the effectiveness and correctness of the method.
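The mining procedure with a fixed consequent can be sketched as follows, using the parameter values quoted above (consequent 'bq_1', minimum weighted support 0.3, minimum confidence 0.5, maximum length 5). This is a minimal sketch under stated assumptions, not the patented implementation: weighted support is assumed to be the weight of matching transactions divided by the total weight, confidence is assumed to be sup1 / sup2, lift is assumed to be conf divided by the consequent's weighted support, and `max_rule_length` is assumed to count antecedent items. The function name and toy transactions are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

Rule = Tuple[FrozenSet[str], float, float, float]  # antecedent, wsp, conf, lift

def mine_rules(transactions: List[Set[str]], after_item: str, min_sup: float,
               min_conf: float, max_rule_length: int) -> List[Rule]:
    n = len(transactions)
    # S301: item probabilities -> item weights -> transaction weights.
    counts: Dict[str, int] = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    w = {item: n / c for item, c in counts.items()}
    tw = [sum(w[i] for i in t) / len(t) for t in transactions]
    total = sum(tw)

    def wsp(itemset: FrozenSet[str]) -> float:
        # ASSUMED weighted support: matching transaction weight / total weight.
        return sum(x for t, x in zip(transactions, tw) if itemset <= t) / total

    # S302: the consequent is fixed; every other item goes into the pool Q.
    consequent = frozenset([after_item])
    q = sorted(set(counts) - {after_item})
    # S303: keep single items that are frequent together with the consequent.
    level = [frozenset([i]) for i in q if wsp(frozenset([i]) | consequent) > min_sup]

    rules: List[Rule] = []
    size = 1
    while level and size <= max_rule_length:
        # S305: score each antecedent against the fixed consequent.
        for antecedent in level:
            sup1 = wsp(antecedent | consequent)  # with the consequent
            sup2 = wsp(antecedent)               # consequent removed
            conf = sup1 / sup2
            if conf > min_conf:
                rules.append((antecedent, sup1, conf, conf / wsp(consequent)))
        # S304: self-join the current level to build candidates one item longer.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == size + 1}
        level = [c for c in candidates if wsp(c | consequent) > min_sup]
        size += 1
    return rules

# Hypothetical toy transactions; real input would be the preferred feature set.
data = [{"x", "z", "bq_1"}, {"x", "z", "bq_1"}, {"x", "bq_1"},
        {"y", "bq_0"}, {"y", "z", "bq_0"}, {"x", "bq_0"}]
rules = mine_rules(data, after_item="bq_1", min_sup=0.3, min_conf=0.5, max_rule_length=5)
```

Fixing the consequent means candidates that cannot co-occur with the pathology item are pruned at the first level, so no support is computed for rules that point at anything other than adenoma onset.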

The above embodiment is only an example based on part of the data of the invention and is not intended to limit the invention. Any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A risk factor screening method for sporadic colorectal adenoma based on a directional weighted association rule model, comprising the following specific steps:

S1, preprocessing the data;

S2, selecting features using the random forest mean-decrease-in-impurity method to obtain a preferred feature set;

S3, analyzing the data using the directional weighted association rule model;

S4, adding the risk factors contained in the strong association rules generated in S3 to the risk factor set and discussing them with experts;

the data in step S1 includes the following data fields:

the data column fields comprise 79 risk factor features related to lifestyle and dietary habits, preliminarily screened by experts, in five aspects: basic information, disease status, lifestyle habits, eating habits, and colonoscopy results;

wherein the basic information includes: name, gender, age, ethnicity, telephone number, education level, local residency, height, weight, occupation, marital status, and household income;

(1) The disease status includes: (a) past medical history: history of diabetes, hypertension, coronary heart disease, chronic liver disease, chronic kidney disease, chronic bronchitis, cerebrovascular disease, hyperlipidemia, fatty liver, cholecystectomy, intestinal surgery, gastric surgery, esophageal surgery, and other diseases or surgery; (b) current medical history: abdominal pain, abdominal distension, diarrhea, constipation, bloody stool, mucous stool, and other symptoms; (c) antibiotic use;

(2) Lifestyle habits include: smoking, staying up late, exercise, and going out;

(3) Eating habits include: (a) frequency and cooking method of seafood: fresh sea fish, fresh frozen fish fillets, salted sea fish and dried fish, spicy sea fish and dried fish, fresh sea shrimp/crab/shellfish/snails, fresh frozen shrimp/crab/shellfish/snails, salted sea shrimp/crab/shellfish/snails, drunken sea shrimp/crab/shellfish/snails, marine plants, salted marine plants; (b) frequency and cooking method of livestock and poultry meat: freshly slaughtered pork/beef/mutton/chicken/duck, freshly slaughtered animal viscera, cured processed meat products, spicy processed meat products; (c) frequency and cooking method of freshwater products: fresh freshwater fish, salted freshwater fish, spicy freshwater fish, fresh freshwater shrimp/crab/shellfish/snails, salted freshwater shrimp/crab/shellfish/snails, drunken freshwater shrimp/crab/shellfish/snails; (d) eggs, milk, and dairy products: regular milk, low-fat/skimmed milk, yogurt, milk powder, hen eggs/duck eggs/quail eggs, and cured processed eggs; (e) snacks: processed carbohydrates, processed meat, processed preserved fruit; (f) vegetables, melons, and fruits and their cooking methods: fresh vegetables, pickled vegetables, mushrooms, melons, and fresh fruits; (g) drinking water and beverages: tap water, mineral water, purified water, carbonated beverages, and fruit juice beverages; (h) alcoholic drinks: low-alcohol Chinese liquor, high-alcohol Chinese liquor, red wine, yellow wine, beer, fruit wine, alcoholic beverages, and mixed drinking of several kinds of alcohol;

(4) The colonoscopy results include: examination results, examination sites, and pathology results; the pathology result is used to determine whether the subject is a colorectal adenoma patient;

the data preprocessing in step S1 comprises the following steps:

S101, deleting irrelevant data;

S102, deleting redundant information, deleting feature columns in which more than 50% of the values are missing, and deleting dirty data with obvious anomalies;

S103, data conversion;

the feature selection in step S2 using the random forest mean-decrease-in-impurity method comprises the following steps:

S201, calculating the information entropy H1 of the original data;

S203, calculating the information gain: info_gain = H1 - H2;

S204, calculating the information gain of every feature according to S202 and S203, and retaining the feature attributes with larger gains;

S205, according to the feature index corresponding to the maximum information gain, putting the top-ranked features into a set that serves as the preferred feature set;

step S3 comprises the following definitions and steps:

Definition: let $I = \{i_1, i_2, \ldots, i_m\}$ be the set of item attributes, and let $D$ denote a set of transactions $T$, where each $T$ is a set of item attributes with $T \subseteq I$ and each transaction $T$ has a unique identifier, denoted TID; let $X$ be a set of items in $I$; if $X \subseteq T$, then transaction $T$ is said to contain $X$;

the weight of item attribute $i_j$ is a value associated with that attribute, denoted $w(i_j)$; let $P(i_j)$ be the probability that $i_j$ occurs in transaction set $D$; $w(i_j)$ is the reciprocal of $P(i_j)$; the weight of a patient transaction is the weight of one record in the patient dataset, denoted $w(T_k)$, and equals the mean of the weights of all item attributes in $T_k$, where $T_k$ is the $k$-th record in transaction set $D$;

Formula (1):

$$w(T_k) = \frac{1}{|T_k|} \sum_{i_j \in T_k} w(i_j), \qquad w(i_j) = \frac{1}{P(i_j)}$$

S301, scanning data table D to obtain the occurrence probability of each item attribute $i_j$ and calculating the weight $w(T_k)$ by formula (1);

S302, scanning data table D, setting the pathology result as the consequent after_item of the association rule, and putting all other features into a set Q; setting the minimum support threshold min_sup, the minimum confidence threshold min_conf, and the maximum number of iterations max_rule_length;

S303, initializing the frequent 1-item sets: every item in Q is joined with the consequent after_item, and the items whose weighted support is greater than min_sup are put into L0, wherein the weighted support is calculated by formula (2);

S304, generating frequent (k+1)-item sets from the frequent k-item sets: the frequent 1-item set L1 is generated first, then the frequent 2-item set L2, and so on until Lr is generated for the maximum rule length r, at which point the algorithm stops; Lk is generated from Ck as follows: calculate the weighted support sup1 of each item set in Ck and the weighted support sup2 of that item set with after_item removed, and put the item sets whose weighted support sup1 is greater than min_sup into L(k+1);

S305, L = [L2, …, Lr]; for each frequent item set in L, calculate by formula (3) the confidence conf of the rule (L - after_item) → after_item, where conf is the ratio of the corresponding weighted supports; calculate the lift by formula (4); if conf is greater than min_conf, output the strong association rule.