CN111863266A - Method for screening risk factors of sporadic colorectal adenomas based on directional weighted association rule model - Google Patents
Method for screening risk factors of sporadic colorectal adenomas based on directional weighted association rule model
- Publication number
- CN111863266A
- Authority
- CN
- China
- Prior art keywords
- item
- history
- association rule
- data
- risk factors
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Public Health (AREA)
- Biomedical Technology (AREA)
- Databases & Information Systems (AREA)
- Pathology (AREA)
- Epidemiology (AREA)
- General Health & Medical Sciences (AREA)
- Primary Health Care (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for screening risk factors of sporadic colorectal adenomas based on a directional weighted association rule model, and belongs to the field of data mining. The method first preprocesses the data. It then extracts features using the random forest mean decrease in impurity feature selection method, determining the optimal split nodes by information gain to obtain a preferred feature set. The preferred feature set is then input into the directional weighted association rule model to generate strong association rules. Finally, the risk factors contained in the strong association rules are added to a risk factor set and reviewed with experts. Compared with the prior art, the invention provides a directional weighted association rule model for screening the risk factors of colorectal adenomas, confirms the importance of lifestyle and dietary habit factors in the etiology of colorectal adenomas, identifies high-risk factors not detected in previous research, and provides a reference method for finding the risk factors of colorectal adenomas.
Description
Technical Field
The invention relates to medical data analysis, in particular to a method for screening risk factors of sporadic colorectal adenomas based on a directional weighted association rule model.
Background
Sporadic colorectal adenomas (CRA) are benign glandular tumors of the colon and rectum and are precursor lesions of colorectal cancer. Early detection and timely treatment can effectively reduce the probability of malignant transformation and are important for prolonging patient survival. Survey studies have found that CRA is closely related to dietary habits, and that 66% to 78% of colorectal adenomas can be avoided through a healthy lifestyle. However, some important risk factors are still ignored or undiscovered, so patients cannot be effectively guided toward a healthier lifestyle that would improve their condition.
In recent years, more and more researchers have recognized the importance of lifestyle factors in the etiology of colorectal adenomas and have taken up the study of risk factors for colorectal adenomas. However, the methods used for risk factor analysis lack variety: traditional methods are reasonably effective for single-factor analysis but are incomplete, and they easily miss important risk factors that occur with low probability. To overcome these problems, a directional weighted association rule model is provided. It is an efficient association rule mining model that combines a weighted support computed from occurrence probabilities with a fixed consequent. The risk factors for colorectal adenomas are analyzed by generating rule patterns of colorectal adenoma onset.
Disclosure of Invention
The object of the invention is to provide a method for screening risk factors of sporadic colorectal adenomas based on a directional weighted association rule model, in order to solve the technical problems described in the background. The technical scheme adopted by the invention is as follows:
A method for screening risk factors of sporadic colorectal adenomas based on a directional weighted association rule model comprises the following specific steps:
S1, preprocessing the data;
S2, selecting features using the random forest mean decrease in impurity method;
S3, analyzing with the directional weighted association rule model;
S4, incorporating the risk factors contained in the strong association rules generated in S3 into a risk factor set, and consulting experts.
As a further technical scheme of the invention, step S1, data preprocessing, comprises the following steps:
S101, deleting irrelevant data;
S102, deleting redundant information, deleting feature columns in which more than 50% of the values are missing, and deleting dirty data that is obviously abnormal;
S103, data conversion.
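Steps S102 and S103 can be sketched as follows. This is a minimal illustration, not the patented implementation: the helper names `drop_sparse_columns` and `to_items`, the toy records, and the `column_value` item encoding (chosen to match the item name 'bq_1' used for the pathology in the embodiment) are all assumptions.

```python
from typing import Dict, FrozenSet, List, Optional

Record = Dict[str, Optional[str]]


def drop_sparse_columns(records: List[Record], max_missing: float = 0.5) -> List[Record]:
    """S102 (assumed form): drop every column whose share of missing (None) values exceeds max_missing."""
    if not records:
        return []
    n = len(records)
    keep = [
        col for col in records[0]
        if sum(1 for r in records if r.get(col) is None) / n <= max_missing
    ]
    return [{col: r.get(col) for col in keep} for r in records]


def to_items(record: Record) -> FrozenSet[str]:
    """S103 (assumed form): encode each non-missing field as a 'column_value' item."""
    return frozenset(f"{col}_{val}" for col, val in record.items() if val is not None)


# Hypothetical toy records; the column names are illustrative only.
records = [
    {"smoking": "1", "antibiotics": None, "bq": "1"},
    {"smoking": "0", "antibiotics": None, "bq": "0"},
    {"smoking": "1", "antibiotics": "1", "bq": "1"},
]
cleaned = drop_sparse_columns(records)          # 'antibiotics' is missing in 2 of 3 records
transactions = [to_items(r) for r in cleaned]   # one item set per patient record
```

Encoding each record as a set of items is what lets the later association rule steps treat a patient record as a transaction.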
As a further technical scheme of the invention, step S2, selecting features using the random forest mean decrease in impurity method, comprises the following steps:
S202, selecting a feature, partitioning the data by the values of that feature, calculating the information entropy of each partition, and summing these entropies weighted by the proportion of records in each partition to obtain the information entropy H2 of the partitioning;
S203, calculating the information gain: info_gain = H1 - H2, where H1 is the information entropy of the original data;
S204, calculating the information gain of every feature according to S202 and S203, and retaining the feature attributes with larger gain;
S205, putting the top-ranked features into a set, starting from the feature index with the maximum information gain, and using this set as the preferred feature set.
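The entropy and information gain steps above can be sketched in a few lines. Note that this sketch follows the per-feature information gain described in S202 to S205 directly and does not train a random forest; the function names and the toy features `f_a` and `f_b` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (base 2) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def split_entropy(values: Sequence[int], labels: Sequence[int]) -> float:
    """H2: entropy of each partition induced by a feature, weighted by partition size."""
    groups: Dict[int, List[int]] = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    n = len(labels)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def rank_by_info_gain(features: Dict[str, List[int]], labels: List[int]) -> List[Tuple[str, float]]:
    """Return (feature, info_gain) pairs sorted from largest to smallest gain."""
    h1 = entropy(labels)  # entropy of the original data
    gains = {name: h1 - split_entropy(col, labels) for name, col in features.items()}
    return sorted(gains.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy data: f_a separates the classes perfectly, f_b carries no information.
labels = [1, 1, 0, 0]
features = {"f_a": [1, 1, 0, 0], "f_b": [1, 0, 1, 0]}
ranking = rank_by_info_gain(features, labels)
```

Taking the first entries of `ranking` would then correspond to building the preferred feature set in S205.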
As a further technical scheme of the invention, step S3, analyzing with the directional weighted association rule model, uses the following definitions and steps:
Definition: Let $I = \{i_1, i_2, \ldots, i_m\}$ be the set of item attributes. Let $D$ be a set of transactions $T$, where each $T$ is a set of item attributes with $T \subseteq I$. Each transaction $T$ has a unique identifier, denoted TID. Let $X$ be a set of items in $I$; if $X \subseteq T$, then transaction $T$ is said to contain $X$.
Definition: The weight of an item attribute $i_j$ is a value associated with that item attribute, denoted $w(i_j)$. Let $P(i_j)$ be the probability that $i_j$ occurs in the transaction set $D$; $w(i_j)$ is the reciprocal of $P(i_j)$. The weight of a patient transaction is the weight of one record in the patient data set, denoted $w(T_k)$, and is the average of the weights of all item attributes belonging to $T_k$, where $T_k$ is the $k$-th record in the transaction set $D$.
Formula (1): $w(i_j) = \dfrac{1}{P(i_j)}, \qquad w(T_k) = \dfrac{1}{|T_k|} \sum_{i_j \in T_k} w(i_j)$
Formula (2): the weighted support of the association rule A → B is denoted wsp(A, B).
Formula (3): the confidence of the association rule A → B is denoted conf(A, B).
Formula (4): the lift of the association rule A → B is denoted lift(A, B). lift(A, B) > 1 indicates that A and B are positively correlated, lift(A, B) < 1 indicates that they are negatively correlated, and lift(A, B) = 1 indicates that they are independent.
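The definitions above can be made concrete with a short sketch. The item and transaction weights follow the text (reciprocal of the occurrence probability, then the mean over a record). The exact expressions for weighted support, confidence and lift are not legible in this text, so the forms used here are assumptions: weighted support is taken as the weight of the transactions containing the item set divided by the total transaction weight, confidence as wsp(A ∪ B) / wsp(A), and lift as conf(A, B) / wsp(B). All function names and the toy data are hypothetical.

```python
from typing import Dict, FrozenSet, List

Transaction = FrozenSet[str]


def item_weights(transactions: List[Transaction]) -> Dict[str, float]:
    """w(i_j) = 1 / P(i_j), with P(i_j) the share of transactions containing i_j."""
    n = len(transactions)
    counts: Dict[str, int] = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    return {item: n / c for item, c in counts.items()}


def transaction_weights(transactions: List[Transaction], w_item: Dict[str, float]) -> List[float]:
    """w(T_k) = mean of the weights of the items in T_k (0.0 for an empty record)."""
    return [sum(w_item[i] for i in t) / len(t) if t else 0.0 for t in transactions]


def wsp(itemset: Transaction, transactions: List[Transaction], w_trans: List[float]) -> float:
    """Assumed weighted support: weight of transactions containing itemset over total weight."""
    covered = sum(w for t, w in zip(transactions, w_trans) if itemset <= t)
    return covered / sum(w_trans)


def conf(a: Transaction, b: Transaction, transactions: List[Transaction], w_trans: List[float]) -> float:
    """Assumed confidence of A -> B: wsp(A ∪ B) / wsp(A)."""
    return wsp(a | b, transactions, w_trans) / wsp(a, transactions, w_trans)


def lift(a: Transaction, b: Transaction, transactions: List[Transaction], w_trans: List[float]) -> float:
    """Assumed lift of A -> B: conf(A, B) / wsp(B)."""
    return conf(a, b, transactions, w_trans) / wsp(b, transactions, w_trans)


# Hypothetical toy transactions in 'column_value' form.
transactions = [
    frozenset({"smoke_1", "bq_1"}),
    frozenset({"smoke_1", "bq_1"}),
    frozenset({"smoke_0", "bq_0"}),
    frozenset({"smoke_1", "bq_0"}),
]
w_item = item_weights(transactions)
w_trans = transaction_weights(transactions, w_item)
a, b = frozenset({"smoke_1"}), frozenset({"bq_1"})
support = wsp(a | b, transactions, w_trans)
confidence = conf(a, b, transactions, w_trans)
rule_lift = lift(a, b, transactions, w_trans)
```

Because rare items receive larger weights, a record containing an uncommon attribute contributes more to the weighted support than it would under a plain count, which is the stated motivation for not missing low-probability risk factors.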
S301, scanning the database D to obtain the weight $w(i_j)$ of each item attribute $i_j$, and calculating the transaction weights $w(T_k)$ (see formula (1) for the calculation);
S302, scanning the database D, setting the pathology as the consequent after_item of the association rules, and putting all other features into a set Q; setting a minimum support threshold min_sup, a minimum confidence threshold min_conf and a maximum number of cycles max_rule_length;
S303, initializing the frequent 1-item set: each item in Q is joined with the consequent after_item, and the item sets whose weighted support is greater than min_sup are put into L0 (see formula (2) for the weighted support);
S304, generating the frequent (k+1)-item sets from the frequent k-item sets. The core method is a recursive method based on frequent-set theory: first generate the frequent 1-item sets L1, then the frequent 2-item sets L2, and so on until Lr is generated for the maximum rule length r, at which point the algorithm stops. In the k-th cycle, the process first generates the set Ck of candidate k-item sets, each of which is produced by a self-join of Lk-1. The item sets in Ck are the candidates for the frequent item sets, and the final frequent item set Lk must be a subset of Ck. Lk is generated from Ck as follows: for each item set in Ck, calculate its weighted support sup1 and the weighted support sup2 of the same item set with after_item removed, and put the item sets whose sup1 is greater than min_sup into L(k+1).
S305, L = [L2, …, Lr]; for each frequent item set in L, calculating conf (see formula (3)), the ratio of the weighted supports obtained with and without after_item, and lift (see formula (4)), and outputting a strong association rule if conf is greater than min_conf.
Compared with the prior art, the invention has the following advantages or positive effects:
1. The invention provides a method for screening risk factors of sporadic colorectal adenomas based on a directional weighted association rule model.
2. The method first constructs a preferred feature set, which helps improve the accuracy of the analysis results and shorten the computation.
3. Working from lifestyle and dietary habit data, the invention analyzes the high-risk factors of colorectal adenomas by mining the associations between these data and colorectal adenomas, and provides a useful reference method for screening the risk factors of colorectal adenomas.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a flow chart of feature selection in accordance with the present invention;
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In a first embodiment, referring to fig. 1, a method for screening risk factors of sporadic colorectal adenomas based on a directional weighted association rule model includes the following specific steps:
S1, preprocessing the colorectal adenoma data: deleting irrelevant data, deleting redundant information, deleting feature columns in which more than 50% of the values are missing, and deleting dirty data that is obviously abnormal. A total of 234 cases were included in the standard dataset, of which 62 were diagnosed with colorectal adenoma.
S2, referring to fig. 2, selecting features using the random forest mean decrease in impurity method.
(1) Calculate the information entropy of the original data; the initial information entropy is H1 = 0.8341351937.
(2) Calculate the information entropy H2, taking classification by features 7 and 24 as examples:
H2 (classified by feature 7) = 0.8283984298779227;
H2 (classified by feature 24) = 0.7903757392936914.
(3) Calculate the information gain info_gain, again taking features 7 and 24 as examples:
info_gain (classified by feature 7) = H1 - H2 (classified by feature 7) = 0.8341351937 - 0.8283984298 = 0.0057367638;
info_gain (classified by feature 24) = H1 - H2 (classified by feature 24) = 0.8341351937 - 0.7903757392 = 0.0437594544.
(4) Retain the feature attributes with larger gain; the feature index with the largest information gain is 24. The preferred feature set consists of the top 24 features ranked by feature importance.
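The figures in this example can be checked with a few lines of arithmetic. The sketch assumes that H1 is the binary entropy of the pathology label over the 234 cases, 62 of them positive; under that assumption the computed value should agree with the H1 = 0.8341351937 used in the gain calculations to about six decimal places. The H2 values are copied from step (2), and the names `binary_entropy`, `h2` and `gains` are illustrative.

```python
import math


def binary_entropy(positives: int, total: int) -> float:
    """Shannon entropy (base 2) of a two-class label with the given positive count."""
    p = positives / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


# Assumption: H1 is the entropy of the adenoma label, 62 positive cases out of 234.
h1 = binary_entropy(62, 234)

# H2 values as stated in step (2), keyed by feature index.
h2 = {7: 0.8283984298779227, 24: 0.7903757392936914}

gains = {feature: h1 - value for feature, value in h2.items()}
best_feature = max(gains, key=gains.get)  # index of the feature with the largest gain
```

Feature 24 has the smaller H2 and therefore the larger gain, which is consistent with it being reported as the best feature index.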
S3, analyzing with the directional weighted association rule model.
After repeated experiments, the experimental parameters were selected as follows: the maximum number of mined items is 5, the consequent is 'bq_1' (pathology = 1, i.e. colorectal adenoma is present), the minimum weighted support is 0.3, and the minimum confidence is 0.5. Inputting the preferred feature set into the directional weighted association rule model generates 44 rule patterns of colorectal adenoma onset.
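A compact sketch of how steps S301 to S305 might fit together with these parameters is given below. It is an illustration under stated assumptions rather than the patented implementation: weighted support is assumed to be the weight of the covering transactions divided by the total transaction weight, confidence is taken as sup1 / sup2, lift as confidence divided by the weighted support of the consequent, and the maximum length is assumed to count the consequent. The function name `mine_directed_rules` and the toy items `smoke_*` and `late_*` are hypothetical.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

Rule = Tuple[Tuple[str, ...], float, float, float]  # (antecedent, wsp, conf, lift)


def mine_directed_rules(transactions: List[FrozenSet[str]], after_item: str,
                        min_sup: float, min_conf: float, max_rule_length: int) -> List[Rule]:
    # S301: item weights 1/P(i_j) and transaction weights (mean item weight per record).
    n = len(transactions)
    counts: Dict[str, int] = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    w_item = {item: n / c for item, c in counts.items()}
    w_trans = [sum(w_item[i] for i in t) / len(t) if t else 0.0 for t in transactions]
    total = sum(w_trans)
    if total == 0:
        return []

    def wsp(itemset: FrozenSet[str]) -> float:
        # Assumed weighted support: covered transaction weight over total weight.
        return sum(w for t, w in zip(transactions, w_trans) if itemset <= t) / total

    # S302: every item except the fixed consequent is a candidate antecedent item.
    q = sorted(set(counts) - {after_item})
    # S303: join each item with the consequent and keep the frequent ones.
    level = [s for s in (frozenset({i, after_item}) for i in q) if wsp(s) > min_sup]
    frequent = list(level)
    # S304: self-join the current level; every candidate still contains the consequent.
    while level and len(level[0]) < max_rule_length:
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == len(a) + 1}
        level = [c for c in candidates if wsp(c) > min_sup]
        frequent.extend(level)
    # S305: keep rules whose confidence sup1 / sup2 exceeds min_conf.
    sup_after = wsp(frozenset({after_item}))
    rules: List[Rule] = []
    for itemset in frequent:
        antecedent = itemset - {after_item}
        sup1, sup2 = wsp(itemset), wsp(antecedent)
        confidence = sup1 / sup2
        if confidence > min_conf:
            rules.append((tuple(sorted(antecedent)), sup1, confidence, confidence / sup_after))
    return sorted(rules)


# Hypothetical toy transactions; 'bq_1' marks a colorectal adenoma case.
transactions = [
    frozenset({"smoke_1", "late_1", "bq_1"}),
    frozenset({"smoke_1", "late_0", "bq_1"}),
    frozenset({"smoke_0", "late_1", "bq_0"}),
    frozenset({"smoke_1", "late_1", "bq_1"}),
    frozenset({"smoke_0", "late_0", "bq_0"}),
]
rules = mine_directed_rules(transactions, "bq_1", min_sup=0.3, min_conf=0.5, max_rule_length=5)
```

Fixing the consequent means every candidate item set contains 'bq_1', so the search space is much smaller than in an undirected Apriori pass and every output rule points at adenoma onset.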
S4, incorporating the risk factors contained in the strong association rules generated in S3 into a risk factor set, and consulting experts. The 44 rules contain 7 important features, which include both traditional and non-traditional risk factors, supporting the effectiveness and correctness of the method.
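Collecting the risk factor set in S4 amounts to taking the union of the antecedent items of the strong rules. A minimal sketch follows; the function name `collect_risk_factors` and the example rule antecedents are hypothetical and do not come from the 44 rules reported here.

```python
from typing import Iterable, List, Set, Tuple


def collect_risk_factors(antecedents: Iterable[Tuple[str, ...]]) -> List[str]:
    """Union of all items that appear on the left-hand side of any strong rule."""
    factors: Set[str] = set()
    for antecedent in antecedents:
        factors.update(antecedent)
    return sorted(factors)


# Hypothetical rule antecedents, each implying the fixed consequent 'bq_1'.
strong_rule_antecedents = [("late_1",), ("late_1", "smoke_1"), ("smoke_1",)]
risk_factors = collect_risk_factors(strong_rule_antecedents)
```

The resulting list is the candidate set that would then be reviewed with clinical experts.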
The above description is only exemplary of the present invention and should not be taken as limiting the invention, and any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (1)
1. A method for screening risk factors of sporadic colorectal adenomas based on a directional weighted association rule model, comprising the following specific steps:
S1, preprocessing the data;
S2, selecting features using the random forest mean decrease in impurity method to obtain a preferred feature set;
S3, analyzing with the directional weighted association rule model;
S4, incorporating the risk factors contained in the strong association rules generated in S3 into a risk factor set, and consulting experts;
the data in step S1 comprise the following data fields:
the data columns comprise 79 risk factor features related to lifestyle and dietary habits, preliminarily screened by experts, covering five aspects: basic information, disease status, living habits, dietary habits and colonoscopy results;
wherein the basic information includes: name, gender, age, ethnicity, phone number, education level, whether a local resident, height, weight, occupation, marital status and household income;
(1) the disease status includes: (i) past medical history: history of diabetes, history of hypertension, history of coronary heart disease, history of chronic liver disease, history of chronic kidney disease, history of chronic bronchitis, history of cerebrovascular disease, history of hyperlipidemia, history of fatty liver disease, history of cholecystectomy, history of intestinal surgery, history of gastric surgery, history of esophageal surgery, history of other diseases or operations; (ii) present medical history: abdominal pain, abdominal distension, diarrhea, constipation, bloody stools, mucous stools, other symptoms; (iii) antibiotic use;
(2) the living habits include: smoking, staying up late, exercising and going out;
(3) the dietary habits include: (i) frequency and cooking or processing method of seafood: cooked fresh marine fish, uncooked fresh or frozen fish fillets, pickled marine fish and dried fish, spicy marine fish and dried fish, cooked fresh marine shrimp/crab/shellfish/snails, uncooked fresh or frozen shrimp/crab/shellfish/snails, pickled marine shrimp/crab/shellfish/snails, liquor-preserved marine shrimp/crab/shellfish/snails, marine plants, and processed marine plants such as pickled marine plants; (ii) frequency and cooking or processing method of livestock and poultry meat: freshly slaughtered pork/beef/mutton/chicken/duck, freshly slaughtered animal offal, cured processed meat products, barbecued processed meat products, smoked processed meat products, and spicy processed meat products; (iii) frequency and cooking method of freshwater products: fresh freshwater fish, pickled freshwater fish, spicy freshwater fish, fresh freshwater shrimp/crab/shellfish/snails, pickled freshwater shrimp/crab/shellfish/snails, and liquor-preserved freshwater shrimp/crab/shellfish/snails; (iv) poultry eggs, milk and dairy products: plain milk, low-fat/skim milk, yogurt, milk powder, chicken/duck/quail eggs, and processed poultry eggs such as marinated eggs; (v) snacks: processed carbohydrates, processed meat, processed preserved fruit; (vi) vegetables, melons and fruits and their cooking or processing methods: fresh vegetables, pickled processed vegetables, mushrooms, melons and fresh fruits; (vii) drinking water and beverages: tap water, mineral water, purified water, carbonated beverages, fruit juice beverages; (viii) alcoholic drinks: low-alcohol Chinese liquor, high-alcohol Chinese liquor, red wine, yellow wine, beer, fruit wine, alcoholic beverages, and mixed drinking of various alcoholic drinks;
(4) the colonoscopy results include: examination results, examination sites and pathological results; the pathological results are used to determine whether a subject is a colorectal adenoma patient;
the data preprocessing in step S1 comprises the following steps:
S101, deleting irrelevant data;
S102, deleting redundant information, deleting feature columns in which more than 50% of the values are missing, and deleting dirty data that is obviously abnormal;
S103, data conversion;
step S2, selecting features using the random forest mean decrease in impurity method, comprises the following steps:
S202, selecting a feature, partitioning the data by the values of that feature, calculating the information entropy of each partition, and summing these entropies weighted by the proportion of records in each partition to obtain the information entropy H2 of the partitioning;
S203, calculating the information gain: info_gain = H1 - H2, where H1 is the information entropy of the original data;
S204, calculating the information gain of every feature according to S202 and S203, and retaining the feature attributes with larger gain;
S205, putting the top-ranked features into a set, starting from the feature index with the maximum information gain, and using this set as the preferred feature set;
step S3 comprises the following definitions and steps:
definition: let $I = \{i_1, i_2, \ldots, i_m\}$ be the set of item attributes, and let $D$ be a set of transactions $T$, where each $T$ is a set of item attributes with $T \subseteq I$; each transaction $T$ has a unique identifier, denoted TID; let $X$ be a set of items in $I$; if $X \subseteq T$, then transaction $T$ is said to contain $X$;
the weight of an item attribute $i_j$ is a value associated with that item attribute, denoted $w(i_j)$; let $P(i_j)$ be the probability that $i_j$ occurs in the transaction set $D$; $w(i_j)$ is the reciprocal of $P(i_j)$; the weight of a patient transaction is the weight of one record in the patient data set, denoted $w(T_k)$, and is the average of the weights of all item attributes belonging to $T_k$, where $T_k$ is the $k$-th record in the transaction set $D$;
formula (1): $w(i_j) = \dfrac{1}{P(i_j)}, \qquad w(T_k) = \dfrac{1}{|T_k|} \sum_{i_j \in T_k} w(i_j)$
formula (2): the weighted support of the association rule A → B is denoted wsp(A, B);
formula (3): the confidence of the association rule A → B is denoted conf(A, B);
formula (4): the lift of the association rule A → B is denoted lift(A, B); lift(A, B) > 1 indicates that A and B are positively correlated, lift(A, B) < 1 indicates that they are negatively correlated, and lift(A, B) = 1 indicates that they are independent;
S301, scanning the data table D to obtain the weight $w(i_j)$ of each item attribute $i_j$, and calculating the transaction weights $w(T_k)$ by formula (1);
S302, scanning the data table D, setting the pathology as the consequent after_item of the association rules, and putting all other features into a set Q; setting a minimum support threshold min_sup, a minimum confidence threshold min_conf and a maximum number of cycles max_rule_length;
S303, initializing the frequent 1-item set: each item in Q is joined with the consequent after_item, and the item sets whose weighted support is greater than min_sup are put into L0, wherein the weighted support is calculated by formula (2);
S304, generating the frequent (k+1)-item sets from the frequent k-item sets: first generating the frequent 1-item sets L1, then the frequent 2-item sets L2, and so on until Lr is generated for the maximum rule length r, at which point the algorithm stops; Lk is generated from Ck as follows: for each item set in Ck, calculating its weighted support sup1 and the weighted support sup2 of the same item set with after_item removed, and putting the item sets whose sup1 is greater than min_sup into L(k+1);
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010057865.7A CN111863266B (en) | 2020-01-16 | 2020-01-16 | Dangerous factor screening method for sporadic colorectal adenoma based on directional weighted association rule model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010057865.7A CN111863266B (en) | 2020-01-16 | 2020-01-16 | Dangerous factor screening method for sporadic colorectal adenoma based on directional weighted association rule model |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111863266A (en) | 2020-10-30 |
CN111863266B (en) | 2023-09-19 |
Family
ID=72984863
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010057865.7A Active CN111863266B (en) | 2020-01-16 | 2020-01-16 | Dangerous factor screening method for sporadic colorectal adenoma based on directional weighted association rule model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111863266B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113192632A (en) * | 2021-05-24 | 2021-07-30 | 哈尔滨理工大学 | Breast cancer classification method based on weighted association rule algorithm |
CN117352178A (en) * | 2023-11-10 | 2024-01-05 | 西安艾派信息技术有限公司 | Big data-based drug risk assessment system and method |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105095673A (en) * | 2015-08-26 | 2015-11-25 | 中国人民解放军军事医学科学院放射与辐射医学研究所 | Construction method of chronic disease risk model on the basis of medical big data mining |
CN109543963A (en) * | 2018-11-06 | 2019-03-29 | 深圳信息职业技术学院 | A kind of big data analysis method and system based on student's study habit |
EP3543702A1 (en) * | 2018-03-23 | 2019-09-25 | Roche Diabetes Care GmbH | Methods for screening a subject for the risk of chronic kidney disease and computer-implemented method |
CN110334737A (en) * | 2019-06-04 | 2019-10-15 | 阿里巴巴集团控股有限公司 | A kind of method and system of the customer risk index screening based on random forest |
Non-Patent Citations (1)
Title |
---|
马依热古丽·尼斯尔等: "关联规则联合logistic回归分析新疆乳腺癌发病影响因素", 《医学临床研究》, vol. 36, no. 1, pages 142 - 144 * |
Also Published As
Publication number | Publication date |
---|---|
CN111863266B (en) | 2023-09-19 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Cai et al. | Dietary patterns and their correlates among middle-aged and elderly Chinese men: a report from the Shanghai Men's Health Study | |
CN111863266B (en) | Dangerous factor screening method for sporadic colorectal adenoma based on directional weighted association rule model | |
Piperata et al. | Nutrition in transition: dietary patterns of rural Amazonian women during a period of economic change | |
De Stefani et al. | Dietary patterns and risk of gastric cancer: a case-control study in Uruguay | |
Nomura et al. | Breast cancer and diet among the Japanese in Hawaii | |
Rosinger et al. | Water from fruit or the river? Examining hydration strategies and gastrointestinal illness among Tsimane’adults in the Bolivian Amazon | |
CN114049339B (en) | Fetal cerebellum ultrasonic image segmentation method based on convolutional neural network | |
CN114582516A (en) | Disease multi-source data processing method and device, storage medium and electronic device | |
CN110264465A (en) | A kind of dissection of aorta dynamic testing method based on morphology and deep learning | |
Campagna et al. | Risk of lymphoma subtypes and dietary habits in a Mediterranean area | |
Giussi et al. | Biology and fishery of long tail hake (Macruronus magellanicus) in the Southwest Atlantic Ocean. | |
Binh et al. | Gross domestic product and dietary pattern among 49 western countries with a focus on polyamine intake | |
Matsuba et al. | Overview of epidemiology of bile duct and gallbladder cancer focusing on the JACC Study | |
Pache et al. | Prediction of fingerling biomass with deep learning | |
Palma et al. | The “Mediterraneanisation” of food fashions in the world | |
Liu et al. | From Canteen Food to Daily Meals: Generalizing Food Recognition to More Practical Scenarios | |
CN117747123A (en) | Method and system for constructing chronic disease occurrence risk prediction model of physical examination crowd | |
Anderson | Trends, drivers, and ecosystem effects of expanding global invertebrate fisheries | |
Gherman et al. | Technical report: Review of quantitative risk assessment of foodborne norovirus transmission | |
Martinez et al. | Retail prices for sustainable, healthy diets: are foods with lower environmental impacts and healthier nutritional profiles also more expensive? | |
Kvan et al. | DEEP LEARNING MODELS FOR PREDICTING THE RISK OF CARDIOVASCULAR INCIDENTS BASED ON THE WISCONSIN LONGITUDINAL STUDY | |
Mao | Learning Based Food Image Analysis-Detection, Recognition and Segmentation | |
Vatanparast et al. | Do seafood consumers have a higher diet quality compared to non-consumers? A Canadian perspective | |
Nasution et al. | The Relationship Between Eating Behavior and Diseases Experienced by Malay Families in Medan | |
Kmietowicz | Fried food linked to increased risk of death among older US women |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |