CN116092680A

CN116092680A - Abdominal aortic aneurysm early prediction method and system based on random forest algorithm

Info

Publication number: CN116092680A
Application number: CN202310212701.0A
Authority: CN
Inventors: 陈松; 廖海; 梁艳
Original assignee: Chengdu Technological University CDTU
Current assignee: Chengdu Technological University CDTU
Priority date: 2023-03-08
Filing date: 2023-03-08
Publication date: 2023-05-09
Anticipated expiration: 2043-03-08
Also published as: CN116092680B

Abstract

The invention discloses an early prediction method and system for abdominal aortic aneurysm based on a random forest algorithm, relating to the field of data processing. The technical scheme is: obtaining a first data set of characteristic information of abdominal aortic aneurysm patients, wherein the characteristic information comprises age, gender, whether the patient smokes, whether the patient is obese, whether the patient has hypertension, and whether there is a family history of abdominal aortic aneurysm; training on the first data set based on a random forest algorithm to obtain an abdominal aortic aneurysm prediction model; and acquiring a second data set of a patient to be predicted, which corresponds to the characteristic information of the abdominal aortic aneurysm patients, inputting the second data set into the abdominal aortic aneurysm prediction model for prediction, and outputting a prediction result for the patient to be predicted. The abdominal aortic aneurysm prediction model trained with the random forest algorithm has good generalization performance, and a doctor can follow the whole process of abdominal aortic aneurysm prediction, so that a trusted relationship is established between the doctor and the artificial intelligence system.

Description

Abdominal aortic aneurysm early prediction method and system based on random forest algorithm

Technical Field

The invention relates to the field of data processing, in particular to an early abdominal aortic aneurysm prediction method and system based on a random forest algorithm.

Background

Rupture of an abdominal aortic aneurysm (AAA) can be life-threatening, so it is important to predict the likelihood of an abdominal aortic aneurysm in order to allow early intervention and control. Ultrasound screening and computed tomography (CT) images have been used to analyze abdominal aortic aneurysms. Many studies have applied artificial intelligence techniques to assist diagnosis and treatment in vascular surgery, for example to predict the risk of rupture of an abdominal aortic aneurysm. However, these studies focus only on analysis after the abdominal aortic aneurysm has formed. Prior art solutions can generally be divided into three categories: (1) predicting the risk of rupture and the outcome after endovascular repair of an abdominal aortic aneurysm; (2) predicting the evolution of the abdominal aortic aneurysm with a deep learning algorithm to obtain a risk of rupture; and (3) analyzing the causative factors of abdominal aortic aneurysm for intervention and control.

The prediction methods of the prior art mainly have the following defects: 1. Lack of interpretability. Current research focuses mainly on analysis after the aneurysm has formed, by which time it has already caused tremendous harm to the patient's body. Moreover, the "black box" nature of artificial intelligence hinders its further application in the medical field: it is difficult for a doctor to understand the whole process by which a prediction algorithm predicts an abdominal aortic aneurysm, so a relationship of mutual trust cannot be established between the doctor and the artificial intelligence system. 2. Lack of effective prediction before aneurysm formation. Many studies have applied artificial intelligence techniques to analysis and prediction in vascular surgery, for example using ultrasound screening and computed tomography (CT) images to analyze abdominal aortic aneurysms, but these methods focus only on analysis after an abdominal aortic aneurysm has formed and ignore how to effectively predict an abdominal aortic aneurysm before it forms.

Disclosure of Invention

The invention aims to remedy the lack of early prediction of abdominal aortic aneurysm in the prior art and provides an early prediction method and system for abdominal aortic aneurysm based on a random forest algorithm. When a new sample set is predicted, the class to which the sample set belongs depends on the judgment of each decision tree; the prediction model itself consists of a plurality of classifiers, and the final prediction result is obtained through voting and averaging.

The technical aim of the invention is realized by the following technical scheme:

in a first aspect of the present application, there is provided an early prediction method for abdominal aortic aneurysm based on random forest algorithm, the method comprising:

obtaining a first data set of characteristic information of abdominal aortic aneurysm patients, wherein the characteristic information comprises age, gender, whether the patient smokes, whether the patient is obese, whether the patient has hypertension, and whether there is a family history of abdominal aortic aneurysm;

training the first data set based on a random forest algorithm to obtain an abdominal aortic aneurysm prediction model;

and acquiring a second data set of the patient to be predicted, which corresponds to the characteristic information of the patient with the abdominal aortic aneurysm, inputting the second data set into the abdominal aortic aneurysm prediction model for prediction, and outputting a prediction result of the patient to be predicted.

In one embodiment, the first dataset is trained based on a random forest algorithm to obtain an abdominal aortic aneurysm prediction model, in particular:

randomly extracting characteristic information of the first data set to generate a plurality of random sample sets;

generating classification and regression trees (CART) for the plurality of random sample sets based on the CART algorithm, and calculating the Gini coefficients of the random sample sets;

for the random sample set at the current node of a classification and regression tree, when the Gini coefficient of the random sample set is smaller than a Gini coefficient threshold, the CART algorithm returns the decision subtree and stops the recursion; or, when the Gini coefficient of the random sample set is not smaller than the Gini coefficient threshold, calculating the Gini coefficient of each piece of characteristic information in the random sample set at the current node;

taking the characteristic information with the smallest Gini coefficient as the optimal characteristic information, dividing the corresponding random sample set into a first information set and a second information set based on the optimal characteristic information, and taking the first information set and the second information set as the left child node and the right child node of the current node respectively;

calculating the Gini coefficients of the first information set and the second information set, and, when these Gini coefficients are smaller than the Gini coefficient threshold, the CART algorithm returns the decision subtrees and stops the recursion, thereby generating a plurality of decision trees;

and learning the plurality of decision trees by a plurality of classifiers to obtain an abdominal aortic aneurysm prediction model.

In one embodiment, the second data set is input into the abdominal aortic aneurysm prediction model for prediction, and a prediction result of a patient to be predicted is output, specifically:

dividing the second data set into a training set and a testing set according to a preset proportion;

randomly sampling the characteristic information of the training set for a plurality of times to obtain a sampling sample set, wherein the sampled characteristic information is put back into a second data set for next random sampling after each random sampling;

generating a plurality of weight values for the sample set based on the plurality of decision trees;

and calculating the sum value of the plurality of weight values, and dividing the sum value of the plurality of weight values by the number of random sampling to obtain a prediction result of the patient to be predicted.

In one embodiment, the method further comprises:

presetting a matching rule of characteristic information of a first data set, correcting a weight value output by a decision tree based on the matching rule, and summing the corrected weight value to obtain a weight correction value;

predicting the test set based on the abdominal aortic aneurysm prediction model to obtain a weight average value of the test set;

and when the weight correction value is equal to the weight average value, outputting a prediction result of the patient to be predicted.

In one embodiment, a plurality of characteristic information is randomly extracted from the first data set to obtain the random sample set, wherein the data format of the first data set is a multi-row multi-column data matrix.

In one embodiment, the plurality of output results of the abdominal aortic aneurysm prediction model are selected using a receiver operating characteristic (ROC) curve, such that the abdominal aortic aneurysm prediction model outputs a prediction result for the patient to be predicted.

In a second aspect of the present application, there is provided an early prediction system for abdominal aortic aneurysm based on random forest algorithm, comprising:

an acquisition module for acquiring a first data set of characteristic information of abdominal aortic aneurysm patients, wherein the characteristic information comprises age, gender, whether the patient smokes, whether the patient is obese, whether the patient has hypertension, and whether there is a family history of abdominal aortic aneurysm;

the training module is used for training the first data set based on a random forest algorithm to obtain an abdominal aortic aneurysm prediction model;

the prediction module is used for acquiring a second data set of the patient to be predicted, which corresponds to the characteristic information of the patient with the abdominal aortic aneurysm, inputting the second data set into the abdominal aortic aneurysm prediction model for prediction, and outputting a prediction result of the patient to be predicted.

In one embodiment, the training module comprises:

the random extraction module is used for randomly extracting the characteristic information of the first data set to generate a plurality of random sample sets;

the classification tree generation module is used for generating classification and regression trees (CART) for the plurality of random sample sets based on the CART algorithm and calculating the Gini coefficients of the random sample sets;

the judging module is used for, for the random sample set at the current node of a classification and regression tree, causing the CART algorithm to return the decision subtree and stop the recursion when the Gini coefficient of the random sample set is smaller than a Gini coefficient threshold; or, when the Gini coefficient of the random sample set is not smaller than the Gini coefficient threshold, calculating the Gini coefficient of each piece of characteristic information in the random sample set at the current node;

the sample dividing module is used for taking the characteristic information with the smallest Gini coefficient as the optimal characteristic information, dividing the corresponding random sample set into a first information set and a second information set based on the optimal characteristic information, and taking the first information set and the second information set as the left child node and the right child node of the current node respectively;

the decision tree generation module is used for calculating the Gini coefficients of the first information set and the second information set, and, when these Gini coefficients are smaller than the Gini coefficient threshold, causing the CART algorithm to return the decision subtrees and stop the recursion, thereby generating a plurality of decision trees;

and the learning module is used for learning the plurality of decision trees by a plurality of classifiers to obtain an abdominal aortic aneurysm prediction model.

In one embodiment, the prediction module is further configured to: dividing the second data set into a training set and a testing set according to a preset proportion; randomly sampling the characteristic information of the training set for a plurality of times to obtain a sampling sample set, wherein the sampled characteristic information is put back into a second data set for next random sampling after each random sampling; generating a plurality of weight values for the sample set based on the plurality of decision trees; and calculating the sum value of the plurality of weight values, and dividing the sum value of the plurality of weight values by the number of random sampling to obtain a prediction result of the patient to be predicted.

In one embodiment, the system further comprises:

the correction module is used for presetting a matching rule of the characteristic information of the first data set, correcting the weight value output by the decision tree based on the matching rule, and summing the corrected weight value to obtain a weight correction value;

the weight calculation module is used for predicting the test set based on the abdominal aortic aneurysm prediction model to obtain a weight average value of the test set;

and the interpretation module is used for outputting a prediction result of the patient to be predicted when the weight correction value is equal to the weight average value.

Compared with the prior art, the invention has the following beneficial effects:

1. the invention provides an early prediction method of abdominal aortic aneurysm based on a random forest algorithm, which takes characteristic information such as age, sex, smoking, obesity, hypertension, family history of abdominal aortic aneurysm and the like of a patient as a first data set, trains the first data set based on the random forest algorithm, so as to obtain a prediction model based on the random forest algorithm, wherein the random forest consists of a plurality of decision trees, and all the trees are independent of each other. When a sample set is predicted, which classification the sample set belongs to depends on the judgment of each decision tree, the prediction model itself consists of a plurality of classifiers, and then a final prediction result is obtained through voting and averaging.

2. The invention further considers how to verify the reliability of the prediction result of the abdominal aortic aneurysm prediction model, and explains the decision process of the algorithm based on the voting trees of the random forest algorithm, wherein the random forest mainly comprises a plurality of CART trees. To explain the decision process of the random forest clearly, each piece of characteristic information is taken as a decision tree, and the voting process of the random forest algorithm is derived. Because the prediction result of the traditional random forest algorithm is unstable, an interpretable rule is used to assist the derivation process of the random forest: first, the output of a decision tree is corrected using a matching rule; then the corrected result is verified against the labels in the test set. With the data set unchanged, if the corrected result is equal to the weight average value obtained by the abdominal aortic aneurysm prediction model on the test set, the prediction result is indicated to have higher prediction precision.

In addition, the application also provides an early abdominal aortic aneurysm prediction system based on a random forest algorithm, which has the same technical effects as the prediction method, and is not repeated here.

Drawings

The accompanying drawings, which are included to provide a further understanding of embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiments of the invention. In the drawings:

FIG. 1 is a flow chart of an early abdominal aortic aneurysm prediction method based on a random forest algorithm according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a training process of a prediction model according to an embodiment of the present application;

FIG. 3 is a schematic block diagram of an early abdominal aortic aneurysm prediction system based on a random forest algorithm according to an embodiment of the present application.

Description of the embodiments

For the purpose of making apparent the objects, technical solutions and advantages of the present invention, the present invention will be further described in detail with reference to the following examples and the accompanying drawings, wherein the exemplary embodiments of the present invention and the descriptions thereof are for illustrating the present invention only and are not to be construed as limiting the present invention.

It should be appreciated that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.

As described in the background art, the prior-art methods for predicting abdominal aortic aneurysm mainly have the following drawbacks: 1. Lack of interpretability. Current research focuses mainly on analysis after the aneurysm has formed, by which time it has already caused tremendous harm to the patient's body. Moreover, the "black box" nature of artificial intelligence hinders its further application in the medical field: it is difficult for a doctor to understand the whole process by which a prediction algorithm predicts an abdominal aortic aneurysm, so a relationship of mutual trust cannot be established between the doctor and the artificial intelligence system. 2. Lack of effective prediction before abdominal aortic aneurysm formation. Many studies have applied artificial intelligence techniques to analysis and prediction in vascular surgery, for example using ultrasound screening and computed tomography (CT) images to analyze abdominal aortic aneurysms, but these methods focus only on analysis after an abdominal aortic aneurysm has formed and neglect how to effectively predict an abdominal aortic aneurysm before it forms.

Referring to FIG. 1, which is a flow chart of an early abdominal aortic aneurysm prediction method based on a random forest algorithm according to an embodiment of the present application, the method includes the following steps:

S110, acquiring a first data set of characteristic information of abdominal aortic aneurysm patients, wherein the characteristic information comprises age, gender, whether the patient smokes, whether the patient is obese, whether the patient has hypertension, and whether there is a family history of abdominal aortic aneurysm.

Specifically, data sets for training the artificial intelligence model are created from several characteristics of patients with abdominal aortic aneurysms, including age, smoking, male sex, obesity, hypertension, family history of aneurysm, and the like. To facilitate subsequent random sampling or extraction of the characteristic information, the data set containing the characteristic information of abdominal aortic aneurysm patients is generally expressed in the form of a data matrix S, where m is the number of columns of the matrix, n is the number of rows of the matrix, x denotes the characteristic information of the first data set, y denotes the label information of the first data set, and S denotes the first data set.

In the actual calculation, in order to obtain the final prediction probability, the characteristic information is represented in binary. For example, for the age characteristic with a preset age threshold of 60 years (the preset age threshold may also be 55 years, 50 years, etc.), a patient aged 60 years or more is represented by 1. The smoking characteristic is likewise represented by 0 or 1: 1 represents smoking with average daily smoking amount × age > 400, and 0 represents no smoking or average daily smoking amount × age < 400. For drinking, 1 represents drinking with an average weekly drinking amount > 7 standard drinking units, and 0 represents no drinking or an average weekly drinking amount < 7 standard drinking units. For family history, 1 represents a history of abdominal aortic aneurysm within three generations, and 0 represents no history of abdominal aortic aneurysm within three generations. The remaining characteristics are similar and are not described further here.
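The binary encoding described above can be illustrated with a minimal Python sketch. The record type `PatientRecord`, the function `encode`, and the field names are hypothetical and are not part of the application; the thresholds are the example values given in the text, and the handling of values exactly equal to a threshold (mapped to 0 for smoking and drinking) is an assumption, since the text only specifies the strict inequalities.

```python
from dataclasses import dataclass

# Example thresholds quoted in the text; other values (e.g. age 55 or 50) are possible.
AGE_THRESHOLD = 60               # years
SMOKING_INDEX_THRESHOLD = 400    # average daily smoking amount x age
WEEKLY_DRINKS_THRESHOLD = 7      # standard drinking units per week


@dataclass
class PatientRecord:  # hypothetical container for one raw record
    age: int
    daily_smoking_amount: float
    weekly_drinks: float
    family_history: bool  # abdominal aortic aneurysm within three generations


def encode(record):
    """Map one raw record to a row of 0/1 characteristic values."""
    smoking_index = record.daily_smoking_amount * record.age
    return [
        int(record.age >= AGE_THRESHOLD),
        # Assumption: a value exactly equal to the threshold is encoded as 0.
        int(smoking_index > SMOKING_INDEX_THRESHOLD),
        int(record.weekly_drinks > WEEKLY_DRINKS_THRESHOLD),
        int(record.family_history),
    ]


example = PatientRecord(age=63, daily_smoking_amount=10, weekly_drinks=2, family_history=False)
print(encode(example))  # -> [1, 1, 0, 0]
```

Stacking such rows for n patients, together with the label for each patient, gives a data matrix of the kind the text calls S.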

And S120, training the first data set based on a random forest algorithm to obtain an abdominal aortic aneurysm prediction model.

In this embodiment, the random forest is an algorithm that integrates multiple trees through the idea of ensemble learning; its basic unit is the decision tree, and it belongs to ensemble learning (Ensemble Learning), a major branch of machine learning. Referring to FIG. 2, which is a schematic diagram of the training process of the prediction model provided in this embodiment, the training of the prediction model includes the following steps:

S121, randomly extracting characteristic information of the first data set to generate a plurality of random sample sets;

S122, generating classification and regression trees (CART) for the plurality of random sample sets based on the CART algorithm, and calculating the Gini coefficients of the plurality of random sample sets;

S123, for the random sample set at the current node of a classification and regression tree, when the Gini coefficient of the random sample set is smaller than a Gini coefficient threshold, the CART algorithm returns the decision subtree and stops the recursion; or, when the Gini coefficient of the random sample set is not smaller than the Gini coefficient threshold, calculating the Gini coefficient of each piece of characteristic information in the random sample set at the current node;

S124, taking the characteristic information with the smallest Gini coefficient as the optimal characteristic information, dividing the corresponding random sample set into a first information set and a second information set based on the optimal characteristic information, and taking the first information set and the second information set as the left child node and the right child node of the current node respectively;

S125, calculating the Gini coefficients of the first information set and the second information set, and, when these Gini coefficients are smaller than the Gini coefficient threshold, the CART algorithm returns the decision subtrees and stops the recursion, so as to generate a plurality of decision trees;

S126, learning the plurality of decision trees by a plurality of classifiers to obtain an abdominal aortic aneurysm prediction model.

Specifically, in step S121, a plurality of pieces of characteristic information are randomly extracted from the first data set to form new random sample sets, from which a plurality of classification and regression trees are generated to form the random forest. The prediction result is obtained by averaging the values obtained from the votes of the classification and regression trees. In this embodiment, each node is split by random extraction, and the errors produced under different conditions, such as whether obesity or smoking affects abdominal aortic aneurysm, are compared to determine the influence of different characteristic information on the probability of illness. In a further embodiment, a plurality of pieces of characteristic information are randomly extracted from the first data set to obtain the random sample set, wherein the data format of the first data set is a data matrix of a plurality of rows and a plurality of columns.
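As a rough illustration of step S121, the following Python sketch draws one random sample set from a data matrix. The text does not specify the sampling scheme, so the choices here are assumptions borrowed from the usual random forest recipe: rows are drawn with replacement and a subset of columns is drawn without replacement. The function name `draw_random_sample_set` is hypothetical.

```python
import random


def draw_random_sample_set(rows, labels, n_features, rng):
    """Draw one random sample set from a multi-row, multi-column data matrix.

    Assumption: rows are resampled with replacement and n_features columns
    are chosen without replacement; the source does not fix either choice.
    """
    n_rows = len(rows)
    n_cols = len(rows[0])
    row_idx = [rng.randrange(n_rows) for _ in range(n_rows)]
    col_idx = sorted(rng.sample(range(n_cols), n_features))
    sample_rows = [[rows[i][j] for j in col_idx] for i in row_idx]
    sample_labels = [labels[i] for i in row_idx]
    return sample_rows, sample_labels, col_idx


# Toy data: 4 patients, 6 binary characteristics, 1 label each.
rows = [
    [1, 0, 1, 0, 1, 0],
    [0, 1, 0, 1, 0, 1],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 1, 1, 0, 0],
]
labels = [1, 0, 1, 0]
sample_rows, sample_labels, cols = draw_random_sample_set(rows, labels, 3, random.Random(0))
print(cols, sample_rows, sample_labels)
```

Calling the function repeatedly yields the plurality of random sample sets, one per tree.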

In steps S122 and S123, the Gini coefficient is an important criterion for measuring the uncertainty of the classification and regression algorithm. The smaller the Gini coefficient, the lower the uncertainty and the better the performance. Since the random sample set is randomly sampled from the first data set, assuming that the probability of an abdominal aortic aneurysm patient in the random sample set is p, the Gini coefficient of the probability distribution is calculated as:

$$\mathrm{Gini}(p) = 2p(1 - p)$$

Further, in the random sample set D, the number of samples is defined as |D| and the number of patients with abdominal aortic aneurysm is defined as |C|, and the Gini coefficient is expressed as:

$$\mathrm{Gini}(D) = 2\,\frac{|C|}{|D|}\left(1 - \frac{|C|}{|D|}\right)$$

where Gini(D) represents the Gini coefficient of the random sample set.

Further, if the random sample set D is divided into D_1 and D_2 by the optimal feature information A, the Gini coefficient of the random sample set is defined as:

$$\mathrm{Gini}(D, A) = \frac{|D_1|}{|D|}\,\mathrm{Gini}(D_1) + \frac{|D_2|}{|D|}\,\mathrm{Gini}(D_2)$$

where D_1 represents the first information set and D_2 represents the second information set, both obtained by grouping D according to the optimal feature information; Gini(D, A) represents the uncertainty of the random sample set after the division, and A represents the optimal feature information. The smaller the Gini coefficient, the lower the uncertainty of the random sample set, and hence the higher the accuracy of the final prediction result.
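The Gini calculations can be sketched in a few lines of Python. This assumes the standard two-class form, Gini(D) = 2p(1 - p) with p = |C|/|D|, and a size-weighted sum over the two subsets for Gini(D, A); the function names `gini` and `gini_split` are hypothetical, and binary (0/1) characteristics are assumed as in the encoding described for step S110.

```python
def gini(labels):
    """Gini(D) = 2 * p * (1 - p), where p = |C| / |D| for 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def gini_split(rows, labels, feature):
    """Gini(D, A): size-weighted Gini of the two subsets induced by feature A."""
    d1 = [y for x, y in zip(rows, labels) if x[feature] == 1]  # first information set
    d2 = [y for x, y in zip(rows, labels) if x[feature] == 0]  # second information set
    n = len(labels)
    return len(d1) / n * gini(d1) + len(d2) / n * gini(d2)


rows = [[1, 0], [1, 1], [0, 1], [0, 0]]
labels = [1, 1, 0, 0]
print(gini(labels))                 # -> 0.5
print(gini_split(rows, labels, 0))  # -> 0.0 (feature 0 separates the classes)
print(gini_split(rows, labels, 1))  # -> 0.5 (feature 1 carries no information)
```

In this toy data, feature 0 has the smaller Gini coefficient and would therefore be chosen as the optimal characteristic information.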

In step S124, for each random sample set, a classification and regression tree is generated according to the CART algorithm, starting from the root node. For the random sample set at the current node, if the Gini coefficient of the random sample set is smaller than the Gini coefficient threshold, or no characteristic information remains to be selected, the CART algorithm returns the decision subtree and stops the recursion.

If the Gini coefficient of the random sample set is not smaller than the Gini coefficient threshold, the characteristic information with the smallest Gini coefficient is taken as the optimal characteristic information, the corresponding random sample set is divided into a first information set and a second information set based on the optimal characteristic information, and the first information set and the second information set are taken as the left child node and the right child node of the current node respectively.

Further, in step S125, the Gini coefficients of the first information set and the second information set may be calculated using the same mathematical expressions as above, which are not repeated here. On this basis, if the Gini coefficients of the first information set and the second information set are smaller than the Gini coefficient threshold, the CART algorithm returns the decision subtrees and stops the recursion, so that a plurality of decision trees are generated.
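The recursion of steps S123 to S125 can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation of the application: characteristics are binary, the Gini coefficient uses the two-class form 2p(1 - p), the threshold value 0.1 is arbitrary, and a leaf stores the fraction of positive labels. The names `build_tree` and `predict` are hypothetical.

```python
def gini(labels):
    """Two-class Gini coefficient 2 * p * (1 - p) for 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def build_tree(rows, labels, features, threshold=0.1):
    """Grow one tree; stop when Gini < threshold or no features remain."""
    if not features or gini(labels) < threshold:
        return {"leaf": sum(labels) / len(labels)}

    def split_gini(f):
        d1 = [y for x, y in zip(rows, labels) if x[f] == 1]
        d2 = [y for x, y in zip(rows, labels) if x[f] == 0]
        return (len(d1) * gini(d1) + len(d2) * gini(d2)) / len(labels)

    best = min(features, key=split_gini)  # optimal characteristic information
    first = [(x, y) for x, y in zip(rows, labels) if x[best] == 1]
    second = [(x, y) for x, y in zip(rows, labels) if x[best] == 0]
    if not first or not second:  # the split separates nothing, so stop here
        return {"leaf": sum(labels) / len(labels)}
    rest = [f for f in features if f != best]
    return {
        "feature": best,
        # Assumption: the first information set (feature == 1) is the left child.
        "left": build_tree([x for x, _ in first], [y for _, y in first], rest, threshold),
        "right": build_tree([x for x, _ in second], [y for _, y in second], rest, threshold),
    }


def predict(tree, x):
    """Walk from the root to a leaf and return the stored positive fraction."""
    while "leaf" not in tree:
        tree = tree["left"] if x[tree["feature"]] == 1 else tree["right"]
    return tree["leaf"]


rows = [[1, 0], [1, 1], [0, 1], [0, 0]]
labels = [1, 1, 0, 0]
tree = build_tree(rows, labels, features=[0, 1])
print(tree["feature"], predict(tree, [1, 0]), predict(tree, [0, 1]))  # -> 0 1.0 0.0
```

Running `build_tree` once per random sample set gives the plurality of decision trees that make up the forest.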

In step S126, the random forest algorithm is an ensemble learning algorithm composed of a plurality of classifiers, which vote on or average the final output, so that the result of the random forest algorithm has higher accuracy and generalization performance. Learning and classification over the plurality of decision trees are therefore completed by the plurality of classifiers, so as to obtain the abdominal aortic aneurysm prediction model.
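The voting and averaging mentioned for step S126 can be sketched as below. Each tree is modelled as a plain callable that maps a row of characteristics to a value between 0 and 1; the function name `forest_predict`, the stub trees, and the cutoff of 0.5 for counting a vote are assumptions made for illustration.

```python
def forest_predict(trees, x, cutoff=0.5):
    """Combine per-tree outputs by averaging (probability) and by majority vote (label)."""
    outputs = [tree(x) for tree in trees]
    probability = sum(outputs) / len(outputs)
    votes = sum(1 for p in outputs if p >= cutoff)
    label = int(2 * votes > len(outputs))  # strict majority of the trees
    return probability, label


# Three stub "trees" standing in for trained decision trees.
trees = [
    lambda x: float(x[0]),
    lambda x: float(x[1]),
    lambda x: 0.5 * (x[0] + x[1]),
]
print(forest_predict(trees, [1, 0]))  # -> (0.5, 1)
```

Averaging yields a probability-like score, while the vote yields a class label; the text mentions both, and which one is reported is a design choice.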

S130, acquiring a second data set of the patient to be predicted, which corresponds to the characteristic information of the patient with the abdominal aortic aneurysm, inputting the second data set into the abdominal aortic aneurysm prediction model for prediction, and outputting a prediction result of the patient to be predicted.

Specifically, a second data set of the patient to be predicted, corresponding to the characteristic information of the abdominal aortic aneurysm patients, is input to the trained abdominal aortic aneurysm prediction model for prediction, and the final prediction result is output, as follows: dividing the second data set into a training set and a test set according to a preset ratio; randomly sampling the characteristic information of the training set a plurality of times to obtain a sampled sample set, wherein after each random sampling the sampled characteristic information is put back into the second data set for the next random sampling; generating a plurality of weight values for the sampled sample set based on the plurality of decision trees; and calculating the sum of the plurality of weight values and dividing this sum by the number of random samplings to obtain the prediction result for the patient to be predicted. The above process is the specific calculation process of the random forest algorithm and is common knowledge to a person skilled in the art, so it is not explained in detail here. It is to be understood that the prediction result is presented as a probability value, for example a percentage. Further, dividing the second data set into a training set and a test set according to a preset ratio is also common knowledge to a person skilled in the art; the preset ratio is 8 to 2 and may also be 7 to 3, which is not particularly limited in this embodiment.
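One possible reading of this prediction procedure is sketched below. The text leaves open exactly how a weight value is produced, so `weight_fn` is a hypothetical stand-in for whatever value the decision trees assign to a sampled row; `split_dataset` and `averaged_prediction` are likewise hypothetical names. The sketch only shows the 8:2 split, sampling with replacement, and dividing the summed weights by the number of samplings.

```python
import random


def split_dataset(rows, ratio=0.8, rng=None):
    """Shuffle and split rows into a training set and a test set (default 8:2)."""
    rng = rng or random.Random(0)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]


def averaged_prediction(training_rows, weight_fn, n_draws, rng):
    """Sample with replacement n_draws times and average the resulting weights."""
    total = 0.0
    for _ in range(n_draws):
        row = rng.choice(training_rows)  # the row stays available for later draws
        total += weight_fn(row)          # weight_fn is a hypothetical tree output
    return total / n_draws


rows = [[i % 2, (i // 2) % 2] for i in range(10)]
train, test = split_dataset(rows, ratio=0.8, rng=random.Random(42))
print(len(train), len(test))  # -> 8 2

# Stub weight function: the mean of the binary characteristics in the row.
score = averaged_prediction(train, lambda r: sum(r) / len(r), n_draws=100, rng=random.Random(42))
print(score)
```

The returned score lies between 0 and 1 and can be reported as a percentage, matching the statement that the prediction result is presented as a probability value.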

In summary, as shown in the foregoing embodiments, the prediction method provided in the embodiment of the present application uses, as a first data set, characteristic information such as age, gender, smoking, obesity, hypertension, and family history of abdominal aortic aneurysm of a patient, and trains the first data set based on a random forest algorithm, thereby obtaining a prediction model based on the random forest algorithm, where the random forest is composed of a plurality of decision trees, and all the trees are independent of each other. When a sample set is predicted, which classification the sample set belongs to depends on the judgment of each decision tree, the prediction model itself consists of a plurality of classifiers, and then a final prediction result is obtained through voting and averaging.

In one embodiment, the method further comprises: presetting a matching rule of characteristic information of a first data set, correcting a weight value output by a decision tree based on the matching rule, and summing the corrected weight value to obtain a weight correction value; predicting the test set based on the abdominal aortic aneurysm prediction model to obtain a weight average value of the test set; and when the weight correction value is equal to the weight average value, outputting a prediction result of the patient to be predicted.

Specifically, the matching rule for the first data set of characteristic information of abdominal aortic aneurysm patients uses a set of rules [inline formula missing] to classify the characteristic information, such as whether the patient smokes or drinks. For example, the matching rule for the characteristic information of the first data set may be expressed as follows:

[formula missing]

where the three symbols of the formula [symbols missing] denote, respectively, the regular expression of the i-th characteristic information of the first data set, the rule calculation result of the i-th characteristic information, and whether the i-th characteristic information is contained, for example whether the patient smokes.

A connection operation is performed on the matching rules of the plurality of characteristic information to obtain an average value of the matching rules, for example, if [formula missing], where i denotes the i-th characteristic information and [symbol missing] denotes the join operation. Further, the weight value output by the decision tree is corrected based on the obtained average value of the matching rules.
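As an illustration of rule matching with regular expressions, the following sketch applies one rule per characteristic to a free-text record, collects the 0/1 results, and averages them. Everything in it is an assumption made for illustration: the rule names, the patterns, the sample record, and the reading of the join operation as collecting the per-rule results. The original formulas are not available, and the way the average corrects a decision tree's weight value is not specified, so that step is left out.

```python
import re

# Hypothetical rules: one regular expression per characteristic.
RULES = {
    "smoking": re.compile(r"\bsmok(?:es|er|ing)\b", re.IGNORECASE),
    "hypertension": re.compile(r"\bhypertension\b", re.IGNORECASE),
    "family_history": re.compile(r"\bfamily history\b", re.IGNORECASE),
}

def apply_rules(record: str) -> dict:
    """Return 1 for each characteristic whose rule matches the record, else 0."""
    return {name: int(bool(rx.search(record))) for name, rx in RULES.items()}

def rule_average(indicators: dict) -> float:
    """Combine the per-rule results and return their mean."""
    return sum(indicators.values()) / len(indicators)

record = "65-year-old male, smoker, hypertension, no relatives affected"
indicators = apply_rules(record)
average = rule_average(indicators)
```

These simple patterns do not handle negation (a record reading "no hypertension" would still match), so a practical rule set would need more careful expressions.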

In this embodiment, to address how the reliability of the prediction result of the abdominal aortic aneurysm prediction model can be verified, the decision process of the algorithm is explained using the voting trees of the random forest algorithm. A random forest is mainly composed of a number of decision trees. In order to explain the decision process of the random forest clearly, this embodiment takes each item of characteristic information as a decision tree and derives the voting process of the random forest algorithm. An interpretable rule is used to assist the derivation process of the random forest: first, the output of a decision tree is corrected using the matching rule; then the corrected result is further verified against the labels in the test set. With the data set unchanged, if the corrected result is equal to the weight average value obtained for the test set through the abdominal aortic aneurysm prediction model, the prediction result has higher prediction precision. The prediction process is also transparent to a doctor, so that a trusted prediction model is provided for early prediction of abdominal aortic aneurysm, a relationship of trust is established between the doctor and the artificial intelligence system, and the reliability of the final prediction result is ensured.

In one embodiment, a plurality of output results of the abdominal aortic aneurysm prediction model are screened using a receiver operating characteristic (ROC) curve, so that the abdominal aortic aneurysm prediction model outputs the prediction result of the patient to be predicted.

Specifically, in the random sample set the number of positive samples is the same as the number of negative samples, and the output for each sample is either positive or negative. In this embodiment the reliability of the plurality of classifiers is evaluated using a receiver operating characteristic (ROC) curve. The classifiers label each sample as positive or negative, and the outcomes can be represented by a confusion matrix with four types of entry (true positive, false positive, true negative, and false negative); this is prior art and is not explained further. The true positive rate (TPR) is the percentage of correctly predicted positive samples among all samples whose true label is positive; the false positive rate (FPR) is the percentage of samples mispredicted as positive among all samples whose true label is negative.

The ROC curve takes the false positive rate (FPR) as the X-axis and the true positive rate (TPR) as the Y-axis. The area under the curve (AUC) is the area between the ROC curve and the X-axis; the larger the AUC value, the closer the curve is to the upper left corner, which indicates a higher proportion of correctly classified samples and a lower proportion of misclassified samples. An AUC of 0.9 to 1 indicates perfect precision; 0.7 to 0.9 indicates high precision; and an AUC below 0.7 is treated as equivalent to random guessing.
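To make the ROC construction concrete, the sketch below sweeps a decision threshold over predicted probabilities, records an (FPR, TPR) point at each threshold, and integrates the curve with the trapezoidal rule. It is a standard-library illustration under stated assumptions: the function names, the six labels, and the six scores are hypothetical, and the sample is balanced (three positives, three negatives) to mirror the sample set described above.

```python
def roc_points(labels, scores):
    """Return (FPR, TPR) points obtained by sweeping the decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve, computed with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

labels = [1, 1, 0, 1, 0, 0]               # 1 = aneurysm, 0 = no aneurysm
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # hypothetical predicted probabilities
curve = roc_points(labels, scores)
area = auc(curve)
```

For this toy data the area is 8/9, which falls in the 0.7 to 0.9 band; the value equals the fraction of positive/negative pairs in which the positive sample receives the higher score.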

Because the abdominal aortic aneurysm prediction model can produce a plurality of prediction results, the method uses the receiver operating characteristic (ROC) curve, an evaluation tool from machine learning, to screen the plurality of prediction results by accuracy, thereby ensuring the accuracy of the prediction result that is finally output.

Based on the same inventive concept, this embodiment also provides an early abdominal aortic aneurysm prediction system based on a random forest algorithm, which corresponds to, and can implement, the early abdominal aortic aneurysm prediction method based on a random forest algorithm shown in the foregoing embodiments; to avoid repetition, the details are not described again here. As shown in fig. 3, the system includes:

an acquisition module 310 for acquiring a first data set of characteristic information of abdominal aortic aneurysm patients, wherein the characteristic information comprises age, gender, whether the patient smokes, whether the patient is obese, whether the patient has hypertension, and whether there is a family history of abdominal aortic aneurysm;

a training module 320, configured to train the first data set based on a random forest algorithm to obtain an abdominal aortic aneurysm prediction model;

the prediction module 330 is configured to obtain a second data set of the patient to be predicted, which corresponds to the characteristic information of the patient with abdominal aortic aneurysm, input the second data set into the abdominal aortic aneurysm prediction model for prediction, and output a prediction result of the patient to be predicted.
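One way the three modules could be wired together is sketched below. The class name, the method names, the toy records, and the trainer are all hypothetical; in particular, `base_rate_trainer` is a placeholder that only returns the share of positive labels and is not a random forest. The sketch shows the data flow between modules 310, 320, and 330 only.

```python
class PredictionSystem:
    """Wires acquisition (310), training (320) and prediction (330) together."""

    def __init__(self, acquire, train):
        self.acquire = acquire   # acquisition module 310: returns (features, labels)
        self.train = train       # training module 320: returns a fitted model
        self.model = None

    def fit(self):
        features, labels = self.acquire()
        self.model = self.train(features, labels)

    def predict(self, patient):
        # prediction module 330: applies the trained model to a new patient
        if self.model is None:
            raise RuntimeError("call fit() before predict()")
        return self.model(patient)

def toy_acquire():
    # Hypothetical records: age, gender, smoking, obesity, hypertension, family history
    features = [
        [70, 1, 1, 0, 1, 1],
        [45, 0, 0, 0, 0, 0],
        [66, 1, 1, 1, 1, 0],
        [72, 1, 0, 1, 1, 1],
    ]
    labels = [1, 0, 1, 1]
    return features, labels

def base_rate_trainer(features, labels):
    # Placeholder, NOT a random forest: always predicts the share of positive labels.
    rate = sum(labels) / len(labels)
    return lambda patient: rate

system = PredictionSystem(toy_acquire, base_rate_trainer)
system.fit()
probability = system.predict([68, 1, 1, 0, 1, 0])
```

Passing the acquisition and training steps in as callables keeps the prediction module independent of how the model is trained, so a real random forest trainer could be substituted without changing the class.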

In one embodiment, the training module includes:

In one embodiment, the system further comprises:

It can be seen that the early abdominal aortic aneurysm prediction system based on the random forest algorithm in the above embodiment has the following beneficial effects: 1. Characteristic information of a patient, such as age, sex, smoking, obesity, hypertension, and family history of abdominal aortic aneurysm, is taken as a first data set, and the first data set is trained on based on a random forest algorithm to obtain a prediction model based on the random forest algorithm. The random forest consists of a plurality of decision trees, and all the trees are independent of each other. When a sample set is predicted, the class to which it belongs depends on the judgment of each decision tree: the prediction model itself consists of a plurality of classifiers, and the final prediction result is obtained through voting and averaging.

2. To address how the reliability of the prediction result of the abdominal aortic aneurysm prediction model can be verified, the decision process of the algorithm is explained based on the voting trees of the random forest algorithm; the random forest mainly comprises a plurality of CART trees. In order to explain the decision process of the random forest clearly, each item of characteristic information is taken as a decision tree, and the voting process of the random forest algorithm is derived. Because the prediction result of the traditional random forest algorithm is unstable, an interpretable rule is used to assist the derivation process of the random forest: first, the output of a decision tree is corrected using the matching rule; then the corrected result is further verified against the labels in the test set. With the data set unchanged, if the corrected result is equal to the weight average value obtained by the abdominal aortic aneurysm prediction model on the test set, the prediction result is indicated to have higher prediction precision.

The foregoing description of the embodiments has been provided to illustrate the general principles of the invention and is not intended to limit the scope of the invention to the particular embodiments described; any modifications, equivalents, improvements, and the like that fall within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims

1. An early abdominal aortic aneurysm prediction method based on a random forest algorithm is characterized by comprising the following steps:

2. The early abdominal aortic aneurysm prediction method based on the random forest algorithm according to claim 1, wherein the first data set is trained based on the random forest algorithm to obtain an abdominal aortic aneurysm prediction model, specifically:

3. The early prediction method of abdominal aortic aneurysm based on random forest algorithm according to claim 2, wherein the second data set is input into the abdominal aortic aneurysm prediction model for prediction, and the prediction result of the patient to be predicted is output, specifically:

4. The method for early prediction of abdominal aortic aneurysm based on random forest algorithm according to claim 3, wherein the method further comprises:

5. The method of claim 2, wherein a plurality of characteristic information is randomly extracted from the first dataset to obtain the random sample set, wherein the data format of the first dataset is a multi-row multi-column data matrix.

6. The early abdominal aortic aneurysm prediction method based on the random forest algorithm according to claim 1, wherein a plurality of output results of the abdominal aortic aneurysm prediction model are screened using a receiver operating characteristic (ROC) curve, so that the abdominal aortic aneurysm prediction model outputs the prediction result of the patient to be predicted.

7. An early abdominal aortic aneurysm prediction system based on a random forest algorithm, comprising:

8. The early abdominal aortic aneurysm prediction system based on random forest algorithm according to claim 7, wherein the training module comprises:

9. The early abdominal aortic aneurysm prediction system based on random forest algorithm according to claim 8, wherein the prediction module is further configured to: dividing the second data set into a training set and a testing set according to a preset proportion; randomly sampling the characteristic information of the training set for a plurality of times to obtain a sampling sample set, wherein the sampled characteristic information is put back into a second data set for next random sampling after each random sampling; generating a plurality of weight values for the sample set based on the plurality of decision trees; and calculating the sum value of the plurality of weight values, and dividing the sum value of the plurality of weight values by the number of random sampling to obtain a prediction result of the patient to be predicted.

10. The early abdominal aortic aneurysm prediction system based on random forest algorithm according to claim 9, wherein the system further comprises: