CN116502742A - Risk behavior information prediction method and system based on gradient lifting decision tree

Info

Publication number
CN116502742A
Authority
CN
China
Prior art keywords
prediction
data
risk behavior
decision tree
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310161573.1A
Other languages
Chinese (zh)
Inventor
孟祥忠
王亦冰
吕茜茜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Minzhi Digital Technology Co ltd
Original Assignee
Beijing Minzhi Digital Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Minzhi Digital Technology Co ltd filed Critical Beijing Minzhi Digital Technology Co ltd
Priority to CN202310161573.1A priority Critical patent/CN116502742A/en
Publication of CN116502742A publication Critical patent/CN116502742A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0635 Risk analysis of enterprise or organisation activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0639 Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393 Score-carding, benchmarking or key performance indicator [KPI] analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Landscapes

  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Engineering & Computer Science (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • Game Theory and Decision Science (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Educational Administration (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention belongs to the technical field of artificial intelligence and discloses a risk behavior information prediction method and system based on a gradient lifting decision tree. Scale evaluation data, House-Tree-Person (HTP) drawing feature data, and expert interview evaluation data are acquired; the three kinds of data are preprocessed to form the data features to be identified by the model, and the data are synthesized into combined evaluation data using a range function. The model takes the features of the combined evaluation data as prediction variables and takes the population with risk behaviors in the expert interview evaluation data as the prediction target. Finally, the model is used to predict the risk behaviors of new individuals: the behavior prediction variables are input into the gradient lifting decision tree prediction model, risk behavior indexes are output, and risk behavior levels are divided. The invention can avoid the data authenticity deviation produced by a single data source, accurately predict the risk behaviors of soldiers, and output the importance of the prediction variables of the model, which helps professionals provide effective intervention for soldiers with risk behaviors.

Description

Risk behavior information prediction method and system based on gradient lifting decision tree
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a risk behavior information prediction method and system based on a gradient lifting decision tree.
Background
At present, the acquisition of risk behavior data is mostly limited to one of three methods: psychological scale evaluation, projection testing, or expert interview evaluation. Psychological scale evaluation, the most widely used, cannot acquire risk behavior data accurately, because respondents often show a tendency to lie. The projection test can overcome problems such as excessive text and lie tendency, and can detect more genuine inner thoughts through instantaneous, unreflective reactions, but multidimensional data are difficult to acquire with it. Expert interview evaluation tends to obtain more accurate risk behavior data and levels through face-to-face assessment, but it is costly and difficult to carry out on a large scale. A data analysis method that effectively combines the three types of data, so as to avoid sample data errors to the greatest extent possible, is currently lacking.
In the context of psychological research that focuses on high-dimensional big data, risk behavior evaluation involves more and more dimensions and the acquired data structures are increasingly complex, which places higher demands on data analysis algorithms. Risk behavior prediction models based on machine learning (ML) are gradually emerging for adolescent and pathological populations, mainly including support vector machines (SVM), random forests (RSF), decision trees (DT), and extreme gradient boosting (XGBoost), but the effectiveness and accuracy of each model are affected by the authenticity of the evaluation data, the population characteristics, and the applicability of the algorithm. Most models analyze a single kind of data, evaluated only from psychological scales, so the data are often not truthful enough and the effectiveness of the models is greatly reduced. Moreover, no soldier risk behavior information prediction method based on a machine learning (ML) model has been found so far.
Through the above analysis, the problems and defects of the prior art are as follows: existing risk behavior data are insufficiently accurate, and a soldier risk behavior information prediction method based on a machine learning (ML) model is lacking, even though risk behaviors among soldiers at home and abroad are increasing and intervention work on soldier behavior tends to emphasize "treatment" over "prevention".
Disclosure of Invention
Aiming at the problems existing in the prior art, the invention provides a risk behavior information prediction method and system based on a gradient lifting decision tree.
The invention is realized as follows. In the risk behavior information prediction method based on a gradient lifting decision tree, scale evaluation data, House-Tree-Person (HTP) drawing feature data, and expert interview evaluation data are acquired through psychological scale evaluation, the House-Tree-Person projection test, and expert interview evaluation, respectively. The three kinds of data are preprocessed to form the data features to be identified by the model, and the scale evaluation data and the HTP drawing feature data are synthesized into combined evaluation data by applying a range function. The model takes the features of the combined evaluation data as prediction variables and the population with risk behaviors in the expert interview evaluation data as the prediction target. The data are randomly divided into a training set and a test set; a gradient lifting decision tree model is established with the training set data, and the data features of the test set are fed into the machine learning model to verify the model effect. Finally, the model is used to predict the risk behaviors of new individuals: the behavior prediction variables are input into the gradient lifting decision tree prediction model, risk behavior indexes are output, and risk behavior levels are divided.
Further, the risk behavior information prediction method based on the gradient lifting decision tree comprises the following steps:
step one, obtaining scale evaluation data, House-Tree-Person (HTP) drawing feature data, and risk behavior judgment scores through psychological scale evaluation, the House-Tree-Person projection test, and expert interview evaluation, respectively;
step two, preprocessing the data: performing normalization to form a unified data format of 0 to 1; correcting the deviation between the same items or dimensions in the scale evaluation data and the HTP drawing feature data by using a range function to form combined evaluation data; applying Spearman moment correlation, point-biserial correlation analysis, and the χ² test to remove one of each pair of highly correlated variables; and removing redundant variables that lack contribution to the prediction target by using binary logistic regression;
step three, establishing training set and test set data: in the training set, the population with risk behaviors according to the risk behavior judgment scores is taken as the prediction target, and the 15 significant variables selected by binary logistic regression, such as depression, stressful life events, social support, perceived burdensomeness, and childhood adversity, are taken as prediction variables to establish a soldier risk behavior gradient lifting decision tree prediction model, whose performance is then verified on the test set; in addition, its performance is compared on the same data set with 4 other data analysis methods (binary logistic regression, support vector machine, random forest, and extreme gradient boosting), further demonstrating the accuracy and reliability of the gradient lifting decision tree model.
And step four, inputting a new individual risk behavior prediction variable into a gradient lifting decision tree prediction model, outputting a risk behavior prediction value, converting the risk behavior prediction value into a risk behavior index by using a function formula, and dividing the risk behavior level according to a certain rule.
Further, the normalizing process of the data in the second step includes:
The data are normalized into a unified format between 0 and 1. The psychological content is divided into n items, and the raw values collected for the scale evaluation data, the House-Tree-Person drawing feature data, and the risk behavior judgment scores lie in the ranges 1-5 and 1-2. All data are normalized to values between 0 and 1 by the following formula:
$$x_i = \frac{t_i - t_i^{\min}}{t_i^{\max} - t_i^{\min}}$$
where $x_i$ is the normalized value of the i-th item, $t_i$ is the value of the i-th item in the sample data, $t_i^{\min}$ is the minimum value of the i-th item over all collected sample data, and $t_i^{\max}$ is the maximum value of the i-th item over all collected sample data.
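The normalization described above can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the function name `min_max_normalize`, the row/column layout, and the handling of a constant column (mapped to 0.0) are assumptions made here.

```python
def min_max_normalize(samples):
    """Scale every item (column) of `samples` to the range [0, 1].

    `samples` is a list of rows: one row per respondent, one column per item i.
    A column whose values are all equal is mapped to 0.0; that choice is an
    assumption of this sketch, since the text does not cover the case.
    """
    columns = list(zip(*samples))
    mins = [min(col) for col in columns]  # t_i^min for each item
    maxs = [max(col) for col in columns]  # t_i^max for each item
    normalized = []
    for row in samples:
        normalized.append([
            (t - lo) / (hi - lo) if hi > lo else 0.0
            for t, lo, hi in zip(row, mins, maxs)
        ])
    return normalized


# Hypothetical raw answers: first item on a 1-5 scale, second on a 1-2 scale.
raw = [[1, 1], [3, 2], [5, 1]]
scaled = min_max_normalize(raw)
```

Because the minimum and maximum are taken per item, items collected on different raw scales become directly comparable after scaling.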
Further, in step two, removing redundant variables by means of Spearman moment correlation, point-biserial correlation analysis, the χ² test, and binary logistic regression comprises the following. Spearman moment correlation, point-biserial correlation analysis, and the χ² test are used to calculate the associations among the prediction variables, and variables with excessively strong associations are deleted, specifically:
(1) The correlations among 15 continuous prediction variables, such as enlistment time, military rank, death intolerance, neuroticism, and extraversion, are calculated using the Spearman moment correlation, with the calculation formula:
$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$$
where $x_i - \bar{x}$ is the deviation of one prediction variable from its mean and $y_i - \bar{y}$ is the deviation of the other prediction variable from its mean. The analysis finds a strong correlation between sense of belonging and social support, indicating that the psychological traits they measure are very similar; after consulting the literature on risk behaviors, the sense-of-belonging prediction variable is removed.
(2) The correlations between 5 categorical variables, such as gender and mental disorder, and 15 continuous variables, such as age, enlistment time, military rank, neuroticism, and extraversion, are calculated using point-biserial correlation analysis, with the calculation formula:
$$r_{pb} = \frac{\bar{X}_p - \bar{X}_q}{s_t}\sqrt{pq}$$
where $\bar{X}_p$ is the mean of the continuous variable for one value of the binary variable; $\bar{X}_q$ is the mean of the continuous variable for the other value of the binary variable; $p$ and $q$ are the proportions of the two values of the binary variable; and $s_t$ is the standard deviation of the continuous variable. The results show that the correlation coefficients among the variables are all below 0.4, so the correlations are weak.
(3) The correlations among the 5 categorical variables, such as gender and mental disorder, are calculated using the χ² test, with the calculation formula:
$$\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e}$$
where $f_0$ is the actual (observed) count and $f_e$ is the theoretical (expected) count.
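The three association measures in items (1) to (3) can be sketched in plain Python as follows. This is a hedged illustration under stated assumptions: the function names are hypothetical, the binary variable is coded 0/1, the point-biserial formula uses the population standard deviation, and the expected counts of the χ² statistic are derived from the row and column totals of a contingency table. The deviation-from-the-mean definition in item (1) corresponds to the product-moment form implemented in `pearson`; a Spearman coefficient is obtained here by applying the same formula to ranks, which is an interpretation rather than something the text states.

```python
import math


def pearson(x, y):
    """Product-moment correlation built from deviations from the means."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mx for a in x]
    dy = [b - my for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    return numerator / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))


def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Rank correlation: the product-moment formula applied to ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def point_biserial(binary, continuous):
    """r_pb = (mean_p - mean_q) / s_t * sqrt(p * q), with binary coded 0/1."""
    group_p = [c for b, c in zip(binary, continuous) if b == 1]
    group_q = [c for b, c in zip(binary, continuous) if b == 0]
    n = len(continuous)
    p, q = len(group_p) / n, len(group_q) / n
    mean_all = sum(continuous) / n
    s_t = math.sqrt(sum((c - mean_all) ** 2 for c in continuous) / n)
    mean_p = sum(group_p) / len(group_p)
    mean_q = sum(group_q) / len(group_q)
    return (mean_p - mean_q) / s_t * math.sqrt(p * q)


def chi_square(observed):
    """Chi-square statistic of a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, f_obs in enumerate(row):
            f_exp = row_totals[i] * col_totals[j] / total
            statistic += (f_obs - f_exp) ** 2 / f_exp
    return statistic
```

A screening step in the spirit of the text would compute these coefficients for each pair of candidate variables and drop one variable of any pair whose association is judged too strong; the cut-off itself is left to the analyst.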
The binary logistic regression is used for selecting data related to a predicted target, and specifically comprises the following steps:
(1) For the 24 regression prediction variables $X_1, X_2, X_3, \dots, X_{24}$, a univariate logistic regression model with the prediction target (risk behavior) is established for each variable:
$$\mathrm{Odds} = e^{\beta_0 + \beta_i X_i + \varepsilon}$$
$$\log(\mathrm{Odds}) = \beta_0 + \beta_i X_i + \varepsilon, \quad i = 1, \dots, P$$
where Odds = at risk / not at risk.
The values of the test statistic of the regression coefficients corresponding to the variables $X_1, X_2, X_3, \dots, X_{24}$ are calculated and denoted $F_1^{(1)}, \dots, F_{24}^{(1)}$, and the maximum value $F_{i_1}^{(1)}$ is taken:
$$F_{i_1}^{(1)} = \max\{F_1^{(1)}, \dots, F_{24}^{(1)}\}$$
For a given significance level of 0.05, the corresponding critical value is denoted $F^{(1)}$; if $F_{i_1}^{(1)} > F^{(1)}$, then $X_{i_1}$ is introduced into the regression model, and $I_1$ denotes the index set of the selected variables.
(2) A two-variable regression model is established between the prediction target Log(Odds) and each of the predictor subsets $\{X_{i_1}, X_1\}, \dots, \{X_{i_1}, X_{i_1-1}\}, \{X_{i_1}, X_{i_1+1}\}, \dots, \{X_{i_1}, X_{24}\}$. The F-test statistic of the regression coefficient of each variable is calculated and denoted $F_k^{(2)}$ ($k \notin I_1$); the largest one is selected and denoted $F_{i_2}^{(2)}$, with the corresponding prediction variable index denoted $i_2$:
$$F_{i_2}^{(2)} = \max\{F_1^{(2)}, \dots, F_{i_1-1}^{(2)}, F_{i_1+1}^{(2)}, \dots, F_P^{(2)}\}$$
For a given significance level of 0.05, the corresponding critical value is denoted $F^{(2)}$; if $F_{i_2}^{(2)} > F^{(2)}$, then the variable $X_{i_2}$ is introduced into the regression model; otherwise, the variable introduction process is terminated.
(3) Regressions on the predictor subsets $\{X_{i_1}, X_{i_2}, X_k\}$ are then considered and step (2) is repeated, each time selecting one of the prediction variables not yet introduced into the regression model, until no further variable passes the test. Finally, 15 variables, such as depression, stressful life events, social support, perceived burdensomeness, and childhood adversity, are selected as the prediction variables.
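The variable introduction procedure in steps (1) to (3) is a greedy forward selection loop, which can be sketched as below. The sketch deliberately leaves the statistical part abstract: `statistic` is a hypothetical callable standing in for fitting the logistic regression and returning the test statistic of the candidate's coefficient, and the names `forward_select`, `toy_statistics`, and the critical value 4.0 are illustrative assumptions, not values from the source.

```python
def forward_select(candidates, statistic, critical_value):
    """Greedy forward selection of prediction variables.

    `statistic(selected, candidate)` must return the test statistic of
    `candidate` when it is added to the variables already in `selected`.
    At each round the candidate with the largest statistic is introduced
    if it exceeds `critical_value`; otherwise the process terminates.
    """
    selected = []
    remaining = list(candidates)
    while remaining:
        scores = {c: statistic(selected, c) for c in remaining}
        best = max(remaining, key=lambda c: scores[c])
        if scores[best] <= critical_value:
            break  # no remaining variable passes the test
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy stand-in for the regression step: fixed, made-up statistics.
toy_statistics = {"depression": 9.0, "social_support": 6.5, "age": 1.2}
chosen = forward_select(
    toy_statistics,
    statistic=lambda selected, candidate: toy_statistics[candidate],
    critical_value=4.0,  # hypothetical critical value for the 0.05 level
)
```

In a real analysis the statistic of a candidate generally changes with the variables already selected, which is why the callable receives the current selection.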
In step three, the population with risk behaviors according to the risk behavior judgment scores is taken as the prediction target, the 15 significant variables selected by binary logistic regression, such as depression, stressful life events, social support, perceived burdensomeness, and childhood adversity, are taken as prediction variables, and a gradient lifting decision tree algorithm is used to establish a prediction model of the sample data.
The method for establishing the prediction model of the sample data by using the gradient lifting decision tree algorithm to predict soldier risk behaviors comprises the following steps. The sample dataset is randomly partitioned at a ratio of 3: into a training set and a testing set. The training set is used for training the gradient lifting decision tree prediction model, and suitable hyperparameters are set according to the highest prediction accuracy of the model. The independent test set is used only for verifying and evaluating the model: the gradient lifting decision tree prediction model is trained on the new balanced training set, the performance indexes of the model are independently verified from multiple aspects, and the relative importance weights of the prediction variables are output. New prediction variables are then input into the prediction model, a risk behavior index is output, and risk levels are classified according to a certain standard.
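The random partition and the independent evaluation can be sketched as follows. This is only an illustration: the helper names are hypothetical, the test fraction of 0.25 is an arbitrary assumption (the split ratio is not fully legible in the text), and accuracy, sensitivity, and specificity are used because those are the performance indexes the document reports elsewhere. Labels are assumed to be coded 1 (risk behavior) and 0 (no risk behavior), with both classes present.

```python
import random


def train_test_split(rows, test_fraction, seed=0):
    """Shuffle `rows` reproducibly and split them into (train, test)."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def classification_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for labels coded 1 / 0."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn),  # share of risk cases detected
        "specificity": tn / (tn + fp),  # share of non-risk cases cleared
    }


# Hypothetical data: 8 sample ids, and made-up true / predicted labels.
train_ids, test_ids = train_test_split(list(range(8)), test_fraction=0.25)
metrics = classification_metrics(
    y_true=[1, 1, 1, 0, 0, 0, 0, 1],
    y_pred=[1, 1, 0, 0, 0, 1, 0, 1],
)
```

Keeping the test rows out of hyperparameter tuning, as the text requires, is what makes the reported metrics an independent check.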
The gradient lifting decision tree contains a plurality of decision trees, and the prediction model is generated by the results of all the decision trees together.
The algorithm flow for carrying out the gradient lifting decision tree algorithm for statistical classification on the risk behavior sample comprises the following steps:
respectively extracting predicted variables and carrying out normalization processing, and removing redundancy and repeated predicted variables by using correlation analysis and binary logistic regression; and taking the predicted variable as an input sample of the model, continuously training and learning, and finally, outputting the model as a risk behavior predicted result.
The training and learning process of the prediction model comprises: inputting the obtained prediction variables into the 1st decision tree to obtain the model's estimate for the training samples; calculating the model residual based on the obtained sample estimates; and training the 2nd model based on the original sample input information and the residual, repeating this until M models have been trained, to finally obtain the predicted result of the risk behaviors.
For a risk behavior training data set $T = \{(x_1, y_1), \dots, (x_N, y_N)\}$ containing N samples, the gradient lifting decision tree algorithm flow comprises:
(1) Initializing the learner:
$$f_0(x) = \arg\min_c \sum_{i=1}^{N} L(y_i, c)$$
where $f_0(x)$ is an initial tree with only one root node, $c$ is the constant that minimizes the loss function, $L(y_i, c)$ is the loss function used to calculate the difference between the target value and the calculated value, and $y_i$ is the target value of the i-th training sample.
The log-likelihood function is introduced as the loss function to reduce the residual loss of the samples; its expression is:
$$L(y, f(x)) = \log(1 + \exp(-y f(x)))$$
(2) For each iteration $m = 1, 2, \dots, M$ and each sample $i = 1, 2, \dots, N$, the negative gradient of the i-th training sample, that is, the residual, is calculated as:
$$r_{mi} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{m-1}(x)}$$
The obtained residual is taken as the new target value of the sample, and the pairs $(x_i, r_{mi})$ ($i = 1, 2, \dots, N$) are fitted by a decision tree $T_m$ consisting of J leaf nodes, whose leaf node regions are $R_{mj}$ ($j = 1, 2, \dots, J$). The best fit value of each leaf node is:
$$c_{mj} = \arg\min_c \sum_{x_i \in R_{mj}} L(y_i, f_{m-1}(x_i) + c)$$
The strong learner is then updated:
$$f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} c_{mj} I(x \in R_{mj})$$
where $I$ is the indicator function of whether a training sample falls in the j-th leaf node region.
(3) After M rounds of iteration, the final learner is obtained:
$$f(x) = f_M(x) = f_0(x) + \sum_{m=1}^{M} \sum_{j=1}^{J} c_{mj} I(x \in R_{mj})$$
where $f_0(x)$ is the initial tree with only one root node, $c_{mj}$ is the constant that minimizes the loss function in leaf region $R_{mj}$, $I$ is the indicator function of whether a training sample falls in the j-th leaf node region, and $T_m$ is the decision tree consisting of J leaf nodes with leaf node regions $R_{mj}$ ($j = 1, 2, \dots, J$).
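The three-step flow above (constant initial learner, fitting each tree to the negative gradient, additive update) can be sketched in pure Python. This is a minimal teaching sketch under several assumptions that are not in the source: each tree is a depth-1 stump (J = 2 leaves), labels are coded -1 / +1 to match the loss log(1 + exp(-y f(x))), the leaf value is a single Newton step that approximates the exact arg-min, a shrinkage factor `learning_rate` is applied, and both classes must be present. All function names are hypothetical.

```python
import math


def best_split(X, targets):
    """Return (feature, threshold) of the stump with the lowest squared error."""
    best = None
    for feat in range(len(X[0])):
        values = sorted(set(row[feat] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2
            left = [t for row, t in zip(X, targets) if row[feat] <= thr]
            right = [t for row, t in zip(X, targets) if row[feat] > thr]
            sse = 0.0
            for side in (left, right):
                mean = sum(side) / len(side)
                sse += sum((t - mean) ** 2 for t in side)
            if best is None or sse < best[0]:
                best = (sse, feat, thr)
    return None if best is None else (best[1], best[2])


def fit_gbdt(X, y, n_rounds=10, learning_rate=0.5):
    """Boost stumps on labels y in {-1, +1} with loss log(1 + exp(-y f))."""
    n_pos = sum(1 for label in y if label == 1)
    f0 = math.log(n_pos / (len(y) - n_pos))  # constant minimising the loss
    scores = [f0] * len(y)
    trees = []
    for _ in range(n_rounds):
        # Negative gradient of the loss at the current scores (the residuals).
        residuals = [yi / (1.0 + math.exp(yi * fi)) for yi, fi in zip(y, scores)]
        split = best_split(X, residuals)
        if split is None:
            break  # no feature has two distinct values to split on
        feat, thr = split
        leaf_values = []
        for is_right in (False, True):
            members = [r for row, r in zip(X, residuals)
                       if (row[feat] > thr) == is_right]
            # Newton step: sum of gradients over sum of second derivatives.
            curvature = sum(abs(r) * (1.0 - abs(r)) for r in members)
            leaf_values.append(sum(members) / curvature if curvature > 1e-12 else 0.0)
        trees.append((feat, thr, leaf_values[0], leaf_values[1]))
        scores = [fi + learning_rate * leaf_values[row[feat] > thr]
                  for row, fi in zip(X, scores)]
    return {"f0": f0, "trees": trees, "learning_rate": learning_rate}


def predict_proba(model, row):
    """Probability of the +1 class: sigmoid of the summed tree outputs."""
    score = model["f0"] + sum(
        model["learning_rate"] * (right if row[feat] > thr else left)
        for feat, thr, left, right in model["trees"]
    )
    return 1.0 / (1.0 + math.exp(-score))


# Tiny made-up data set with one normalized prediction variable.
X = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
y = [-1, -1, -1, 1, 1, 1]
model = fit_gbdt(X, y)
```

Production work would normally rely on an established gradient boosting library rather than a hand-written loop; the sketch only makes the residual-fitting structure of the algorithm visible.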
The output prediction variable relative importance weights include:
For a single decision tree T, the importance of a prediction variable is obtained from the nodes at which the variable is selected as the splitting variable during the iterative process, according to the following formula:
$$\hat{I}_k^2(T) = \sum_{t=1}^{J-1} \hat{i}_t^2 \, \mathbb{1}(v_t = k)$$
where $J-1$ is the number of non-leaf nodes, $v_t$ is the feature associated with non-leaf node $t$, and $\hat{i}_t^2$ is the reduction in squared error after node $t$ is split.
For the set of decision trees $\{T_m\}_{m=1}^{M}$, the global importance of a feature variable is measured by the average of its importance over the single decision trees, as shown in the following equation:
$$\hat{I}_k^2 = \frac{1}{M} \sum_{m=1}^{M} \hat{I}_k^2(T_m)$$
where $M$ is the number of decision trees, $\hat{I}_k^2(T_m)$ is the importance of prediction variable $k$ in the m-th decision tree, and the sum of the importance of all prediction variables is 1.
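The importance calculation can be illustrated with a short Python sketch. The representation of a tree as a list of `(feature_index, squared_error_reduction)` pairs, one per non-leaf node, and the function names are assumptions made for this example; the final division implements the statement that the importances of all prediction variables sum to 1.

```python
def tree_importance(internal_nodes, n_features):
    """Per-feature importance in one tree.

    `internal_nodes` lists one (feature_index, squared_error_reduction)
    pair for each of the J-1 non-leaf nodes of the tree.
    """
    importance = [0.0] * n_features
    for feature, reduction in internal_nodes:
        importance[feature] += reduction
    return importance


def ensemble_importance(trees, n_features):
    """Average the per-tree importances over M trees, then normalize to sum 1."""
    totals = [0.0] * n_features
    for nodes in trees:
        for k, value in enumerate(tree_importance(nodes, n_features)):
            totals[k] += value / len(trees)
    norm = sum(totals)
    return [value / norm for value in totals]


# Two hypothetical trees over three prediction variables.
example_trees = [
    [(0, 4.0), (1, 1.0)],  # tree 1 splits on variables 0 and 1
    [(0, 2.0), (2, 1.0)],  # tree 2 splits on variables 0 and 2
]
weights = ensemble_importance(example_trees, n_features=3)
```

In this made-up example variable 0 receives three quarters of the total weight because it produces the largest error reductions in both trees.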
In step four, a new risk behavior prediction variable is input into the risk behavior gradient lifting decision tree prediction model, and the corresponding risk behavior prediction value is output; risk behavior indexes are formed from the risk behavior prediction values of different individuals, and risk behavior levels are divided.
And on the premise that the risk behavior gradient lifting decision tree prediction model exists, conveying the new individual prediction variable to the risk behavior gradient lifting decision tree prediction model.
Calculating a plurality of sub-decision trees of the risk behaviors by using a risk behavior gradient lifting decision tree prediction model, and generating a multi-decision tree prediction value data set by using the predicted risk behavior prediction values, wherein one piece of data corresponds to one sub-decision tree; the risk behavior gradient lifting decision tree prediction model sends a multi-decision tree prediction value data set to a risk behavior level prediction and judgment module.
The risk behavior level prediction and judgment module performs a normalized calculation on all predicted values in the multi-decision tree predicted value dataset:
$$S(x) = \frac{1}{1 + e^{-x}}$$
where $x$ is the value to be processed. The function maps the preceding predicted result value to a real number in the interval (0, 1), and the judgment is based on this value: when the S(x) value is larger than the threshold, risk-free behavior is determined; when the S(x) value is less than the threshold, risky behavior is determined.
Outputting a risk behavior predicted value with a value of 0 to 1, converting data with a value of 0 to 1 into a percentile to form a risk behavior index with a 0-100 percentile, and dividing different risk behavior levels according to different risk behavior indexes.
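The mapping from a raw predicted value to a 0-100 index and then to a level can be sketched as follows. The sigmoid is the standard function that maps a real number into (0, 1); the cut points 33 and 66 and the neutral labels "level 1" to "level 3" are purely hypothetical, since the text only says that levels are divided according to the index without giving the bands or stating which band corresponds to which severity.

```python
import math


def sigmoid(x):
    """Map a real-valued prediction into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def risk_index(raw_prediction):
    """Convert the 0-1 value into a 0-100 percentile-style index."""
    return 100.0 * sigmoid(raw_prediction)


def risk_level(index, cut_points=(33.0, 66.0),
               labels=("level 1", "level 2", "level 3")):
    """Assign an index to a band; cut points and labels are assumptions."""
    for cut, label in zip(cut_points, labels):
        if index < cut:
            return label
    return labels[-1]


index = risk_index(0.0)      # a raw prediction of 0 maps to the midpoint
level = risk_level(index)
```

Any real deployment would replace the cut points and labels with the division standard chosen by the professionals using the system.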
Another object of the present invention is to provide a risk behavior information prediction system applying the risk behavior information prediction method based on a gradient lifting decision tree, where the risk behavior information prediction system based on the gradient lifting decision tree includes:
the data acquisition module is used for collecting risk behavior information evaluation data through three risk behavior evaluation methods: psychological scale evaluation, the House-Tree-Person projection test, and expert interview evaluation;
the data preprocessing module is used for normalizing the data and removing redundant variables by means of Spearman moment correlation, point-biserial correlation analysis, the χ² test, and binary logistic regression;
the model training module is used for establishing training set and test set data, training the gradient lifting decision tree prediction model with the training set, and verifying the performance of the gradient lifting decision tree prediction model with the test set;
the risk behavior prediction and judgment module is used for inputting a risk behavior prediction variable into the gradient lifting decision tree prediction model, outputting a risk behavior prediction value, determining a risk behavior index and automatically dividing a risk behavior level.
Another object of the present invention is to provide a computer device, which includes a memory and a processor, wherein the memory stores a computer program, and the computer program when executed by the processor causes the processor to execute the steps of the risk behavior information prediction method based on the gradient boost decision tree.
Another object of the present invention is to provide a computer readable storage medium storing a computer program, which when executed by a processor, causes the processor to execute the steps of the risk behavior information prediction method based on a gradient boost decision tree.
The invention further aims to provide an information data processing terminal which is used for realizing the risk behavior information prediction system based on the gradient lifting decision tree.
In combination with the technical scheme and the technical problems to be solved, the technical scheme to be protected has the following advantages and positive effects:
First, in view of the technical problems in the prior art and the difficulty of solving them, and in close combination with the claimed technical solution and the results and data obtained during research and development, the following analyzes in detail how the technical solution of the invention solves these problems and what inventive technical effects it brings. The specific description is as follows:
The invention provides a risk behavior information prediction method based on a gradient lifting decision tree. To eliminate the deviation produced by a single data source, the psychological scale evaluation method, the projection test, and the expert interview evaluation method are organically combined, and a self-developed soldier risk behavior evaluation tool is used to collect data on the soldier population. Five machine learning models, namely binary logistic regression (LR), support vector machine (SVM), random forest (RSF), extreme gradient boosting (XGBoost), and gradient lifting decision tree (GBDT), are compared on the same data set, and the gradient lifting decision tree model is finally determined to perform best, forming the risk behavior gradient lifting decision tree prediction model. A new individual's risk behavior prediction variables are input into the prediction model, which outputs a risk behavior prediction value, forms a risk behavior index, and divides the risk behavior level. The model also outputs the relative importance of the prediction variables, so that soldier risk behaviors can be accurately predicted and strong data support is provided for the prevention of, and intervention in, soldier psychological crises.
Secondly, the technical scheme is regarded as a whole or from the perspective of products, and the technical scheme to be protected has the following technical effects and advantages:
The risk behavior information prediction system is developed on the basis of the soldier risk behavior information prediction method and is characterized by accurate screening of risk behavior information. It makes reasonable use of the gradient lifting decision tree model and other data analysis methods and organically combines three kinds of data (scale evaluation data, House-Tree-Person drawing feature data, and risk behavior judgment scores) to establish a risk behavior prediction model. The prediction performance of the model is good: the accuracy, sensitivity, and specificity are 83.74%, 85.76%, and 81.71%, respectively, so that the risk behaviors of individuals or groups can be effectively screened and risk levels defined. The system is automated throughout: first, soldiers who need risk behavior screening can complete the risk behavior assessment independently; second, professionals are helped to complete large-scale risk behavior information screening more efficiently and effectively.
Thirdly, as inventive supplementary evidence of the claims of the present invention, the following important aspects are also presented:
the technical scheme of the invention fills the technical blank in the domestic and foreign industries:
The soldier risk behavior prediction method uses a range function to correct the data deviation between the scale evaluation data and the House-Tree-Person drawing test in each sample to form combined evaluation data, fully considering the authenticity of data acquisition. When the soldier risk behavior gradient lifting decision tree prediction model is established, the risk behavior judgment scores obtained by expert interview evaluation are taken as the prediction target, and the items or dimensions of the combined evaluation data are taken as prediction variables. Multiple data sources are thus combined, the lie tendency in psychological scale evaluation is avoided to the greatest extent, and a gradient lifting tree prediction model of soldier risk behavior information is creatively established. The model predicts well, with a soldier risk behavior detection rate above 85%, which helps professionals give intervention measures in time, is of great significance for preventing soldier risk behaviors, and can reduce non-combat personnel losses to the greatest extent.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments of the present invention will be briefly described below, and it is obvious that the drawings described below are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a risk behavior information prediction method based on a gradient lifting decision tree provided by an embodiment of the present invention;
FIG. 2 is a flowchart of a risk behavior information prediction system based on a gradient boosting decision tree according to an embodiment of the present invention;
FIG. 3 is a schematic illustration of a GBDT process according to an embodiment of the present invention;
FIG. 4 is a flowchart of the gradient boosting decision tree algorithm applied to risk behavior samples for statistical classification according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
The embodiment of the invention provides a soldier risk behavior information prediction system for detecting whether risk behavior information exists and for determining the risk level. The system comprises a data acquisition module, a data preprocessing module, a data model training module and a risk behavior prediction and judgment module. The data acquisition module acquires three kinds of evaluation data, namely scale evaluation data, House-Tree-Person drawing feature data and risk behavior evaluation scores, through an evaluation tool system platform. The data preprocessing and training modules automatically preprocess the acquired data through a data processing and analysis system platform to form the data features to be identified by the model, and process redundant data and variables through network technology coding. The data are randomly divided into a training set and a test set; a gradient boosting decision tree (GBDT) prediction model is established with the training set, and the prediction variables of the test set are fed into the model to verify its effect. In addition, performance on the same data set is compared with binary logistic regression, support vector machine, random forest and extreme gradient boosting (XGBoost) prediction models, which further demonstrates the accuracy and reliability of the GBDT model. Finally, the importance of the prediction variables is output to evaluate the relative importance of each prediction variable to the model. The risk behavior prediction and judgment module inputs new risk behavior prediction variables into the GBDT prediction model through a big data storage and display platform, outputs the risk behavior index of each tested individual, and divides risk levels according to the division standard.
Aiming at the problems existing in the prior art, the invention provides a risk behavior information prediction method and a risk behavior information prediction system based on a gradient lifting decision tree, and the invention is described in detail below with reference to the accompanying drawings.
As shown in fig. 1, the risk behavior information prediction method based on the gradient lifting decision tree provided by the embodiment of the invention includes the following steps:
S101, acquiring scale evaluation data, House-Tree-Person drawing feature data and expert interview evaluation data through psychological scale evaluation, the House-Tree-Person projective test and expert interview evaluation, respectively;
S102, preprocessing the data: normalizing it to a unified 0-1 data format; correcting deviations on the same item or dimension between the scale evaluation data and the House-Tree-Person drawing feature data with a range function to form combined evaluation data; applying Spearman moment correlation, point-biserial correlation analysis and the χ² test and removing one variable of each highly correlated pair; and removing redundant variables that do not contribute to the prediction target by binary logistic regression;
S103, establishing training set and test set data: in the training set, taking the population with risk behavior according to the risk behavior judgment score as the prediction target, and taking as prediction variables the 15 variables found significant by binary logistic regression, such as depression, stressful life events, social support, perceived burdensomeness and childhood adversity; establishing a soldier risk behavior gradient boosting decision tree (GBDT) prediction model; and checking the performance of the GBDT prediction model with the test set. In addition, performance on the same data set is compared with 4 other data analysis methods, namely binary logistic regression, support vector machine, random forest and the extreme gradient boosting (XGBoost) algorithm, which further demonstrates the accuracy and reliability of the GBDT model;
S104, inputting the prediction variables of a new individual into the GBDT prediction model, outputting a risk behavior prediction value, converting the prediction value into a risk behavior index with a function formula, and dividing risk behavior levels according to a set rule.
Example 1
The risk behavior information prediction method based on the gradient lifting decision tree provided by the embodiment of the invention specifically comprises the following steps:
Step S1: the online platform collects scale evaluation data with a self-developed soldier risk behavior evaluation scale. The scale consists of personal basic information (4 dimensions such as gender and time of enlistment), physiological factors (mental disorders, somatic diseases, substance abuse and dependence), psychological factors (personality characteristics, including neuroticism and extraversion; cognitive characteristics, comprising 7 dimensions such as cognitive rigidity, coping style and rumination; emotional characteristics, including anxiety, depression and frustration) and social factors (4 dimensions such as stressful life events and social support). In addition, the invention analyzes the House-Tree-Person projective test to obtain the risk-behavior-related psychological or behavioral characteristics it expresses, together with the physiological, psychological and social factors that influence risk behavior. This yields more objective and truthful House-Tree-Person drawing feature data and corrects, as far as possible, the data deviation caused by a sense of shame and a tendency to lie. Risk behavior evaluation scores are acquired by expert interview evaluation and classified into no risk behavior and risk behavior, which serve as the prediction target of the prediction model.
Step S2: preprocessing the data. The data are normalized to a unified 0-1 data format, and deviations on the same item or dimension between the scale evaluation data and the House-Tree-Person drawing feature data are corrected with a range function to form combined evaluation data. Spearman moment correlation, point-biserial correlation analysis and the χ² test are used to calculate the association between prediction variables, and one variable of each excessively associated pair is deleted. Binary logistic regression is used to detect potential prediction variables, and stepwise regression is performed to select the prediction variables that are significant in the logistic regression.
Step S3: taking the population with risk behavior according to the risk behavior judgment score as the prediction target, and taking as prediction variables the 15 variables found significant by binary logistic regression, such as depression, stressful life events, social support, perceived burdensomeness and childhood adversity, a prediction model of the sample data is established with the gradient boosting decision tree (GBDT) algorithm.
The GBDT model for risk behavior prediction provided by the embodiment of the invention is trained as follows. First, the sample data set is randomly divided into a training set and a test set at a ratio of 3:1; the training set is used to train the GBDT prediction model, and suitable hyperparameters are set according to the highest prediction accuracy of the model. Second, the independent test set is used only to verify and evaluate the model: the GBDT model is trained with suitable hyperparameters on the new balanced training set, and the performance indexes of the model are verified independently from multiple angles. In addition, the invention compares performance on the same data set with 4 other data analysis methods, namely binary logistic regression (LR), support vector machine (SVM), random forest (RF) and the extreme gradient boosting (XGBoost) algorithm, which demonstrates the accuracy and reliability of the GBDT model. Finally, the best machine learning model obtained is saved for predicting the risk behavior of individual soldiers in the future.
Step S4: inputting the risk behavior prediction variables of a new individual into the risk behavior GBDT prediction model and outputting the corresponding risk behavior prediction value; forming risk behavior indexes from the prediction values of different individuals and dividing risk behavior levels.
The risk behavior information prediction system based on the gradient boosting decision tree (GBDT) provided by the embodiment of the invention comprises the following modules, with reference to FIG. 2:
the data acquisition module S201, configured to collect risk behavior information evaluation data by three risk behavior evaluation methods, namely psychological scale evaluation, the House-Tree-Person projective test and expert interview evaluation;
the data preprocessing module S202, configured to normalize the data and to remove redundant variables by Spearman moment correlation, point-biserial correlation analysis, the χ² test and binary logistic regression;
the model training module S203, configured to establish training set and test set data, train the GBDT prediction model with the training set, and check the performance of the GBDT prediction model with the test set;
the risk behavior prediction and judgment module S204, configured to input risk behavior prediction variables into the GBDT prediction model, output a risk behavior prediction value, convert the prediction value into a risk behavior index according to a set rule, and divide risk behavior levels.
To demonstrate the inventiveness and technical value of the technical solution of the present invention, this section gives application examples of the claimed technical solution in specific products or related technologies.
The soldier risk behavior information prediction method, carried by a computer system, serves the dual purposes of individual self-assessment and large-scale automated assessment. The detection rate of soldier risk behavior exceeds 85%, and the method has matured for application in products such as the enlistment adaptability assessment of new recruits, the general mental health survey of officers and soldiers, and the army psychological crisis intervention service.
Risk behavior screening is the most important dimension of the adaptability assessment of new recruits. The risk behavior information of new recruits can be accurately detected and the main prediction variables identified. Combining these functions with the risk behavior information prediction system gives accurate prediction, automatic monitoring and intelligent early warning a theoretical and technical basis and a path to implementation. Breakthroughs and results in the field of "soldier risk behavior information prediction" can provide momentum for effectively reducing and controlling soldier risk behavior.
The health survey of officers and soldiers mainly investigates their health condition. Because of the special nature of their work, soldiers often fight on the front line at disaster sites and face harsh events most directly, so severe impact and injury can produce psychological stress responses. These responses must be intervened in and handled in time to ensure the physical and mental health of officers and soldiers. It is therefore necessary to conduct regular mental health surveys of soldiers. Officers and soldiers with excessive mental stress or serious risk behavior problems are rapidly screened out by the risk behavior gradient boosting decision tree (GBDT) prediction method combined with other assessment methods, which helps psychological staff and service staff to attend to them and carry out crisis intervention in time, relieves mental stress, and strengthens the line of defense in troop management.
In the army psychological crisis intervention service, psychological staff use the soldier risk behavior information prediction system as an important diagnostic tool. The system acquires the prediction variable data of soldiers who need crisis intervention, automatically outputs risk behavior indexes and levels, and determines the crisis intervention level. It can fully reflect the key psychological problems of personnel with risk behavior and thus gives psychological staff the most effective working direction.
Example 2
The risk behavior information prediction method based on the gradient lifting decision tree provided by the embodiment of the invention specifically comprises the following steps:
Step S1: the online platform collects scale evaluation data with a self-developed soldier risk behavior evaluation scale. The scale consists of personal basic information (4 dimensions such as gender and time of enlistment), physiological factors (mental disorders, somatic diseases, substance abuse and dependence), psychological factors (personality characteristics, including neuroticism and extraversion; cognitive characteristics, comprising 7 dimensions such as cognitive rigidity, coping style and rumination; emotional characteristics, including anxiety, depression and frustration) and social factors (4 dimensions such as stressful life events and social support). In addition, more objective and truthful risk-behavior-related data are obtained by analyzing the risk-behavior-related psychological or behavioral characteristics expressed in the House-Tree-Person projective test and the expert interview evaluation, together with the physiological, psychological and social factors that influence risk behavior, so that the data deviation caused by a sense of shame and a tendency to lie is corrected as far as possible.
Scale evaluation data: a psychological scale uses numbers to distinguish the categories and grades of psychological problems, and the selected categories and grades form scale evaluation data consisting of several groups of numbers. The psychological problems are divided into 245 items, some scored 1-5 and others scored 1-2. Each soldier selects the score that fits according to the description of the scores for each item, so each person yields 245 data points, namely 245 items of psychological evaluation data per soldier. For example, one item asks whether you would free yourself through suicide, with the options: 1 point, strongly reject; 2 points, reject; 3 points, neutral; 4 points, somewhat accept; 5 points, accept.
House-Tree-Person drawing feature data: the House-Tree-Person test is a psychological projective test, a drawing test and a method of analyzing psychological condition. Compared with scale evaluation, it has unique advantages in identifying concealed risk behaviors and can measure what an individual really thinks about risk behavior. Professionals classify the drawing features of each person into psychological dimensions such as mental disorder, personality characteristics, sense of belonging, social support, emotional characteristics, stress level, and risk behavior ideation or risk behavior attempts, and score them 1-5 or 1-2. The House-Tree-Person drawing feature data are used to correct deviations in the scale evaluation data, forming combined evaluation data.
Risk behavior evaluation score: the basic method by which professionals learn about the risk-behavior-related psychology and behavior of interviewees through face-to-face conversation. It mainly covers dimensions such as mental disorder, somatic disease, substance abuse and dependence, personality characteristics, sense of belonging, social support, emotional characteristics, stress level, childhood adversity, failed suicide attempts, death intolerance, and risk behavior ideation or risk behavior attempts. The relevant dimensions are scored 1-5 or 1-2 to form the risk behavior judgment score, and the population with risk behavior is determined from the risk behavior judgment score obtained by expert interview evaluation.
Step S2: preprocessing the data. The data are normalized to a unified 0-1 data format, and deviations on the same item or dimension between the scale evaluation data and the House-Tree-Person drawing feature data are corrected with a range function to form combined evaluation data. To obtain a simple and reliable prediction model, redundant and duplicate variables must be kept from affecting model performance. First, Spearman moment correlation, point-biserial correlation analysis and the χ² test are used to calculate the association between prediction variables, and one variable of each excessively associated pair is deleted. Second, binary logistic regression is used to detect potential prediction variables, and stepwise regression is performed to select the prediction variables that are significant in the logistic regression.
The preprocessing provided by the embodiment of the invention comprises normalizing the data to a unified 0-1 data format. Taking the above division of the psychological problems into 245 items as an example, the raw psychological characteristic data lie between 1 and 5 (items 1 to 151 and items 196 to 245) or between 1 and 2 (items 152 to 195), and all psychological data are normalized to values between 0 and 1. The normalization formula is as follows:
$$x_i = \frac{t_i - t_i^{\min}}{t_i^{\max} - t_i^{\min}}$$
where $x_i$ is the normalized value of the i-th item, $t_i$ is the value of the i-th item of the sample data, $t_i^{\min}$ is the minimum value of the i-th item over all collected sample data, and $t_i^{\max}$ is the maximum value of the i-th item over all collected sample data.
For example, the data of items 1 to 151 and items 196 to 245 all take values from 1 to 5, so the minimum is 1 and the maximum is 5, and the conversion formula is $x_i = (t_i - 1)/4$.
The data of items 152 to 195 all take values from 1 to 2, so the minimum is 1 and the maximum is 2, and the conversion formula is $x_i = t_i - 1$.
Normalization was done for 245 questions for all samples according to the method described above.
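The normalization step can be illustrated with a minimal Python sketch. The function name `min_max_normalize` and the three-respondent sample are hypothetical and are not taken from the embodiment; the sketch only assumes that each item is scaled with its own observed minimum and maximum, as the formula above describes.

```python
def min_max_normalize(samples):
    """Scale each item (column) to [0, 1] using that item's observed min and max."""
    n_items = len(samples[0])
    lows = [min(row[i] for row in samples) for i in range(n_items)]
    highs = [max(row[i] for row in samples) for i in range(n_items)]
    normalized = []
    for row in samples:
        out = []
        for t, lo, hi in zip(row, lows, highs):
            # A constant item has hi == lo; map it to 0.0 to avoid dividing by zero.
            out.append((t - lo) / (hi - lo) if hi > lo else 0.0)
        normalized.append(out)
    return normalized


# Three hypothetical respondents: item 0 is scored 1-5, item 1 is scored 1-2.
raw = [[1, 1], [3, 2], [5, 1]]
scaled = min_max_normalize(raw)
print(scaled)
```

With these inputs the 1-5 item is mapped by (t - 1)/4 and the 1-2 item by t - 1, matching the two special cases stated above.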
The preprocessing provided by the embodiment of the invention further comprises correcting deviations in the scale evaluation data with the House-Tree-Person drawing feature data. For each sample, a range function judges the difference between the scale evaluation data and the House-Tree-Person drawing feature data on the corresponding item or dimension. If the difference is greater than 0.2, the scale evaluation value is discarded and the House-Tree-Person value is kept; otherwise the scale evaluation value is kept. The result is the combined evaluation data.
$$\mathrm{MEW}(x, y) = \left[\,|x_i - y_i|\,w_{x_i},\ |x_i - y_i|\,w_{y_i}\,\right]$$
where $x_i$ is the i-th item or dimension in the scale evaluation data of a sample and $y_i$ is the corresponding item or dimension in the House-Tree-Person drawing feature data. When $|x_i - y_i| \le 0.2$, $w_{x_i} = 1$ and $w_{y_i} = 0$; when $|x_i - y_i| > 0.2$, $w_{x_i} = 0$ and $w_{y_i} = 1$.
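The selection rule described in prose above can be sketched as follows. This is an illustrative reading, not the embodiment's implementation: it applies the weights as a choice between the two sources (scale value when the two agree within 0.2, House-Tree-Person value otherwise) rather than evaluating the printed MEW expression literally. The names `combine` and `combine_profile` and the sample values are hypothetical.

```python
THRESHOLD = 0.2  # disagreement tolerance stated in the text


def combine(scale_value, htp_value, threshold=THRESHOLD):
    """Keep the scale score when both sources agree within the threshold,
    otherwise fall back to the House-Tree-Person score."""
    if abs(scale_value - htp_value) <= threshold:
        return scale_value
    return htp_value


def combine_profile(scale_scores, htp_scores):
    """Apply the rule item by item to two normalized (0-1) score lists."""
    return [combine(x, y) for x, y in zip(scale_scores, htp_scores)]


# Hypothetical normalized scores for three matching dimensions.
scale = [0.25, 0.50, 1.00]
htp = [0.25, 0.75, 0.50]
merged = combine_profile(scale, htp)
print(merged)
```

In the example, the first dimension agrees and keeps the scale value, while the second and third differ by more than 0.2 and take the House-Tree-Person value.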
To keep redundant and duplicate variables from affecting the prediction model, Spearman moment correlation, point-biserial correlation analysis and the χ² test are used to calculate the association between prediction variables, and one variable of each excessively associated pair is deleted. The specific steps are as follows:
1. The correlation among 15 continuous prediction variables, such as time of enlistment, rank, death intolerance, neuroticism and extraversion, is calculated with the Spearman moment correlation, by the formula:
$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$
where $x_i - \bar{x}$ is the distance of one of the prediction variables to its mean and $y_i - \bar{y}$ is the distance of the other prediction variable to its mean. The analysis finds that sense of belonging and social support are strongly correlated, which indicates that the psychological traits they measure are very similar; with reference to the risk behavior literature, the prediction variable sense of belonging is removed.
2. The correlation between 5 categorical variables, such as gender and mental disorder, and 15 continuous variables, such as age, time of enlistment, rank, neuroticism and extraversion, is calculated with point-biserial correlation analysis, by the formula:
$$r_{pb} = \frac{\bar{X}_p - \bar{X}_q}{s_t}\sqrt{pq}$$
where $\bar{X}_p$ is the mean of the continuous variable for one value of the binary variable; $\bar{X}_q$ is the mean of the continuous variable for the other value of the binary variable; $p$ and $q$ are the proportions of the two values of the binary variable; and $s_t$ is the standard deviation of the continuous variable. The analysis shows that the correlation coefficients between the variables are all below 0.4, so the correlation is weak.
3. The correlation among categorical variables, namely gender, mental disorder, somatic disease, substance abuse and dependence, childhood adversity and risk behavior exposure history, is calculated with the χ² test, by the formula:
$$\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e}$$
where $f_0$ is the actual observed frequency and $f_e$ is the theoretical frequency. The analysis finds no correlation between the variables.
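Two of the statistics above, the point-biserial coefficient and the χ² statistic, can be computed directly from their definitions. The sketch below is illustrative only: the function names and sample data are hypothetical, and it assumes that $s_t$ is the population standard deviation (divisor N), which the text does not specify.

```python
import math


def point_biserial(binary, values):
    """r_pb = (mean_p - mean_q) / s_t * sqrt(p * q).

    Assumption: s_t is the population standard deviation of all values.
    Both groups must be non-empty and the values must not all be equal.
    """
    group_p = [v for b, v in zip(binary, values) if b == 1]
    group_q = [v for b, v in zip(binary, values) if b == 0]
    n = len(values)
    p, q = len(group_p) / n, len(group_q) / n
    mean_all = sum(values) / n
    s_t = math.sqrt(sum((v - mean_all) ** 2 for v in values) / n)
    mean_p = sum(group_p) / len(group_p)
    mean_q = sum(group_q) / len(group_q)
    return (mean_p - mean_q) / s_t * math.sqrt(p * q)


def chi_square(table):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total  # theoretical frequency f_e
            stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical data: a binary variable against a continuous score,
# and a 2x2 table of two categorical variables.
r_pb = point_biserial([1, 1, 0, 0], [3, 5, 1, 3])
chi2 = chi_square([[10, 20], [20, 10]])
print(round(r_pb, 4), round(chi2, 4))
```

The χ² value would still need to be compared with a critical value for the table's degrees of freedom before a variable is judged associated; that lookup is omitted here.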
Binary logistic regression further selects the data related to the prediction target. Variables are added from few to many, one at a time, until no further variable can be introduced. The specific steps are as follows:
1. For the 24 regression prediction variables $X_1, X_2, X_3, \dots, X_{24}$, a univariate logistic regression model with the prediction target (risk behavior) is established for each:
$$\mathrm{Odds} = e^{\beta_0 + \beta_i X_i + \varepsilon}$$
$$\log(\mathrm{Odds}) = \beta_0 + \beta_i X_i + \varepsilon,\quad i = 1, \dots, P$$
Odds = with risk behavior / without risk behavior
The value of the test statistic of the regression coefficient corresponding to each of the variables $X_1, X_2, X_3, \dots, X_{24}$ is calculated and denoted $F_1^{(1)}, \dots, F_{24}^{(1)}$, and the maximum value $F_{i_1}^{(1)}$ is taken, namely:
$$F_{i_1}^{(1)} = \max\{F_1^{(1)}, \dots, F_{24}^{(1)}\}$$
For a given significance level of 0.05, the corresponding critical value is denoted $F^{(1)}$. If $F_{i_1}^{(1)} > F^{(1)}$, then $X_{i_1}$ is introduced into the regression model, and $I_1$ denotes the index set of the selected variables.
2. Bivariate regression models of the prediction target $\log(\mathrm{Odds})$ on the variable subsets $\{X_{i_1}, X_1\}, \dots, \{X_{i_1}, X_{i_1-1}\}, \{X_{i_1}, X_{i_1+1}\}, \dots, \{X_{i_1}, X_{24}\}$ are established, 23 in total. The F test statistic of the regression coefficient of each variable is calculated and denoted $F_k^{(2)}$ ($k \notin I_1$). The largest is selected and denoted $F_{i_2}^{(2)}$, and the subscript of the corresponding prediction variable is denoted $i_2$, namely:
$$F_{i_2}^{(2)} = \max\{F_1^{(2)}, \dots, F_{i_1-1}^{(2)}, F_{i_1+1}^{(2)}, \dots, F_{24}^{(2)}\}$$
For a given significance level of 0.05, the corresponding critical value is denoted $F^{(2)}$. If $F_{i_2}^{(2)} > F^{(2)}$, then the variable $X_{i_2}$ is introduced into the regression model; otherwise the variable introduction process terminates.
3. The regression of the prediction target on the variable subsets $\{X_{i_1}, X_{i_2}, X_k\}$ is considered and step 2 is repeated. The procedure continues in this way, each time choosing one variable from the prediction variables not yet introduced into the regression model, until the test admits no further variable. Finally, 15 variables such as depression, stressful life events, social support and perceived burdensomeness are selected as prediction variables.
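The control flow of the forward selection in steps 1 to 3 can be sketched generically. In this sketch the test statistic is supplied by the caller; `toy_statistic` and the `strength` values are purely hypothetical stand-ins, since computing a real statistic requires fitting a logistic regression for every candidate, which is omitted here. A single fixed critical value is also an assumption; the text allows a different critical value at each step.

```python
def forward_select(candidates, statistic, critical_value):
    """Greedy forward selection.

    candidates: iterable of variable names.
    statistic(selected, candidate): test statistic of `candidate` when it is
    added to the variables already selected (supplied by the caller).
    """
    selected = []
    remaining = list(candidates)
    while remaining:
        scores = {c: statistic(selected, c) for c in remaining}
        best = max(scores, key=scores.get)
        if scores[best] <= critical_value:
            break  # no remaining variable passes the test; stop introducing variables
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical stand-in for the F statistic: a fixed strength per variable,
# halved for every variable already in the model.
strength = {"depression": 12.0, "social_support": 9.0, "rank": 3.0}


def toy_statistic(selected, candidate):
    return strength[candidate] / (2 ** len(selected))


chosen = forward_select(strength, toy_statistic, critical_value=4.0)
print(chosen)
```

Here the first two variables pass the threshold in turn (12.0, then 4.5) and the third fails (0.75), so selection stops with two variables.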
Step S3: taking the population with risk behavior according to the risk behavior judgment score as the prediction target, and taking as prediction variables the 15 variables found significant by binary logistic regression, such as depression, stressful life events, social support, perceived burdensomeness and childhood adversity, a prediction model of the sample data is established with the gradient boosting decision tree (GBDT) algorithm.
The prediction model of the sample data for soldier risk behavior prediction is established with the GBDT algorithm as follows. First, the sample data set is randomly divided into a training set and a test set at a ratio of 3:1; the training set is used to train the GBDT prediction model, and suitable hyperparameters are set according to the highest prediction accuracy of the model. Second, the independent test set is used only to verify and evaluate the model: the GBDT prediction model is trained with suitable hyperparameters on the new balanced training set, and the performance indexes of the model are verified independently from multiple angles. Finally, the model outputs the risk behavior index and the relative importance weights of the prediction variables.
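A random 3:1 division of the samples might look like the following sketch. The function name `split_3_to_1` and the fixed seed are assumptions made for reproducibility of the illustration; the embodiment does not state how the random division is implemented.

```python
import random


def split_3_to_1(samples, seed=0):
    """Shuffle a copy of the samples and return (training set, test set) at a 3:1 ratio."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # local generator, global state untouched
    cut = len(shuffled) * 3 // 4
    return shuffled[:cut], shuffled[cut:]


# 100 hypothetical sample identifiers.
train, test = split_3_to_1(range(100))
print(len(train), len(test))
```

The test set produced this way is disjoint from the training set, which is what allows it to serve as an independent check of the trained model.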
The training process of the gradient lifting decision tree prediction model provided by the embodiment of the invention comprises the following steps:
as shown in fig. 3, the gradient lifting decision tree contains a plurality of decision trees, and the final prediction model is generated by the results of all the decision trees together; each decision tree is constructed to reduce the residual error of the previous model, and the final residual error is close to zero point in the gradient direction in a continuous iteration mode.
As shown in fig. 4, an algorithm flow for implementing a gradient lifting decision tree algorithm for risk behavior samples to perform statistical classification provided in the embodiment of the present invention specifically includes:
Firstly, respectively extracting predicted variables such as volatility, risk behavior exposure history and the like, carrying out normalization processing on the predicted variables, and removing redundancy and repeated predicted variables by using correlation analysis and binary logistic regression.
And secondly, taking the predicted variable as an input sample of the model, continuously training and learning, and finally outputting the model to obtain a risk behavior predicted result.
The predictive model training and learning process provided by the embodiment of the invention is as follows:
1. The obtained prediction variables are input into the 1st gradient boosting decision tree to obtain the model's estimate for the training samples;
2. The model residual is calculated from the obtained sample estimates;
3. Based on the original sample input information and the residual, the 2nd model is trained, and the process is repeated until M models have been trained, finally yielding the prediction result of the risk behavior.
The gradient lifting decision tree algorithm flow provided by the embodiment of the invention specifically comprises the following steps:
For a risk behavior training data set $T = \{(x_1, y_1), \dots, (x_N, y_N)\}$ containing N samples, the specific algorithm flow is as follows:
1. First, the learner is initialized, namely:
$$f_0(x) = \arg\min_c \sum_{i=1}^{N} L(y_i, c)$$
where $f_0(x)$ is an initial tree with only one root node, $c$ is the constant that minimizes the loss function, and $L(y_i, c)$ is the loss function that measures the difference between the target value and the calculated value, $y_i$ being the target value of the i-th training sample.
To further improve the performance of the model and reduce the residual value, a log-likelihood function is introduced as the loss function to reduce the residual loss of the samples. Its expression is:
$$L(y, f(x)) = \log(1 + \exp(-y f(x)))$$
2. For each iteration $m = 1, 2, \dots, M$ and each sample $i = 1, 2, \dots, N$, the negative gradient of the i-th training sample, namely the residual, is calculated:
$$r_{mi} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{m-1}(x)}$$
The residual values obtained are taken as the true values of new samples, and the data $(x_i, r_{mi})$ $(i = 1, 2, \dots, N)$ are fitted to obtain a decision tree $T_m$ consisting of J leaf nodes, whose leaf node regions are $R_{mj}$ $(j = 1, 2, \dots, J)$. The best fit value of each leaf node is then:
$$c_{mj} = \arg\min_c \sum_{x_i \in R_{mj}} L(y_i, f_{m-1}(x_i) + c)$$
The strong learner is updated:
$$f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} c_{mj}\, I(x \in R_{mj})$$
where $I$ is the indicator function of whether a training sample falls in the j-th leaf node region.
3. After M rounds of iteration, the final learner is obtained:
$$f(x) = f_M(x) = f_0(x) + \sum_{m=1}^{M} \sum_{j=1}^{J} c_{mj}\, I(x \in R_{mj})$$
where $f_0(x)$ is the initial tree with only one root node, $c_{mj}$ is the constant that minimizes the loss function on leaf $j$ of tree $m$, $I$ is the indicator function of whether a sample falls in the j-th leaf node region, and $R_{mj}$ $(j = 1, 2, \dots, J)$ are the leaf node regions of the decision tree $T_m$ consisting of J leaf nodes.
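The algorithm flow above can be made concrete with a small pure-Python sketch. It is a simplified illustration under stated assumptions, not the embodiment's model: labels are coded as -1 and +1 so that the loss log(1 + exp(-y f(x))) applies; each tree is a depth-1 stump (J = 2) fitted to the residuals by least squares; the leaf constant is approximated by one Newton step rather than an exact minimization; and `learning_rate` is an added shrinkage parameter that the text does not mention (1.0 reproduces the update as written). All function names and the toy data are hypothetical.

```python
import math


def neg_gradient(y, f):
    """Negative gradient of log(1 + exp(-y*f)) with respect to f, for y in {-1, +1}."""
    return y / (1.0 + math.exp(y * f))


def fit_stump(X, r):
    """Find the single-feature threshold split that best fits residuals r (least squares).
    Assumes at least one feature takes two or more distinct values."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X})[:-1]:
            left = [i for i, row in enumerate(X) if row[j] <= thr]
            right = [i for i, row in enumerate(X) if row[j] > thr]
            err = 0.0
            for idx in (left, right):
                mean = sum(r[i] for i in idx) / len(idx)
                err += sum((r[i] - mean) ** 2 for i in idx)
            if best is None or err < best[0]:
                best = (err, j, thr, left, right)
    return best[1:]


def leaf_value(idx, r):
    """One Newton step toward the loss-minimizing constant c_mj on a leaf."""
    num = sum(r[i] for i in idx)
    den = sum(abs(r[i]) * (1.0 - abs(r[i])) for i in idx)
    return num / den if den > 1e-12 else 0.0


def fit_gbdt(X, y, n_trees=10, learning_rate=1.0):
    pos = sum(1 for v in y if v == 1) / len(y)
    f0 = math.log(pos / (1.0 - pos))  # constant that minimizes the initial loss
    f = [f0] * len(y)
    trees = []
    for _ in range(n_trees):
        r = [neg_gradient(yi, fi) for yi, fi in zip(y, f)]  # residuals r_mi
        j, thr, left, right = fit_stump(X, r)
        c_left = learning_rate * leaf_value(left, r)
        c_right = learning_rate * leaf_value(right, r)
        trees.append((j, thr, c_left, c_right))
        for i in left:
            f[i] += c_left
        for i in right:
            f[i] += c_right
    return f0, trees


def predict_proba(model, x):
    """Probability of the positive class: sigmoid of the summed tree outputs."""
    f0, trees = model
    f = f0 + sum(cl if x[j] <= thr else cr for j, thr, cl, cr in trees)
    return 1.0 / (1.0 + math.exp(-f))


# Toy one-feature data: low values labelled -1 (no risk), high values +1 (risk).
X = [[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]]
y = [-1, -1, -1, 1, 1, 1]
model = fit_gbdt(X, y)
print(predict_proba(model, [0.15]), predict_proba(model, [0.85]))
```

On this separable toy set every round picks the same split between 0.3 and 0.7 and pushes the two leaves further apart, so the predicted probabilities approach 0 and 1. Deeper trees and a limit on leaf count, as tuned in the hyperparameter discussion, would replace the stump in a realistic model.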
In order to further improve the model prediction accuracy, the process of improving the prediction effect by setting the model super-parameters is as follows:
1. as the GBDT model finally appears as a binary tree result, the maximum depth alpha of each decision tree and the number beta of leaf nodes not only can influence the complexity of the model, but also can easily cause the model to be over-fitted, thereby influencing the final prediction accuracy.
During the empirical analysis, the parameter settings must be adjusted repeatedly to determine the optimal parameters alpha and beta. The F1 score, a statistical index commonly used to measure the accuracy of binary classification models, is used to evaluate the prediction accuracy of the model. The relation between the parameter alpha (the maximum depth of the tree) and the prediction accuracy is obtained by training the model on the samples. The prediction accuracy of the GBDT model on the training samples stays above 77% and changes markedly with the maximum depth of the tree: it peaks at 81% when the maximum depth is 5 and falls below 81% when the maximum depth is greater or less than 5. The optimal parameter alpha determined by the invention is therefore 5.
2. Too many leaf nodes impair the generalization ability of the model and increase the risk of over-fitting, so the number of leaf nodes of each decision tree must be controlled during model training.
In the verification process, the relation between the model prediction accuracy and the maximum number of leaf nodes is obtained by repeatedly adjusting the parameter settings. Initially the accuracy improves as the number of leaf nodes increases, that is, the model can judge sample attributes more accurately once the prediction variables are split more finely. As the number of nodes increases further, the accuracy dips and then immediately reaches its maximum, after which it fluctuates as the maximum number of leaf nodes continues to grow. When the maximum number of leaf nodes is 10, the model accuracy is highest, at about 81.3%, so the optimal parameter beta determined by the invention is 10.
3. After the optimal parameters alpha and beta are set using the training data set, decision base classifiers with higher prediction accuracy can be obtained. Each decision tree is built on the residuals left after the samples have passed through the preceding subtrees, and the final prediction is the cumulative sum of the predictions of all preceding subtrees. The training set is used to generate 100 sub-classification decision trees with relatively high prediction accuracy. According to the relation between the number of decision trees and the prediction accuracy, the accuracy is highest when 81 trees are accumulated, so the first 81 decision trees are taken to form the sample classification base.
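The selection of the number of accumulated trees described in item 3 can be illustrated with a minimal sketch. It assumes that each trained sub-tree is available as a function returning an additive score, and that a cumulative score above 0 is read as "risk behavior"; the names `Tree`, `staged_accuracy` and `best_tree_count` are hypothetical and do not come from the embodiment.

```python
from typing import Callable, List, Sequence

# Hypothetical representation: a tree maps a feature vector to an additive score.
Tree = Callable[[Sequence[float]], float]

def staged_accuracy(trees: List[Tree], X, y, f0: float = 0.0) -> List[float]:
    """Accuracy of the ensemble made of the first k trees, for k = 1..len(trees)."""
    scores = [f0] * len(X)
    accuracies = []
    for tree in trees:
        # Each tree adds its contribution to the running sum of the earlier trees.
        scores = [s + tree(x) for s, x in zip(scores, X)]
        # Assumption: a positive cumulative score means class 1 (risk behavior).
        predictions = [1 if s > 0 else 0 for s in scores]
        correct = sum(p == t for p, t in zip(predictions, y))
        accuracies.append(correct / len(y))
    return accuracies

def best_tree_count(accuracies: List[float]) -> int:
    """1-based number of trees with the highest accuracy (the smallest on ties)."""
    return max(range(len(accuracies)), key=lambda k: accuracies[k]) + 1

# Toy data: one feature, two samples per class.
X = [[-2.0], [-1.0], [1.0], [2.0]]
y = [0, 0, 1, 1]
trees = [
    lambda x: 1.0,                            # a biased first tree
    lambda x: 2.0 if x[0] > 0 else -2.0,      # corrects the bias
    lambda x: -4.0 if x[0] > 1.5 else 0.0,    # over-corrects one sample
]
accuracies = staged_accuracy(trees, X, y)
print(accuracies, best_tree_count(accuracies))
```

In the embodiment the same idea would be applied to the 100 generated trees, keeping the prefix whose accuracy is highest.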
The process for checking the training model effect by using the test set provided by the embodiment of the invention comprises the following steps:
after the prediction model is established, its prediction performance is evaluated on an independent test set and compared with that of other models. The prediction model constructed by the invention outputs the predicted class probability, namely the probability of risk behavior, with a value between 0 and 1. The probability value at which the F1 value on the training set is maximal is selected as the early warning value: an individual whose predicted probability is smaller than the early warning value is judged to have no risk behavior, and an individual whose predicted probability is larger than the early warning value is judged to have risk behavior. Based on the early warning value, the early warning performance of the prediction model is further evaluated with the following evaluation indexes:
as shown in Table 1, the AUC value, accuracy, sensitivity, specificity, positive predictive value, negative predictive value and F1 value of the prediction model were 90.92%, 83.74%, 85.76%, 81.71%, 82.42%, 85.16% and 85.01, respectively.
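The choice of the early warning value and the evaluation indexes listed above can be sketched as follows. This is an illustration only: the function names are hypothetical, a probability equal to the early warning value is treated as "risk behavior" (the text does not specify the equality case), and the toy labels and probabilities are invented for the example.

```python
def confusion(y_true, y_prob, threshold):
    """Counts (tp, fp, tn, fn) when probabilities >= threshold are flagged as risk."""
    tp = fp = tn = fn = 0
    for truth, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity, specificity, PPV, NPV and F1 from the four counts."""
    def ratio(a, b):
        return a / b if b else 0.0
    sensitivity = ratio(tp, tp + fn)
    ppv = ratio(tp, tp + fp)
    return {
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": ratio(tn, tn + fp),
        "ppv": ppv,
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * ppv * sensitivity, ppv + sensitivity),
    }

def early_warning_value(y_true, y_prob):
    """Probability value with the maximal F1 on the given (training) data."""
    candidates = sorted(set(y_prob))
    return max(candidates, key=lambda c: metrics(*confusion(y_true, y_prob, c))["f1"])

# Invented toy data: 1 = risk behavior, 0 = no risk behavior.
y_train = [0, 0, 1, 1, 1, 0]
p_train = [0.1, 0.4, 0.35, 0.8, 0.7, 0.2]
warning = early_warning_value(y_train, p_train)
report = metrics(*confusion(y_train, p_train, warning))
print(warning, report)
```

In practice the early warning value would be fixed on the training set and the indexes then computed on the independent test set with that same value.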
In addition, the performance is compared on the same data set with four other data analysis methods, namely binary logistic regression, support vector machine, random forest and extreme gradient boosting (XGBoost), which further demonstrates the accuracy and reliability of the gradient lifting decision tree model. Finally, the optimal machine learning model obtained is stored for predicting the risk behaviors of individual soldiers in the future.
TABLE 1 Risk predictive evaluation index for test set and training set models
Comparison of the performance evaluation indexes of the models in Table 1 shows that the gradient lifting decision tree and the random forest model distinguish better whether risk behaviors exist. The gradient lifting decision tree model performs best: its AUCs on the independent test set and the training set are 88.33 and 90.92, respectively, and it detects 84.32% and 85.76% of the soldiers with risk behaviors on the two sets, respectively.
The output prediction variable relative importance weight provided by the embodiment of the invention comprises the following steps:
different from other models, the gradient lifting decision tree model can identify and sort the importance of the predicted variables according to the influence degree on the predicted result, and can not only shorten the calculation time and accelerate the training speed, but also improve the prediction precision of the model, and the specific method is as follows:
for a single decision tree T, the importance of a prediction variable k is calculated from the splits in which the variable was selected as the splitting variable during the iterative process, as shown in the following formula:
$$\hat{J}_k^2(T) = \sum_{t=1}^{J-1} \hat{i}_t^2 \, \mathbb{1}(v_t = k)$$
wherein J-1 is the number of non-leaf nodes, $v_t$ is the feature associated with the non-leaf node t, and $\hat{i}_t^2$ is the reduction in squared error after the node is split; the larger the value, the greater the influence of the feature on the prediction result and the more important the feature;
For the set of decision trees $\{T_m\}_{m=1}^{M}$, the global importance of a feature variable is measured by the average of its importance over the single decision trees, as shown in the following formula:
$$\hat{J}_k^2 = \frac{1}{M} \sum_{m=1}^{M} \hat{J}_k^2(T_m)$$
where M is the number of decision trees, $\hat{J}_k^2(T_m)$ is the importance of the prediction variable k in the m-th decision tree, and the importances of all prediction variables sum to 1.
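The relative importance calculation can be sketched in a few lines. The sketch assumes that each tree is summarized as a list of (feature index, squared-error reduction) pairs, one pair per non-leaf node; this representation and the function names are illustrative assumptions rather than part of the embodiment.

```python
def tree_importance(splits, n_features):
    """Importance of each feature in one tree: the sum of the squared-error
    reductions of the non-leaf nodes that split on that feature."""
    importance = [0.0] * n_features
    for feature, reduction in splits:
        importance[feature] += reduction
    return importance

def global_importance(trees, n_features):
    """Average the per-tree importances over all trees, then rescale to sum to 1."""
    totals = [0.0] * n_features
    for splits in trees:
        for k, value in enumerate(tree_importance(splits, n_features)):
            totals[k] += value
    means = [total / len(trees) for total in totals]
    scale = sum(means)
    return [m / scale for m in means] if scale else means

# Two toy trees over three features; each tuple is (feature index, reduction).
toy_trees = [
    [(0, 4.0), (1, 1.0)],
    [(0, 2.0), (2, 1.0)],
]
weights = global_importance(toy_trees, n_features=3)
print(weights)
```

Sorting the variables by these weights in descending order gives a ranking of the kind reported in Table 2.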
Table 2 relative importance of gradient-lifting decision tree model predicted variables
As shown in Table 2, the five most important prediction variables identified by the gradient lifting decision tree model are, in order, depression (29.50%), stressful life events (24.92%), social support (9.19%), anxiety (6.06%) and frustration (5.29%). Targeted interventions can be developed for at-risk individuals according to these important prediction variables.
Step S4: inputting the risk behavior prediction variables into the risk behavior gradient lifting decision tree prediction model, and outputting the corresponding risk behavior prediction values; forming risk behavior indexes from the risk behavior prediction values of different individuals, and dividing risk behavior levels.
1. Given an existing risk behavior gradient lifting decision tree prediction model, the prediction variables of a new individual are passed to the model.
2. The risk behavior gradient lifting decision tree prediction model evaluates its sub-decision tree models, and the resulting risk behavior prediction values form a multi-decision-tree prediction value data set in which each record corresponds to one sub-decision tree; the model then sends this data set to the risk behavior level prediction and judgment module.
3. The risk behavior level prediction and judgment module normalizes all prediction values in the multi-decision-tree prediction value data set with the function $S(x) = \frac{1}{1 + e^{-x}}$, wherein x is the value to be processed. The function maps the preceding prediction result to a real number in the interval (0, 1), and the judgment is made on this value: when the S(x) value is larger than the threshold, no risk behavior is determined; when the S(x) value is smaller than the threshold, risk behavior is determined.
4. A risk behavior prediction value between 0 and 1 is output and converted to a percentage scale to form a risk behavior index from 0 to 100, and different risk behavior levels are divided according to the risk behavior index. For example, using the threshold of item 3, values above the threshold form one level, no risk, and values below the threshold are divided into three levels, namely low risk, medium risk and high risk.
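The normalization and level division of items 3 and 4 can be sketched as follows. The text fixes only the direction of the comparison (values above the threshold mean no risk) and the number of levels; the threshold of 50, the three equal-width bands below it, the ordering of those bands, and the handling of a value exactly equal to the threshold are all assumptions made for illustration, as are the function names.

```python
import math

def sigmoid(x: float) -> float:
    """S(x) = 1 / (1 + e^(-x)), written so that large |x| cannot overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def risk_index(raw_score: float) -> int:
    """Map a raw prediction value to a 0-100 risk behavior index."""
    return round(100 * sigmoid(raw_score))

def risk_level(index: float, threshold: float = 50) -> str:
    """Divide the index into one level at or above the threshold and three below.

    Assumptions: threshold = 50, three equal bands below it, and a lower
    index meaning a higher risk, following the direction stated in item 3.
    """
    if index >= threshold:
        return "no risk"
    band = threshold / 3
    if index >= 2 * band:
        return "low risk"
    if index >= band:
        return "medium risk"
    return "high risk"

for raw in (2.0, -0.5, -1.0, -2.0):
    idx = risk_index(raw)
    print(raw, idx, risk_level(idx))
```

The cut points would normally be calibrated on the training data rather than fixed in advance.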
It should be noted that the embodiments of the present invention can be realized in hardware, software, or a combination of software and hardware. The hardware portion may be implemented using dedicated logic; the software portions may be stored in a memory and executed by a suitable instruction execution system, such as a microprocessor or special purpose design hardware. Those of ordinary skill in the art will appreciate that the apparatus and methods described above may be implemented using computer executable instructions and/or embodied in processor control code, such as provided on a carrier medium such as a magnetic disk, CD or DVD-ROM, a programmable memory such as read only memory (firmware), or a data carrier such as an optical or electronic signal carrier. The device of the present invention and its modules may be implemented by hardware circuitry, such as very large scale integrated circuits or gate arrays, semiconductors such as logic chips, transistors, etc., or programmable hardware devices such as field programmable gate arrays, programmable logic devices, etc., as well as software executed by various types of processors, or by a combination of the above hardware circuitry and software, such as firmware.
The foregoing is merely illustrative of specific embodiments of the present invention, and the scope of protection of the invention is not limited thereto. Any modification, equivalent substitution or improvement made by those skilled in the art within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. The risk behavior information prediction method based on the gradient lifting decision tree is characterized by comprising the following steps of:
the system acquires scale evaluation data, House-Tree-Person drawing characteristic data and expert interview evaluation data through psychological scale evaluation, the House-Tree-Person projective test and expert interview evaluation, respectively, preprocesses the three kinds of data to form the data features to be identified by the model, and synthesizes the scale evaluation data and the House-Tree-Person drawing characteristic data into combined evaluation data by applying a range function; the model takes the features of the combined evaluation data as prediction variables and the population with risk behaviors in the expert interview evaluation data as the prediction target, randomly divides the data into a training set and a test set, establishes a gradient lifting decision tree model using the training set data, and feeds the data features of the test set into the machine learning model to verify the model effect; finally, the model is used to predict the risk behaviors of new individuals: the behavior prediction variables are input into the gradient lifting decision tree prediction model, risk behavior indexes are output, and risk behavior levels are divided.
2. The risk behavior information prediction method based on a gradient lifting decision tree according to claim 1, characterized by comprising the following steps:
Step one, collecting risk behavior information evaluation data through three risk behavior evaluation methods, namely psychological scale evaluation, the House-Tree-Person projective test and expert interview evaluation;
step two, normalizing the data, and removing redundant variables using Spearman correlation, point-biserial correlation analysis, the χ² test and binary logistic regression;
step three, establishing training set and test set data, training the gradient lifting decision tree prediction model using the training set, and checking the performance of the gradient lifting decision tree prediction model using the test set;
and step four, inputting the risk behavior prediction variables into the gradient lifting decision tree prediction model, outputting a risk behavior prediction value, converting the risk behavior prediction value into a risk behavior index according to a certain rule, and dividing the risk behavior level.
3. The risk behavior information prediction method based on a gradient lifting decision tree according to claim 2, further comprising, prior to step one:
acquiring three kinds of evaluation data, namely scale evaluation data, House-Tree-Person drawing characteristic data and risk behavior evaluation scores, through an evaluation tool system platform.
4. The risk behavior information prediction method based on a gradient boosting decision tree according to claim 2, wherein the normalizing the data in the second step comprises:
normalizing the data into a unified data format of 0 to 1; the psychological questionnaire comprises n questions, and the raw psychological characteristic data collected for each question lie between 1 and 5 or between 1 and 2; all psychological data are converted by normalization into values between 0 and 1, with the following normalization formula:
$$x_i = \frac{t_i - t_i^{\min}}{t_i^{\max} - t_i^{\min}}$$
wherein $x_i$ is the normalized value of the i-th question, $t_i$ is the value of the i-th question of the sample data, $t_i^{\min}$ is the minimum value of the i-th question over all collected sample data, and $t_i^{\max}$ is the maximum value of the i-th question over all collected sample data;
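A minimal sketch of this per-question min-max normalization is given below. The function name and the choice to map a question whose answers are all identical to 0.0 are assumptions made for illustration.

```python
def min_max_normalize(samples):
    """Scale each question (column) to the range 0..1 over all collected samples.

    `samples` is a list of rows; row[i] is the raw answer t_i to question i.
    """
    n_questions = len(samples[0])
    minima = [min(row[i] for row in samples) for i in range(n_questions)]
    maxima = [max(row[i] for row in samples) for i in range(n_questions)]
    normalized = []
    for row in samples:
        normalized.append([
            # Assumption: a constant column carries no information and maps to 0.0.
            (row[i] - minima[i]) / (maxima[i] - minima[i]) if maxima[i] > minima[i] else 0.0
            for i in range(n_questions)
        ])
    return normalized

# First question scored 1..5, second question scored 1..2.
raw = [[1, 1], [3, 2], [5, 1]]
print(min_max_normalize(raw))
```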
in step two, removing redundant variables using Spearman correlation, point-biserial correlation analysis, the χ² test and binary logistic regression comprises: calculating the associations among the prediction variables using Spearman correlation, point-biserial correlation analysis and the χ² test, and deleting variables whose association is too strong, specifically comprising the following steps:
(1) The correlations among 15 continuous prediction variables, such as time of enlistment, military rank, death intolerance, neuroticism and extraversion, are calculated using Spearman correlation, with the calculation formula:
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}}$$
wherein $x - \bar{x}$ is the distance from one prediction variable to its average, and $y - \bar{y}$ is the distance from the other prediction variable to its average; the analysis finds a strong correlation between the sense of belonging and social support, indicating that the psychological traits they measure are highly similar; with reference to the literature on risk behaviors, the sense-of-belonging prediction variable is removed;
(2) The correlations between 5 categorical variables, such as gender and mental disorder, and 15 continuous variables, such as age, time of enlistment, military rank, neuroticism and extraversion, are calculated using point-biserial correlation analysis, with the calculation formula:
$$r_{pb} = \frac{\bar{X}_p - \bar{X}_q}{s_t} \sqrt{pq}$$
wherein $\bar{X}_p$ is the mean of the continuous variable corresponding to one value of the dichotomous variable; $\bar{X}_q$ is the mean of the continuous variable corresponding to the other value of the dichotomous variable; p and q are the proportions of the two values of the dichotomous variable; $s_t$ is the standard deviation of the continuous variable; the correlation coefficients among the variables are all lower than 0.4, and the correlations are weak;
(3) The correlations among the dichotomous variables gender, mental disorder, somatic disease, and substance abuse and dependence are calculated using the χ² test, with the calculation formula:
$$\chi^2 = \sum \frac{(f_0 - f_e)^2}{f_e}$$
wherein $f_0$ is the actual observed frequency; $f_e$ is the theoretical (expected) frequency; no correlation is found among the variables;
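The three association measures of steps (1) to (3) can be sketched with the standard library alone. The sketch assumes that Spearman correlation is computed as the product-moment formula applied to ranks (ties receive the average rank), that the point-biserial formula uses the population standard deviation, and that the χ² statistic is computed on a contingency table of counts; all function names are illustrative, and inputs with zero variance are not handled.

```python
import math

def ranks(values):
    """Ranks starting at 1; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(x, y):
    """Product-moment correlation built from deviations from the means."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(dx, dy))
    return numerator / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def spearman(x, y):
    """Spearman correlation: the product-moment formula applied to ranks."""
    return pearson(ranks(x), ranks(y))

def point_biserial(binary, continuous):
    """r_pb = (mean_p - mean_q) / s_t * sqrt(p * q) for a 0/1 variable."""
    ones = [c for b, c in zip(binary, continuous) if b == 1]
    zeros = [c for b, c in zip(binary, continuous) if b == 0]
    n = len(continuous)
    p, q = len(ones) / n, len(zeros) / n
    mean = sum(continuous) / n
    s_t = math.sqrt(sum((c - mean) ** 2 for c in continuous) / n)
    return (sum(ones) / len(ones) - sum(zeros) / len(zeros)) / s_t * math.sqrt(p * q)

def chi_square(table):
    """Sum of (observed - expected)^2 / expected over a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    return statistic

print(spearman([1, 2, 3, 4], [10, 20, 30, 45]))
print(point_biserial([0, 0, 1, 1], [1, 2, 3, 4]))
print(chi_square([[20, 0], [0, 20]]))
```

A variable pair whose coefficient is judged too high would then be pruned, as done for the sense-of-belonging variable in step (1).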
the binary logistic regression is used to select the data related to the prediction target, specifically comprising the following steps:
(1) For the 24 regression prediction variables $X_1, X_2, X_3, \ldots, X_{24}$, a univariate logistic regression model with the prediction target, risk behavior, is established for each variable:
$$\mathrm{Odds} = e^{\beta_0 + \beta_i X_i + \varepsilon}$$
$$\log(\mathrm{Odds}) = \beta_0 + \beta_i X_i + \varepsilon, \quad i = 1, \ldots, P;$$
Odds = probability of risk behavior / probability of no risk behavior;
the values of the test statistics of the regression coefficients corresponding to the variables $X_1, X_2, X_3, \ldots, X_{24}$ are calculated and denoted $F_1^{(1)}, \ldots, F_{24}^{(1)}$, and the maximum $F_{i_1}^{(1)}$ is taken:
$$F_{i_1}^{(1)} = \max\{F_1^{(1)}, \ldots, F_{24}^{(1)}\};$$
for a given significance level of 0.05, the corresponding critical value is denoted $F^{(1)}$; if $F_{i_1}^{(1)} > F^{(1)}$, then $X_{i_1}$ is introduced into the regression model, and $I_1$ denotes the index set of the selected variables;
(2) Binary regression models are established between the prediction target log(Odds) and the prediction variable subsets $\{X_{i_1}, X_1\}, \ldots, \{X_{i_1}, X_{i_1-1}\}, \{X_{i_1}, X_{i_1+1}\}, \ldots, \{X_{i_1}, X_{24}\}$; the F-test statistics of the regression coefficients of the variables are calculated and denoted $F_k^{(2)}$ ($k \notin I_1$); the largest one is selected and denoted $F_{i_2}^{(2)}$, and the index of the corresponding prediction variable is denoted $i_2$:
$$F_{i_2}^{(2)} = \max\{F_1^{(2)}, \ldots, F_{i_1-1}^{(2)}, F_{i_1+1}^{(2)}, \ldots, F_P^{(2)}\};$$
for a given significance level of 0.05, the corresponding critical value is denoted $F^{(2)}$; if $F_{i_2}^{(2)} > F^{(2)}$, the variable $X_{i_2}$ is introduced into the regression model; otherwise, the variable introduction process terminates;
(3) Regression is performed on the prediction variable subsets $\{X_{i_1}, X_{i_2}, X_k\}$ by repeating step (2), each time selecting one of the prediction variables not yet introduced into the regression model, until no further variable passes the test; 15 variables, such as depression, stressful life events, social support, sense of burdensomeness and childhood adversity, are finally selected as the prediction variables.
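The forward introduction of variables in steps (1) to (3) follows a simple loop, sketched below. The computation of the test statistic itself (fitting the logistic regression and testing the new coefficient) is left to a caller-supplied function; `statistic`, `forward_select`, the toy statistic values and the critical value of 4.0 are hypothetical placeholders, not values from the claims.

```python
def forward_select(candidates, statistic, critical_value):
    """Forward stepwise selection.

    statistic(selected, candidate) must return the test statistic of the
    candidate's coefficient in the model containing `selected` plus the
    candidate. The loop introduces the candidate with the largest statistic
    while that statistic exceeds `critical_value`, and stops otherwise.
    """
    selected = []
    remaining = list(candidates)
    while remaining:
        scores = {name: statistic(selected, name) for name in remaining}
        best = max(remaining, key=lambda name: scores[name])
        if scores[best] <= critical_value:
            break  # no remaining variable passes the test
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy statistic: fixed values that ignore the already selected variables.
toy_strength = {"depression": 9.0, "stressful life events": 6.0, "age": 1.0}

def toy_statistic(selected, name):
    return toy_strength[name]

chosen = forward_select(toy_strength, toy_statistic, critical_value=4.0)
print(chosen)
```

With a real statistic, the value for each candidate generally changes as variables enter the model, which is why it is recomputed in every round.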
5. The risk behavior information prediction method based on a gradient lifting decision tree according to claim 2, wherein in step three, the population with risk behaviors identified by the risk behavior evaluation score is taken as the prediction target, the 15 variables found significant by the binary logistic regression, including depression, stressful life events, social support, sense of burdensomeness and childhood adversity, are taken as the prediction variables, and a gradient lifting decision tree algorithm is used to establish the prediction model of the sample data;
Establishing the prediction model of the sample data with the gradient lifting decision tree algorithm for predicting soldier risk behaviors comprises: randomly partitioning the sample data set into a training set and a test set in a ratio of 3: ; the training set is used to train the gradient lifting decision tree prediction model, and suitable hyperparameters are set according to the highest prediction accuracy of the model; the independent test set is used only to verify and evaluate the model: the gradient lifting decision tree model is trained on the new balanced training set, and multiple performance indexes of the model are verified independently; risk behavior indexes are output by the model, together with the relative importance weights of the prediction variables.
The gradient lifting decision tree contains a plurality of decision trees, and the prediction model is generated by the results of all the decision trees together;
the algorithm flow for carrying out the gradient lifting decision tree algorithm for statistical classification on the risk behavior sample comprises the following steps:
extracting the prediction variables respectively and normalizing them, and removing redundant and repeated prediction variables using correlation analysis and binary logistic regression; the prediction variables serve as the input samples of the model, training and learning are carried out iteratively, and the final output of the model is the risk behavior prediction result;
the training and learning process of the prediction model comprises: inputting the obtained prediction variables into the 1st gradient lifting decision tree to obtain the model's estimates for the training samples; calculating the model residuals from the obtained sample estimates; training the 2nd model on the original sample input information and the residuals, and repeating until M models are trained, finally obtaining the prediction result of the risk behaviors;
For a risk behavior training data set $T = \{(x_1, y_1), \ldots, (x_N, y_N)\}$ containing N samples, the gradient lifting decision tree algorithm flow comprises:
(1) Initializing the learner:
$$f_0(x) = \arg\min_c \sum_{i=1}^{N} L(y_i, c)$$
wherein $f_0(x)$ is an initial tree with only one root node, c is the constant that minimizes the loss function, $L(y_i, c)$ is the loss function for calculating the difference between the target value and the calculated value, and $y_i$ is the target value of the i-th training sample;
the log-likelihood function is introduced as the loss function to reduce the residual loss of the samples, with the expression:
$$L(y, f(x)) = \log(1 + \exp(-y f(x)));$$
(2) For the iterations m = 1, 2, ..., M and for each sample i = 1, 2, ..., N, the negative gradient of the i-th training sample is calculated, and the calculation formula for the residual is:
$$r_{mi} = -\left[\frac{\partial L(y_i, f(x_i))}{\partial f(x_i)}\right]_{f(x) = f_{m-1}(x)}$$
taking the obtained residual values as the true values of new samples, the data $(x_i, r_{mi})$ (i = 1, 2, ..., N) are fitted with a decision tree $T_m$ consisting of J leaf nodes, whose leaf node regions are $R_{mj}$ (j = 1, 2, ..., J); the best fit value of each leaf node is:
$$c_{mj} = \arg\min_c \sum_{x_i \in R_{mj}} L(y_i, f_{m-1}(x_i) + c)$$
updating the strong learner:
$$f_m(x) = f_{m-1}(x) + \sum_{j=1}^{J} c_{mj} I(x \in R_{mj})$$
wherein I is the indicator function of whether a training sample falls in the j-th leaf node region;
(3) After M rounds of iteration, the final learner is obtained:
$$f(x) = f_M(x) = f_0(x) + \sum_{m=1}^{M} \sum_{j=1}^{J} c_{mj} I(x \in R_{mj})$$
wherein $f_0(x)$ is the initial tree with only one root node, $c_{mj}$ is the constant that minimizes the loss function, I is the indicator function of whether a training sample falls in the j-th leaf node region, and $T_m$ is the decision tree consisting of J leaf nodes with leaf node regions $R_{mj}$ (j = 1, 2, ..., J).
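The algorithm flow of steps (1) to (3) can be illustrated with a deliberately small sketch. It makes several simplifying assumptions that are not part of the claim: labels are coded as -1 and +1 with both classes present, every tree is a stump with J = 2 leaf nodes, at least one feature takes two distinct values, and the leaf value is approximated by a single Newton step instead of an exact minimization. All function names are hypothetical.

```python
import math

def _sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def fit_stump(X, residuals):
    """Least-squares split of the residuals; returns (feature, threshold)."""
    best = None
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [r for row, r in zip(X, residuals) if row[f] <= threshold]
            right = [r for row, r in zip(X, residuals) if row[f] > threshold]
            error = _sse(left) + _sse(right)
            if best is None or error < best[0]:
                best = (error, f, threshold)
    return best[1], best[2]

def leaf_value(residuals):
    """One Newton step approximating the loss-minimizing constant of a leaf."""
    denominator = sum(abs(r) * (1 - abs(r)) for r in residuals)
    return sum(residuals) / denominator if denominator > 1e-12 else 0.0

def fit_gbdt(X, y, n_rounds=10):
    """Returns (f0, trees); each tree is (feature, threshold, c_left, c_right)."""
    positives = sum(1 for label in y if label == 1)
    f0 = math.log(positives / (len(y) - positives))  # minimizes sum log(1 + exp(-y c))
    scores = [f0] * len(X)
    trees = []
    for _ in range(n_rounds):
        # Negative gradient of log(1 + exp(-y f)) with respect to f.
        residuals = [label / (1 + math.exp(label * s)) for label, s in zip(y, scores)]
        feature, threshold = fit_stump(X, residuals)
        c_left = leaf_value([r for row, r in zip(X, residuals) if row[feature] <= threshold])
        c_right = leaf_value([r for row, r in zip(X, residuals) if row[feature] > threshold])
        trees.append((feature, threshold, c_left, c_right))
        scores = [s + (c_left if row[feature] <= threshold else c_right)
                  for row, s in zip(X, scores)]
    return f0, trees

def predict_score(model, x):
    """f(x) = f0 plus the leaf value of every tree for the region containing x."""
    f0, trees = model
    return f0 + sum(cl if x[f] <= t else cr for f, t, cl, cr in trees)

model = fit_gbdt([[1.0], [2.0], [3.0], [4.0]], [-1, -1, 1, 1], n_rounds=5)
print(predict_score(model, [1.0]), predict_score(model, [4.0]))
```

A positive score favors the +1 class, and applying the logistic function to the score yields a value between 0 and 1 that can be read as a class probability.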
The output prediction variable relative importance weights include:
for a single decision tree T, the importance of a prediction variable k is calculated from the splits in which the variable was selected as the splitting variable during the iterative process, according to the following formula:
$$\hat{J}_k^2(T) = \sum_{t=1}^{J-1} \hat{i}_t^2 \, \mathbb{1}(v_t = k)$$
wherein J-1 is the number of non-leaf nodes, $v_t$ is the feature associated with the non-leaf node t, and $\hat{i}_t^2$ is the reduction in squared error after the node is split;
for the set of decision trees $\{T_m\}_{m=1}^{M}$, the global importance of a feature variable is measured by the average of its importance over the single decision trees, as shown in the following formula:
$$\hat{J}_k^2 = \frac{1}{M} \sum_{m=1}^{M} \hat{J}_k^2(T_m)$$
where M is the number of decision trees, $\hat{J}_k^2(T_m)$ is the importance of the prediction variable k in the m-th decision tree, and the importances of all prediction variables sum to 1.
6. The risk behavior information prediction method based on a gradient lifting decision tree according to claim 2, wherein in step four, the risk behavior prediction variables are input into the risk behavior gradient lifting decision tree prediction model, and the corresponding risk behavior prediction values are output; risk behavior indexes are formed from the risk behavior prediction values of different individuals, and risk behavior levels are divided;
given an existing risk behavior gradient lifting decision tree prediction model, the prediction variables of a new individual are passed to the model;
the risk behavior gradient lifting decision tree prediction model evaluates its sub-decision tree models, and the resulting risk behavior prediction values form a multi-decision-tree prediction value data set in which each record corresponds to one sub-decision tree; the model sends this data set to the risk behavior level prediction and judgment module;
the risk behavior level prediction and judgment module normalizes all prediction values in the multi-decision-tree prediction value data set with the function $S(x) = \frac{1}{1 + e^{-x}}$, wherein x is the value to be processed; the function maps the preceding prediction result to a real number in the interval (0, 1), and the judgment is made on this value; when the S(x) value is larger than the threshold, no risk behavior is determined; when the S(x) value is smaller than the threshold, risk behavior is determined;
a risk behavior prediction value between 0 and 1 is output and converted to a percentage scale to form a risk behavior index from 0 to 100, and different risk behavior levels are divided according to the risk behavior index.
7. A risk behavior information prediction system based on a gradient lifting decision tree for implementing the risk behavior information prediction method according to any one of claims 1 to 6, comprising:
the data acquisition module, used for collecting risk behavior information evaluation data through three risk behavior evaluation methods, namely psychological scale evaluation, the House-Tree-Person projective test and expert interview evaluation;
the data preprocessing module, used for normalizing the data and removing redundant variables using Spearman correlation, point-biserial correlation analysis, the χ² test and binary logistic regression;
the model training module, used for establishing training set and test set data, training the gradient lifting decision tree prediction model using the training set, and checking the performance of the gradient lifting decision tree prediction model using the test set;
the risk behavior prediction and judgment module, used for inputting the risk behavior prediction variables into the gradient lifting decision tree prediction model, outputting a risk behavior prediction value, converting the risk behavior prediction value into a risk behavior index according to a certain rule, and dividing the risk behavior level.
8. A computer device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the gradient boost decision tree-based risk behavior information prediction method of any one of claims 1 to 6.
9. A computer readable storage medium storing a computer program which, when executed by a processor, causes the processor to perform the steps of the risk behavior information prediction method based on a gradient boost decision tree as claimed in any one of claims 1 to 6.
10. An information data processing terminal, wherein the information data processing terminal is configured to implement the risk behavior information prediction system based on a gradient boosting decision tree according to claim 7.
CN202310161573.1A 2023-02-23 2023-02-23 Risk behavior information prediction method and system based on gradient lifting decision tree Pending CN116502742A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310161573.1A CN116502742A (en) 2023-02-23 2023-02-23 Risk behavior information prediction method and system based on gradient lifting decision tree

Publications (1)

Publication Number Publication Date
CN116502742A true CN116502742A (en) 2023-07-28

Family

ID=87329204

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310161573.1A Pending CN116502742A (en) 2023-02-23 2023-02-23 Risk behavior information prediction method and system based on gradient lifting decision tree

Country Status (1)

Country Link
CN (1) CN116502742A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116936106A (en) * 2023-09-18 2023-10-24 天津医科大学第二医院 Method and system for evaluating risk of dangerous event in dialysis
CN116936106B (en) * 2023-09-18 2023-12-22 天津医科大学第二医院 Method and system for evaluating risk of dangerous event in dialysis
CN117610891A (en) * 2024-01-22 2024-02-27 湖南小翅科技有限公司 Flexible work order and risk control system based on big data
CN117610891B (en) * 2024-01-22 2024-04-02 湖南小翅科技有限公司 Flexible work order and risk control system based on big data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination