CN115526386A - Survival analysis method for individual industrial and commercial customers - Google Patents


Info

Publication number
CN115526386A
Authority
CN
China
Prior art keywords
individual industrial
commercial
data
survival
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211114151.0A
Other languages
Chinese (zh)
Inventor
谷晓丽
董波
王立
陈怡桐
覃缘琪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab
Priority to CN202211114151.0A
Publication of CN115526386A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063 Operations research, analysis or management
    • G06Q10/0635 Risk analysis of enterprise or organisation activities

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Human Resources & Organizations (AREA)
  • Strategic Management (AREA)
  • Theoretical Computer Science (AREA)
  • Economics (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Software Systems (AREA)
  • Marketing (AREA)
  • Tourism & Hospitality (AREA)
  • Game Theory and Decision Science (AREA)
  • General Business, Economics & Management (AREA)
  • Operations Research (AREA)
  • Development Economics (AREA)
  • Quality & Reliability (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Educational Administration (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention belongs to the technical field of survival-factor and survival analysis for individual industrial and commercial businesses, and discloses a survival analysis method for such businesses comprising the following steps: S1, collecting historical survival data of individual industrial and commercial businesses in a region to form sample data; S2, determining an independent variable X and a dependent variable Y; S3, preprocessing the sample data; S4, performing feature selection on the independent variable X using a mutual information method to obtain final sample data S_E; S5, dividing the sample data S_E into a training set and a test set; S6, training a regression random forest model; and S7, quantitatively evaluating the trained model. The invention uses machine learning to analyze the factors that influence the survival of individual industrial and commercial businesses and to assess states such as their cancellation risk and survival time. It establishes the relationship between these influencing factors and the lifespan of the businesses, thereby realizing their survival analysis.

Description

Survival analysis method for individual industrial and commercial customers
Technical Field
The invention belongs to the technical field of survival-factor and survival analysis for individual industrial and commercial businesses, and in particular relates to a survival analysis method for individual industrial and commercial businesses.
Background
Individual industrial and commercial businesses are an important force for economic prosperity, employment and social stability. However, because they are small, they are easily affected by their environment and have little capacity to withstand risk, so they struggle to survive and face a high risk of cancellation. Exploring the relationship between survival factors and survival risk for these businesses, and developing methods that sense changes in their internal and external environment in advance, is therefore an urgent problem.
Survival analysis is widely applied in healthcare and finance. It studies the probability that an event of interest occurs to an individual during different observation periods, and it seeks potential causal relationships between the observed variables and the event of interest. A survival analysis method for individual industrial and commercial businesses mines large data samples to construct the relationship between the factors that influence the survival of these businesses and their survival risk. Establishing this relationship provides a foundation for evaluating that risk.
A regression random forest is an ensemble algorithm built from decision trees. The problem to be solved is how to select the features most relevant to the survival of individual industrial and commercial businesses and train a random forest model on them to evaluate survival.
Disclosure of Invention
The invention aims to provide a survival analysis method for individual industrial and commercial businesses that solves the above technical problems.
To solve these technical problems, the specific technical scheme of the method is as follows:
A survival analysis method for individual industrial and commercial businesses, comprising the steps of:
S1, collecting historical survival data of individual industrial and commercial businesses in a region to form sample data;
S2, determining an independent variable X and a dependent variable Y;
S3, preprocessing the sample data;
S4, performing feature selection on the independent variable X using a mutual information method to obtain final sample data S_E;
S5, dividing the sample data S_E into a training set and a test set;
S6, training a regression random forest model;
S7, quantitatively evaluating the trained model.
Further, in step S3, the sample data is preprocessed as follows:
S31, deleting features that represent a unique attribute of a data sample, namely the ID number;
S32, for missing data, deleting any feature whose missing proportion is greater than 50% when the model is constructed; for features whose missing proportion is less than 50%, filling continuous-valued features with the mean and discrete-valued features with the mode;
S33, for redundant feature data, deleting approximately duplicate records in the data set and deleting the redundant data.
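Steps S31 to S33 can be sketched in plain Python as follows. This is a minimal illustration, not the patented implementation: the function name `preprocess`, its parameters, and the use of `None` to mark a missing value are all assumptions. Two simplifications are made as well: a feature missing in exactly 50% of samples is kept (the text does not cover that boundary), and only exact duplicate records are removed rather than approximate ones.

```python
from collections import Counter
from statistics import mean


def preprocess(rows, id_columns, continuous_columns, max_missing=0.5):
    """Clean a list of dict records; None marks a missing value (assumption)."""
    # S31: drop features that only identify a sample (e.g. an ID number).
    columns = [c for c in rows[0] if c not in id_columns]

    # S32 (part 1): drop features whose missing proportion exceeds the threshold.
    n = len(rows)
    kept = [c for c in columns
            if sum(1 for r in rows if r[c] is None) / n <= max_missing]

    # S32 (part 2): mean fill for continuous features, mode fill for discrete ones.
    fill = {}
    for c in kept:
        present = [r[c] for r in rows if r[c] is not None]
        if c in continuous_columns:
            fill[c] = mean(present)
        else:
            fill[c] = Counter(present).most_common(1)[0][0]

    # S33 (simplified): remove exact duplicate records, keeping the first copy.
    cleaned, seen = [], set()
    for r in rows:
        record = tuple(fill[c] if r[c] is None else r[c] for c in kept)
        if record not in seen:
            seen.add(record)
            cleaned.append(dict(zip(kept, record)))
    return cleaned
```

Detecting approximately duplicate records would need a similarity measure, which the text does not specify, so it is left out of the sketch.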
Further, the calculation in step S4 is specifically: the influence of each feature in the preprocessed sample on the dependent variable Y is calculated by the mutual information method, the features are ranked by this value, and several features with the highest mutual information values are selected as the model input X_I, giving the final sample data S_E.
Further, the mutual information method in step S4 is: a measure of the interdependence between two random variables, that is, the degree to which the random variable X reduces the uncertainty of the random variable Y. It is denoted I(X; Y) and calculated as
$$I(X;Y)=\sum_{i}\sum_{j} p(x_i,y_j)\,\log\frac{p(x_i,y_j)}{p(x_i)\,p(y_j)}$$
where p(x_i) is the probability that X = x_i, p(y_j) is the probability that Y = y_j, and p(x_i, y_j) is the joint probability that X = x_i and Y = y_j.
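For two discrete variables, the mutual information defined above can be estimated from observed frequencies. The sketch below is illustrative only: the function name `mutual_information` is an assumption, probabilities are estimated as simple relative frequencies, and the natural logarithm is used. A continuous dependent variable such as a duration would first need to be binned, or a different estimator used, which the text does not specify.

```python
from collections import Counter
from math import log


def mutual_information(xs, ys):
    """Estimate I(X; Y) in nats from paired samples of two discrete variables."""
    n = len(xs)
    count_x = Counter(xs)            # marginal counts of X
    count_y = Counter(ys)            # marginal counts of Y
    count_xy = Counter(zip(xs, ys))  # joint counts of (X, Y)

    total = 0.0
    for (x, y), c in count_xy.items():
        p_xy = c / n
        p_x = count_x[x] / n
        p_y = count_y[y] / n
        total += p_xy * log(p_xy / (p_x * p_y))
    return total
```

Pairs that never occur together contribute nothing to the sum, so iterating only over observed pairs is sufficient. A feature that determines Y completely yields the entropy of Y, while an independent feature yields zero.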
Further, in step S5, the final sample data S_E is randomly divided in a proportion of 7.
Further, in step S6, a regression random forest model is used for prediction: the model hyperparameters are determined and the training set is input for model training.
Further, the regression random forest model in step S6 is implemented as follows: in the training stage, a bootstrap sampling algorithm draws several different sub-training data sets from the input training data set, and these are used in turn to train several different decision trees; in the prediction stage, the average of the predictions of the individual decision trees is taken as the model output.
Further, in step S7, the model is quantitatively evaluated using the root mean square error (RMSE) and r2_score.
The survival analysis method for individual industrial and commercial businesses has the following advantages:
1. It uses machine learning to analyze the factors that influence the survival of individual industrial and commercial businesses, selects features by their mutual information values, and combines this with a regression random forest model to assess states such as cancellation risk and survival time.
2. It establishes the relationship between the influencing factors and the lifespan of individual industrial and commercial businesses, thereby realizing their survival analysis.
Drawings
FIG. 1 is a complete flow chart of the survival analysis method for individual industrial and commercial businesses of the present invention;
FIG. 2 is a sample overall characteristic diagram of an embodiment of the invention;
FIG. 3 is a diagram of the sample data after feature deletion according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the final sample data of an embodiment of the present invention;
FIG. 5 is a schematic representation of features screened by an embodiment of the present invention;
FIG. 6 is a graph of test results of test samples 2000-3000 according to embodiments of the present invention.
Detailed Description
To provide a better understanding of the purpose, structure and function of the invention, the survival analysis method for individual industrial and commercial businesses is described in further detail below with reference to the accompanying drawings.
As shown in FIG. 1, the survival analysis method for individual industrial and commercial businesses based on mutual information values and a random forest model comprises the following steps:
S1, collecting historical survival data of individual industrial and commercial businesses in a region to form sample data;
S2, determining an independent variable X and a dependent variable Y;
S3, preprocessing the sample data.
The sample data is preprocessed as follows:
S31, deleting features that represent a unique attribute of a data sample, namely the ID number.
S32, for missing data, deleting any feature whose missing proportion is greater than 50% when the model is constructed. For features whose missing proportion is less than 50%, continuous-valued features are filled with the mean and discrete-valued features are filled with the mode.
S33, for redundant feature data, deleting approximately duplicate records in the data set and deleting the redundant data.
S4, performing feature selection on the independent variable X using a mutual information method to obtain final sample data S_E.
The influence of each feature in the preprocessed sample on the dependent variable Y is calculated by the mutual information method, the features are ranked by this value, and several features with the highest mutual information values are selected as the model input X_I, giving the final sample data S_E.
Mutual information is a measure of the interdependence between two random variables, that is, the degree to which the random variable X reduces the uncertainty of the random variable Y. It is denoted I(X; Y) and calculated as
$$I(X;Y)=\sum_{i}\sum_{j} p(x_i,y_j)\,\log\frac{p(x_i,y_j)}{p(x_i)\,p(y_j)}$$
where p(x_i) is the probability that X = x_i, p(y_j) is the probability that Y = y_j, and p(x_i, y_j) is the joint probability that X = x_i and Y = y_j.
S5, dividing the sample data S_E into a training set and a test set.
The final sample data S_E is randomly divided in a proportion of 7 to obtain a training set and a test set.
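A random division of this kind can be sketched as follows. The function name `train_test_split`, the fixed `seed`, and in particular the default `train_fraction=0.7` are assumptions made for illustration; the text above gives the proportion only as 7.

```python
import random


def train_test_split(samples, train_fraction=0.7, seed=0):
    """Shuffle sample indices and cut them into a training and a test part."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)  # seeded so the split is reproducible

    cut = round(len(samples) * train_fraction)
    train = [samples[i] for i in indices[:cut]]
    test = [samples[i] for i in indices[cut:]]
    return train, test
```

Shuffling indices rather than the samples themselves leaves the input list untouched, and every sample lands in exactly one of the two sets.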
S6, training a regression random forest model.
A regression random forest model is used for prediction: the model hyperparameters are determined and the training set is input to train the model.
The regression random forest model is implemented as follows: in the training stage, a bootstrap sampling algorithm draws several different sub-training data sets from the input training data set, and these are used in turn to train several different decision trees; in the prediction stage, the average of the predictions of the individual decision trees is taken as the model output.
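The two stages described above, bootstrap sampling during training and averaging during prediction, can be sketched in plain Python. This is a deliberately reduced illustration and not the model used in the embodiment: each "tree" is a depth-1 regression stump standing in for a full decision tree, the per-split random feature subsetting of a full random forest is omitted, and the names `fit_stump`, `fit_forest` and `predict` are hypothetical.

```python
import random
from statistics import mean


def fit_stump(X, y):
    """Depth-1 regression tree: the single split with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [yi for row, yi in zip(X, y) if row[j] <= t]
            right = [yi for row, yi in zip(X, y) if row[j] > t]
            if not left or not right:
                continue
            ml, mr = mean(left), mean(right)
            sse = (sum((v - ml) ** 2 for v in left)
                   + sum((v - mr) ** 2 for v in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:  # no valid split exists: always predict the mean
        m = mean(y)
        return lambda row: m
    _, j, t, ml, mr = best
    return lambda row: ml if row[j] <= t else mr


def fit_forest(X, y, n_trees=100, seed=0):
    """Training stage: one stump per bootstrap sample of the training set."""
    rng = random.Random(seed)
    n = len(X)
    trees = []
    for _ in range(n_trees):
        # Bootstrap: draw n row indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return trees


def predict(trees, row):
    """Prediction stage: average the outputs of all trees."""
    return mean([tree(row) for tree in trees])
```

Because each bootstrap sample differs, the individual trees differ too, and averaging them reduces the variance of the final prediction.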
S7, quantitatively evaluating the trained model.
The model is quantitatively evaluated using the root mean square error (RMSE) and r2_score.
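Both metrics are simple to compute directly. The sketch below assumes the standard definitions: RMSE is the square root of the mean squared residual, and the coefficient of determination is one minus the ratio of the residual sum of squares to the total sum of squares. The name `r2_score` matches the scikit-learn metric of that name, although the text does not say which library was used.

```python
from math import sqrt


def rmse(y_true, y_pred):
    """Root mean square error between true and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sqrt(sum(squared) / len(y_true))


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot
```

A perfect prediction gives an RMSE of 0 and an r2_score of 1, while always predicting the mean of the true values gives an r2_score of 0. RMSE is expressed in the units of the dependent variable, so it is only comparable across models evaluated on the same target scale.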
In this embodiment, the survival analysis method for individual industrial and commercial businesses based on mutual information values and a random forest model comprises the following specific steps:
S1, historical survival data of individual industrial and commercial businesses in a region is collected to form sample data. In this embodiment, 174806 samples with 42 features are collected in total; the features are shown in FIG. 2.
S2, the features of the data samples collected in step S1 are divided into the independent variable X and the dependent variable Y. In this embodiment, the feature "operation duration" is taken as the dependent variable Y and the remaining 41 features as the independent variable X.
S3, because the independent variable X from step S2 contains unique-attribute features and redundant features introduced during collection in step S1, the collected sample data is preprocessed in this step: duplicate samples, features representing unique attributes, and redundant features are deleted, leaving 1699 samples and 24 features, as shown in FIG. 3. Unordered discrete features are then one-hot encoded, giving the preprocessed data sample set of 16992 samples and 40 features shown in FIG. 4.
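One-hot encoding replaces an unordered discrete feature with one binary column per category, which is why the feature count grows in this step. A minimal sketch, in which the function name `one_hot`, the `column=value` naming of the new columns, and the example field names are all hypothetical:

```python
def one_hot(rows, column):
    """Replace one categorical column with a 0/1 column per category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        # Copy every other field unchanged.
        new = {k: v for k, v in r.items() if k != column}
        # Add one indicator column per category.
        for c in categories:
            new[f"{column}={c}"] = 1 if r[column] == c else 0
        encoded.append(new)
    return encoded
```

For example, a `sector` feature taking the values `food` and `retail` becomes the two columns `sector=food` and `sector=retail`, exactly one of which is 1 for each sample.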
S4, the features of the independent variable X preprocessed in step S3 are ranked using the mutual information method, and the 12 most important features, shown in FIG. 5, are selected as the model input X_I, giving the final sample data S_E.
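Once a mutual information score has been computed for each feature, the selection itself is a sort followed by a cut. The sketch below assumes the scores are already available as a mapping from feature name to score; the function name `select_top_k` and the example feature names are hypothetical.

```python
def select_top_k(scores, k=12):
    """Return the names of the k features with the highest scores."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:k]]
```

Calling `select_top_k({"capital": 0.1, "sector": 0.5, "region": 0.3}, k=2)` returns `["sector", "region"]`. The default `k=12` mirrors the number of features kept in this embodiment.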
S5, the sample data S_E is randomly divided into a training set and a test set, with a training-to-test sample ratio of 7.
S6, the model is built as a regression random forest with the number-of-trees hyperparameter set to 100; the training set is input, and 10% of the data set is split off for 5-fold cross-validation during model training.
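The k-fold part of this step can be illustrated with a small index generator. It shows only how k-fold cross-validation partitions a set of samples; how the 10% subset mentioned above is chosen and combined with the folds is not specified in the text, so that part is not modeled. The name `k_fold_indices` and the fixed `seed` are assumptions.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (training, validation) index lists for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]

    for i in range(k):
        validation = folds[i]
        training = [idx for j, fold in enumerate(folds) if j != i
                    for idx in fold]
        yield training, validation
```

Each sample is used for validation exactly once and for training in the other k - 1 rounds, so averaging a metric over the k rounds gives a less split-dependent estimate than a single hold-out set.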
S7, the test set is input for testing; partial results are shown in FIG. 6. The model is quantitatively evaluated using the root mean square error (RMSE) and r2_score: on the training set the RMSE is 0.017 and the r2_score is 0.999; on the test set the RMSE is 0.104 and the r2_score is 0.995.
The complete method is implemented in Jupyter Notebook. The embodiment shows that the invention builds a survival analysis model for individual industrial and commercial businesses and realizes survival time prediction and risk evaluation for them.
It is to be understood that the present invention has been described with reference to certain embodiments and that various changes in form and details may be made therein by those skilled in the art without departing from the spirit and scope of the invention. In addition, many modifications may be made to adapt a particular situation or material to the teachings of the invention without departing from the essential scope thereof. Therefore, it is intended that the invention not be limited to the particular embodiment disclosed, but that the invention will include all embodiments falling within the scope of the appended claims.

Claims (8)

1. A survival analysis method for individual industrial and commercial businesses, characterized in that the method comprises the steps of:
S1, collecting historical survival data of individual industrial and commercial businesses in a region to form sample data;
S2, determining an independent variable X and a dependent variable Y;
S3, preprocessing the sample data;
S4, performing feature selection on the independent variable X using a mutual information method to obtain final sample data S_E;
S5, dividing the sample data S_E into a training set and a test set;
S6, training a regression random forest model;
S7, quantitatively evaluating the trained model.
2. The survival analysis method for individual industrial and commercial businesses according to claim 1, wherein in step S3 the sample data is preprocessed as follows:
S31, deleting features that represent a unique attribute of a data sample, namely the ID number;
S32, for missing data, deleting any feature whose missing proportion is greater than 50% when the model is constructed; for features whose missing proportion is less than 50%, filling continuous-valued features with the mean and discrete-valued features with the mode;
S33, for redundant feature data, deleting approximately duplicate records in the data set and deleting the redundant data.
3. The survival analysis method for individual industrial and commercial businesses according to claim 1, wherein the calculation in step S4 is specifically: the influence of each feature in the preprocessed sample on the dependent variable Y is calculated by the mutual information method, the features are ranked by this value, and several features with the highest mutual information values are selected as the model input X_I, giving the final sample data S_E.
4. The survival analysis method for individual industrial and commercial businesses according to claim 1, wherein the mutual information method in step S4 is: a measure of the interdependence between two random variables, that is, the degree to which the random variable X reduces the uncertainty of the random variable Y, denoted I(X; Y) and calculated as
$$I(X;Y)=\sum_{i}\sum_{j} p(x_i,y_j)\,\log\frac{p(x_i,y_j)}{p(x_i)\,p(y_j)}$$
where p(x_i) is the probability that X = x_i, p(y_j) is the probability that Y = y_j, and p(x_i, y_j) is the joint probability that X = x_i and Y = y_j.
5. The survival analysis method for individual industrial and commercial businesses according to claim 1, wherein in step S5 the final sample data S_E is randomly divided in a ratio of 7.
6. The survival analysis method for individual industrial and commercial businesses according to claim 1, wherein in step S6 a regression random forest model is used for prediction, the model hyperparameters are determined, and the training set is input for model training.
7. The survival analysis method for individual industrial and commercial businesses according to claim 1, wherein the regression random forest model in step S6 is implemented as follows: in the training stage, a bootstrap sampling algorithm draws several different sub-training data sets from the input training data set, and these are used in turn to train several different decision trees; in the prediction stage, the average of the predictions of the individual decision trees is taken as the model output.
8. The survival analysis method for individual industrial and commercial businesses according to claim 1, wherein in step S7 the model is quantitatively evaluated using the root mean square error (RMSE) and r2_score.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211114151.0A CN115526386A (en) 2022-09-14 2022-09-14 Survival analysis method for individual industrial and commercial customers

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211114151.0A CN115526386A (en) 2022-09-14 2022-09-14 Survival analysis method for individual industrial and commercial customers

Publications (1)

Publication Number Publication Date
CN115526386A true CN115526386A (en) 2022-12-27

Family

ID=84696879

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211114151.0A Pending CN115526386A (en) 2022-09-14 2022-09-14 Survival analysis method for individual industrial and commercial customers

Country Status (1)

Country Link
CN (1) CN115526386A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination