CN116579842A

CN116579842A - Credit data analysis method and system based on user behavior data

Info

Publication number: CN116579842A
Application number: CN202310854274.6A
Authority: CN
Inventors: 刘晓光; 王潇霏; 王刚; 陈静怡; 王文蕊; 赵思浓
Original assignee: Nankai University
Current assignee: Nankai University
Priority date: 2023-07-13
Filing date: 2023-07-13
Publication date: 2023-08-11
Anticipated expiration: 2043-07-13
Also published as: CN116579842B

Abstract

The invention relates to the technical field of data processing and discloses a credit data analysis method and system based on user behavior data, which improve the accuracy of credit data analysis. The method comprises the following steps: collecting a plurality of user behavior data and performing tag matching to determine tag data; performing data integration on the plurality of user behavior data and the tag data to obtain a user data set; performing data preprocessing on the user data set to obtain a data set to be analyzed; performing first feature extraction processing on the data set to be analyzed through a filtering type feature extraction algorithm to obtain a first candidate feature set; performing second feature extraction processing on the first candidate feature set through a wrapped feature extraction algorithm to obtain a second candidate feature set; performing data drift detection and feature screening processing on the second candidate feature set to obtain a target feature set; and performing credit data analysis on the target feature set to obtain a credit data analysis result, and transmitting the credit data analysis result to a preset data processing terminal.

Description

Credit data analysis method and system based on user behavior data

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a credit data analysis method and system based on user behavior data.

Background

In recent years, the rapid development of internet finance has made user credit data analysis increasingly important. However, financial-attribute data strongly related to a user is difficult and costly to acquire, and the effective data that can be acquired is limited, which makes it difficult to build a highly accurate credit data analysis system. At the same time, the dimensionality of the data has grown explosively, so the data is high-dimensional and sparse. In risk control modeling, cleaning and processing structured data is laborious, and the transformed data forms sparse matrices, so that much information is lost and feature extraction is difficult. Data of such high dimensionality also exceeds the range that a traditional scorecard model can handle.

Machine learning models have significant advantages in modeling data with the above characteristics. On the one hand, a machine learning model can help screen out irrelevant and redundant features that degrade the modeling effect. Feature selection effectively reduces the dimensionality of the data, lowers the computational complexity of the model, and improves its calculation speed and precision. On the other hand, a machine learning model can discover rules and patterns in high-dimensional sparse data and has stronger generalization capability. Machine learning modeling can effectively improve the prediction and classification performance of the model while helping to prevent overfitting.

At the same time, data drift has in recent years created great difficulties for putting machine learning models into production. Data drift is the gradual change in the distribution of data over time or space: the data to be predicted or validated is distributed significantly differently from the data used for training, which can significantly reduce the predictive performance of the system model. As a result, existing credit data analysis based on user behavior data has low accuracy.

Disclosure of Invention

In view of the above, the embodiment of the invention provides a credit data analysis method and a credit data analysis system based on user behavior data, which solve the technical problem of lower accuracy in credit data analysis based on user behavior data.

The invention provides a credit data analysis method based on user behavior data, which comprises the following steps: collecting a plurality of user behavior data, performing tag matching on the plurality of user behavior data, and determining tag data corresponding to each user behavior data; performing data integration on the plurality of user behavior data and the tag data corresponding to each user behavior data to obtain a user data set; performing data preprocessing on the user data set to obtain a data set to be analyzed; performing first feature extraction processing on the data set to be analyzed through a filtering type feature extraction algorithm to obtain a first candidate feature set; performing second feature extraction processing on the first candidate feature set through a wrapped feature extraction algorithm to obtain a second candidate feature set; performing data drift detection and feature screening processing on the second candidate feature set to obtain a target feature set; and carrying out credit data analysis on the target feature set through a preset target credit data analysis model to obtain a credit data analysis result, and transmitting the credit data analysis result to a preset data processing terminal.

In the present invention, the step of collecting a plurality of user behavior data, performing tag matching on a plurality of user behavior data, and determining tag data corresponding to each user behavior data includes: collecting a plurality of user behavior data, extracting time data from each user behavior data, and determining time data corresponding to each user behavior data; and carrying out tag matching on a plurality of user behavior data based on the time data corresponding to each user behavior data, and determining the tag data corresponding to each user behavior data.

In the present invention, the step of performing data preprocessing on the user data set to obtain a data set to be analyzed includes: performing outlier analysis on the user data set, determining a target outlier, and performing missing value analysis on the user data set through the outlier to determine a target missing value; and based on the target missing value, performing data filling processing on the user data set to obtain a data set to be analyzed.

In the present invention, the step of performing a first feature extraction process on the data set to be analyzed by using a filtering feature extraction algorithm to obtain a first candidate feature set includes: redundant feature elimination is carried out on the data set to be analyzed through the filtering type feature extraction algorithm, and a feature set to be processed is obtained; performing feature correlation analysis on the feature set to be processed to obtain feature correlation analysis results; and extracting the characteristics of the to-be-processed characteristic set according to the characteristic correlation analysis result to obtain a first candidate characteristic set.

In the present invention, the step of performing a second feature extraction process on the first candidate feature set by using a wrapped feature extraction algorithm to obtain a second candidate feature set includes: carrying out importance analysis on each first candidate feature in the first candidate feature set through a wrapped feature extraction algorithm, and determining importance data of each first candidate feature; and carrying out second feature extraction processing on the first candidate feature set based on the importance data of each first candidate feature to obtain a second candidate feature set.

In the present invention, the step of performing data drift detection and feature screening on the second candidate feature set to obtain a target feature set includes: performing data drift detection on the second candidate feature set through a preset adversarial classifier to generate a data drift detection result; and performing feature screening processing on the second candidate feature set according to the data drift detection result to obtain a target feature set.

In the present invention, after the step of performing data drift detection and feature screening on the second candidate feature set to obtain a target feature set, and before the step of performing credit data analysis on the target feature set by using a preset target credit data analysis model to obtain a credit data analysis result and transmitting the credit data analysis result to a preset data processing terminal, the method includes: performing initial hyperparameter analysis on an initial credit data analysis model to determine an initial hyperparameter combination; performing prior probability distribution analysis on the initial hyperparameter combination to determine prior probability distribution data; performing model training on the initial credit data analysis model through the second candidate feature set, and generating a training set and a testing set; performing posterior probability distribution analysis on the initial hyperparameter combination through the training set and the testing set to determine posterior probability distribution data; performing iterative analysis on the initial hyperparameter combination based on the posterior probability distribution data to determine an optimal hyperparameter combination; and performing parameter configuration on the initial credit data analysis model based on the optimal hyperparameter combination to obtain the target credit data analysis model.

The invention also provides a credit data analysis system based on the user behavior data, which comprises:

the data acquisition module is used for acquiring a plurality of user behavior data, performing tag matching on the plurality of user behavior data and determining tag data corresponding to each user behavior data;

the data integration module is used for integrating the plurality of user behavior data and the tag data corresponding to each user behavior data to obtain a user data set;

the data processing module is used for carrying out data preprocessing on the user data set to obtain a data set to be analyzed;

the first extraction module is used for carrying out first feature extraction processing on the data set to be analyzed through a filtering type feature extraction algorithm to obtain a first candidate feature set;

the second extraction module is used for carrying out second feature extraction processing on the first candidate feature set through a wrapped feature extraction algorithm to obtain a second candidate feature set;

the feature screening module is used for carrying out data drift detection and feature screening processing on the second candidate feature set to obtain a target feature set;

and the credit analysis module is used for carrying out credit data analysis on the target feature set through a preset target credit data analysis model to obtain a credit data analysis result, and transmitting the credit data analysis result to a preset data processing terminal.

In the technical scheme provided by the invention, a plurality of user behavior data are collected and tag matching is performed to determine the corresponding tag data; data integration is performed on the plurality of user behavior data and the tag data to obtain a user data set; data preprocessing is performed on the user data set to obtain a data set to be analyzed; first feature extraction processing is performed on the data set to be analyzed through a filtering type feature extraction algorithm to obtain a first candidate feature set; second feature extraction processing is performed on the first candidate feature set through a wrapped feature extraction algorithm to obtain a second candidate feature set; data drift detection and feature screening processing are performed on the second candidate feature set to obtain a target feature set; and credit data analysis is performed on the target feature set to obtain a credit data analysis result, which is transmitted to the preset data processing terminal. The embodiment of the invention focuses on the influence of user behavior data on credit condition and establishes a more accurate credit data analysis system based on user behavior data without needing to acquire financial-attribute data, which is costly and difficult to obtain. On the one hand, in the embodiment of the invention, categorical features are converted directly into numerical features, so that operations such as one-hot encoding are not needed, the data dimension is not increased, and the method is fast and efficient.
On the other hand, compared with the traditional gradient estimation method, performing unbiased estimation of the gradient reduces the influence of estimation bias and solves the problems of gradient bias and prediction bias, thereby effectively improving the generalization capability of the system model. The credit condition of the user can therefore be predicted at a higher training speed, with more accurate prediction and better generalization performance, which further improves the accuracy of credit data analysis based on user behavior data.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.

Fig. 1 is a flowchart of a credit data analysis method based on user behavior data in an embodiment of the invention.

Fig. 2 is a decile distribution diagram of redundant features in an embodiment of the invention.

Fig. 3 is a decile distribution diagram of non-redundant features in an embodiment of the invention.

Fig. 4 is a flowchart of the second feature extraction processing performed on the first candidate feature set by a wrapped feature extraction algorithm in an embodiment of the invention.

Fig. 5 is a graph showing the importance of the features remaining after filtering type feature selection in an embodiment of the invention.

Fig. 6 is a schematic diagram of a credit data analysis system based on user behavior data in an embodiment of the invention.

Reference numerals:

3001. a data acquisition module; 3002. a data integration module; 3003. a data processing module; 3004. a first extraction module; 3005. a second extraction module; 3006. a feature screening module; 3007. a credit analysis module; 3008. a parameter analysis module; 3009. a distribution analysis module; 3010. a model training module; 3011. a probability analysis module; 3012. an iteration analysis module; 3013. a parameter configuration module.

Detailed Description

The technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of protection of the invention.

In the description of the present invention, the terms "first," "second," "third," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, the technical features of the different embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.

Referring to fig. 1, fig. 1 is a flowchart of a credit data analysis method based on user behavior data according to an embodiment of the invention, as shown in fig. 1, including the following steps:

S101, acquiring a plurality of user behavior data, performing tag matching on the plurality of user behavior data, and determining tag data corresponding to each user behavior data;

It should be noted that the user behavior data includes information on the Apps used by the user, information on the device used by the user, and the user's recent location movement information. Tag matching is then performed on the plurality of user behavior data to determine the tag data corresponding to each user behavior data. Before classification, two problems with the App data actually collected must be handled: some of it is garbled, and its authenticity needs to be verified. First, to handle the garbled text, App data whose character length is greater than 20 is deleted; App names are then matched against commonly used Chinese characters, and the App data with a higher matching degree is retained. This completes the garbled-text processing.

Table 1. Specific meaning of some of the features in each category

Second, to verify the authenticity of the collected App data, the invention downloads real App data from a domestic professional mobile promotion data analysis platform for reference. Finally, the usage frequency of each individual App across all current users is counted. Of the 319,071 Apps, almost 83% were used only once, so the invention retains the 50,000 most frequently used Apps. The device information used by the user specifically includes the device's price at market launch, its launch year, the time the device was last active, and the number of days the device was active during the data collection period. The user's location movement information within the collection period can be coarsely located to the province, city, and county to which the user belongs, and the longitude and latitude of the user's position can also be acquired precisely. The three kinds of behavior data actually collected are then processed according to defined processing logic. The data collection and processing work consumes a great deal of time and labor, but the resulting data has very significant advantages in credibility, validity, and generality. A true and accurate data source is the key to modeling and the basis for long-term application of the system model.
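The App data cleaning described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the 20-character limit comes from the text, while the character set used to approximate "commonly used Chinese characters" (the basic CJK block plus ASCII letters and digits), the 0.8 matching cutoff, and all function names are assumptions made for the example.

```python
import re
from collections import Counter

MAX_NAME_LENGTH = 20   # from the text: longer names are treated as garbled
MIN_VALID_RATIO = 0.8  # hypothetical cutoff for a "higher matching degree"

# Assumption: common Chinese characters are approximated by the basic CJK
# Unified Ideographs block; ASCII letters and digits are accepted as well.
_VALID_CHAR = re.compile(r"[\u4e00-\u9fffA-Za-z0-9]")


def valid_ratio(name):
    """Fraction of characters in `name` that look like legitimate name characters."""
    if not name:
        return 0.0
    return sum(1 for ch in name if _VALID_CHAR.match(ch)) / len(name)


def clean_app_names(names):
    """Drop names that are too long or contain too few recognizable characters."""
    return [
        n for n in names
        if len(n) <= MAX_NAME_LENGTH and valid_ratio(n) >= MIN_VALID_RATIO
    ]


def top_apps_by_usage(usage_records, top_n):
    """Return the `top_n` most frequently used Apps from an iterable of App names."""
    counts = Counter(usage_records)
    return [app for app, _ in counts.most_common(top_n)]


raw_names = ["微信", "支付宝", "ï¿½ï¿½ï¿½", "x" * 25, "Alipay"]
cleaned = clean_app_names(raw_names)            # garbled and over-long names removed
frequent = top_apps_by_usage(["a", "a", "b", "c", "c", "c"], top_n=2)
```

In the setting described in the text, `top_n` would be 50,000; the small value here only keeps the example readable.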

It should be noted that the behavior data features in the data set to be analyzed comprise three parts: 1) App information used by the user, which includes four broad categories: financial App usage preferences, other App usage preferences, financial tags, and other tags. Specifically, App usage preference features refer to the number of Apps of each type installed, newly added, or removed on the user's device, and the number of active days, within the last 7, 15, 30, or 90 days. Tag features refer to the number of times the user opened Apps of certain types; 2) device information used by the user, which includes the launch price, year, and launch time of the device used by the user and the number of MAC addresses corresponding to the device in the last 30 days; 3) recent location movement information of the user, which includes the number of times the user recently appeared at convenience stores. In total, the user behavior data contains 94 features in six categories. Table 1 details the specific meaning of some of the features in each category.

S102, integrating a plurality of user behavior data and tag data corresponding to each user behavior data to obtain a user data set;

Specifically, the plurality of user behavior data and the tag data corresponding to each user behavior data are merged to obtain the user data set. Credit record data is obtained for the same batch of users, and the user performance period is defined as three months. A user is defined as a positive sample if, within the performance period, the time data in the user's credit record data exceeds a preset first threshold, and as a negative sample if that time data does not exceed a preset second threshold. Only data clearly defined as positive or negative samples is used for modeling. The data set covers the period from September 1, 2021 to December 31, 2021 and contains 142,793 records. The results of integrating and partitioning the data set are shown in Table 2.

Table 2. Data set partitioning results

Note that in Table 2, OOT (out-of-time) denotes the cross-time validation set, which is the last time slice of the modeling samples.
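A cross-time (OOT) split of this kind can be sketched as below. The record layout, the function name, and the cutoff date of December 1, 2021 are hypothetical; the text only states that the OOT set is the last time slice of the samples collected between September 1 and December 31, 2021.

```python
from datetime import date


def split_out_of_time(records, oot_start):
    """Split records into in-time modeling samples and an out-of-time (OOT) set.

    Records dated on or after `oot_start` form the OOT validation set.
    """
    in_time = [r for r in records if r["date"] < oot_start]
    oot = [r for r in records if r["date"] >= oot_start]
    return in_time, oot


# Hypothetical records; the cutoff date is an assumption for illustration only.
records = [
    {"user": "u1", "date": date(2021, 9, 15)},
    {"user": "u2", "date": date(2021, 11, 2)},
    {"user": "u3", "date": date(2021, 12, 20)},
]
in_time, oot = split_out_of_time(records, oot_start=date(2021, 12, 1))
```

The in-time portion would then be divided further into training and test sets, while the OOT set is held back to check how the model behaves on later data.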

S103, carrying out data preprocessing on the user data set to obtain a data set to be analyzed;

When the user data set is preprocessed, the user behavior data and the default/non-default label data are first integrated to form the final data set, and the division of the data set is completed at the same time. Outlier processing and missing value processing are then performed, and the method for filling missing values is determined according to the model effect.

Specifically, when data preprocessing is performed on a user data set, performing outlier analysis on the user data set, determining a target outlier, performing missing value analysis on the user data set through the outlier, determining a target missing value, and performing data filling processing on the user data set based on the target missing value to obtain the data set to be analyzed.

It should be noted that, because feature data with a high degree of missingness affects the modeling effect, feature columns in which more than 80% of the values are missing are deleted directly. Outliers are then detected with box plots and confirmed in combination with expert experience, and data confirmed as outliers is treated as missing values. Finally, several missing value filling methods are tried, such as fixed value filling, mean filling, previous-value filling, and interpolation filling. The filling method for each column is determined according to the system model effect, and the mean of the current column is used for filling.
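The preprocessing steps above can be sketched as follows, assuming each feature is a list of numbers in which `None` marks a missing value. The 80% limit comes from the text. The 1.5 × IQR whisker rule is the usual box-plot convention and is an assumption here, as are the function names; the expert review of flagged outliers that the text describes is not modeled.

```python
from statistics import mean, quantiles

MISSING_RATIO_LIMIT = 0.8  # from the text: drop columns more than 80% missing


def drop_sparse_columns(table):
    """Remove columns whose fraction of missing values (None) exceeds the limit."""
    return {
        name: col for name, col in table.items()
        if sum(v is None for v in col) / len(col) <= MISSING_RATIO_LIMIT
    }


def mark_boxplot_outliers(col, whisker=1.5):
    """Replace values outside the box-plot whiskers with None (treated as missing)."""
    present = [v for v in col if v is not None]
    if len(present) < 2:
        return list(col)
    q1, _, q3 = quantiles(present, n=4)
    low, high = q1 - whisker * (q3 - q1), q3 + whisker * (q3 - q1)
    return [v if v is not None and low <= v <= high else None for v in col]


def fill_with_mean(col):
    """Fill missing values with the mean of the values that are present."""
    present = [v for v in col if v is not None]
    if not present:
        return list(col)
    column_mean = mean(present)
    return [column_mean if v is None else v for v in col]


def preprocess(table):
    """Drop sparse columns, turn outliers into missing values, then mean-fill."""
    kept = drop_sparse_columns(table)
    return {
        name: fill_with_mean(mark_boxplot_outliers(col))
        for name, col in kept.items()
    }


table = {
    "installs_30d": [1, 2, 3, 4, 5, 6, 7, 8, 1000],  # 1000 lies outside the whiskers
    "mostly_missing": [None] * 8 + [1],              # about 89% missing, so dropped
}
result = preprocess(table)
```

In practice a flagged value such as the 1000 above would be reviewed before being discarded, since the text notes that unusually large or small counts can be genuine.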

S104, performing first feature extraction processing on the data set to be analyzed through a filtering type feature extraction algorithm to obtain a first candidate feature set;

S105, performing second feature extraction processing on the first candidate feature set through a wrapped feature extraction algorithm to obtain a second candidate feature set;

In the embodiment of the invention, feature selection uses two methods: a filtering method and a wrapping method. To select features relevant to modeling, redundant features are first screened out using three statistical methods: decile distribution, the rank-sum test, and standard deviation. Then, considering both linear and nonlinear relationships, the feature set is updated using the Pearson correlation coefficient method and the maximal information coefficient method. This completes the first feature extraction processing of the data set to be analyzed and yields the first candidate feature set.
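Two of the statistical screens named above can be sketched as follows. The text does not give the exact decision rules, so both rules here are assumptions: a feature is treated as redundant if its nine decile cut points collapse to a single value, or if a rank-sum test finds no significant difference between its values for positive and negative samples. The p-value uses a normal approximation without a tie correction, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, quantiles


def is_flat_by_deciles(values, max_distinct=1):
    """True if the nine decile cut points take at most `max_distinct` distinct values."""
    return len(set(quantiles(values, n=10))) <= max_distinct


def rank_sum_p_value(x, y):
    """Two-sided rank-sum (Mann-Whitney U) p-value using a normal approximation."""
    combined = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    ranks = [0.0] * len(combined)
    i = 0
    while i < len(combined):
        j = i
        while j + 1 < len(combined) and combined[j + 1][0] == combined[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1  # tied values share their average rank
        i = j + 1
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, combined) if group == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    z = (u - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def is_redundant(values, labels, alpha=0.05):
    """Assumed rule: redundant if nearly constant or indistinguishable across classes."""
    positives = [v for v, y in zip(values, labels) if y == 1]
    negatives = [v for v, y in zip(values, labels) if y == 0]
    return is_flat_by_deciles(values) or rank_sum_p_value(positives, negatives) >= alpha


separated = rank_sum_p_value(list(range(1, 11)), list(range(11, 21)))
interleaved = rank_sum_p_value(list(range(1, 20, 2)), list(range(2, 21, 2)))
```

Here `separated` is small because the two groups do not overlap, while `interleaved` is large because the groups are mixed evenly. The significance level of 0.05 is a conventional choice, not a value taken from the text.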

Further, second feature extraction processing is performed on the first candidate feature set through a wrapped feature extraction algorithm to obtain a second candidate feature set, wherein when the second feature extraction processing is performed on the first candidate feature set through the wrapped feature extraction algorithm, the server determines a better candidate feature set by combining a classification tree model feature importance scoring method, and finally the better candidate feature set is used as the second candidate feature set.
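The importance-based second stage can be illustrated with a deliberately small stand-in. The text refers to feature importance scoring from a classification tree model; the sketch below instead scores each feature by the largest Gini impurity decrease that a single threshold split (a depth-one tree) can achieve, then keeps the top-ranked features. The scoring rule, the function names, and the `keep` parameter are assumptions for illustration.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def stump_importance(values, labels):
    """Largest impurity decrease achievable by one threshold split on this feature."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    parent = gini([y for _, y in pairs])
    best = 0.0
    for i in range(1, n):
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold can separate two equal values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        child = (len(left) * gini(left) + len(right) * gini(right)) / n
        best = max(best, parent - child)
    return best


def select_by_importance(features, labels, keep):
    """Rank features by importance score and return the `keep` best names and all scores."""
    scores = {name: stump_importance(col, labels) for name, col in features.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep], scores


features = {
    "informative": [1, 2, 3, 4, 5, 6],
    "noise": [1, 2, 1, 2, 1, 2],
}
labels = [0, 0, 0, 1, 1, 1]
selected, scores = select_by_importance(features, labels, keep=1)
```

A full tree model would aggregate impurity decreases over many splits and feature interactions, so this sketch only conveys the ranking idea.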

S106, carrying out data drift detection and feature screening treatment on the second candidate feature set to obtain a target feature set;

Specifically, in the embodiment of the present invention, HyperGBM is used to detect and handle data drift in the user data. It should be noted that HyperGBM is a full-pipeline automated machine learning tool that covers the whole end-to-end process, including data cleaning, preprocessing, feature processing and screening, model selection, and hyperparameter optimization. Feature screening processing is performed at the same time to obtain the target feature set.
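The idea behind classifier-based drift detection can be sketched without any library. The example below does not call HyperGBM and is not its implementation. It is a simplified, per-feature version: training rows are labeled 0 and later rows 1, the raw feature value is used as the classifier score, and a feature is flagged as drifting when the resulting AUC is far from 0.5. The tolerance of 0.2 and all names are assumptions.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def auc(scores, labels):
    """Area under the ROC curve computed from the rank sum of the positive class."""
    ranks = average_ranks(scores)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def drifting_features(train, test, tolerance=0.2):
    """Names of features whose values separate train rows from test rows too well."""
    flagged = []
    for name in train:
        scores = list(train[name]) + list(test[name])
        labels = [0] * len(train[name]) + [1] * len(test[name])
        if abs(auc(scores, labels) - 0.5) > tolerance:
            flagged.append(name)
    return flagged


train = {"stable": [1, 2, 3, 4], "shifted": [1, 2, 3, 4]}
test = {"stable": [1, 2, 3, 4], "shifted": [11, 12, 13, 14]}
flagged = drifting_features(train, test)
target_features = [name for name in train if name not in flagged]
```

A multivariate adversarial classifier trained on all features at once can also catch drift that only shows up in feature combinations, which this per-feature check cannot.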

S107, credit data analysis is carried out on the target feature set through a preset target credit data analysis model, so that a credit data analysis result is obtained, and the credit data analysis result is transmitted to a preset data processing terminal.

By executing the above steps, a plurality of user behavior data are collected and tag matching is performed to determine the corresponding tag data; data integration is performed on the plurality of user behavior data and the tag data to obtain a user data set; data preprocessing is performed on the user data set to obtain a data set to be analyzed; first feature extraction processing is performed on the data set to be analyzed through a filtering type feature extraction algorithm to obtain a first candidate feature set; second feature extraction processing is performed on the first candidate feature set through a wrapped feature extraction algorithm to obtain a second candidate feature set; data drift detection and feature screening processing are performed on the second candidate feature set to obtain a target feature set; and credit data analysis is performed on the target feature set to obtain a credit data analysis result, which is transmitted to a preset data processing terminal.

The embodiment of the invention focuses on the influence of user behavior data on credit condition and establishes a more accurate credit data analysis system based on user behavior data without needing to acquire financial-attribute data, which is costly and difficult to obtain. On the one hand, in the embodiment of the invention, categorical features are converted directly into numerical features, so that operations such as one-hot encoding are not needed, the data dimension is not increased, and the method is fast and efficient. On the other hand, compared with the traditional gradient estimation method, performing unbiased estimation of the gradient reduces the influence of estimation bias and solves the problems of gradient bias and prediction bias, thereby effectively improving the generalization capability of the system model. The credit condition of the user can therefore be predicted at a higher training speed, with more accurate prediction and better generalization performance, which further improves the accuracy of credit data analysis based on user behavior data.

In a specific embodiment, the process of executing step S101 may specifically include the following steps:

(1) Collecting a plurality of user behavior data, extracting time data from each user behavior data, and determining time data corresponding to each user behavior data;

(2) And carrying out tag matching on the plurality of user behavior data based on the time data corresponding to each user behavior data, and determining the tag data corresponding to each user behavior data.

The user behavior data includes information on the Apps used by the user, information on the device used by the user, and the user's recent location movement information. Tag matching is performed on the plurality of user behavior data to determine the tag data corresponding to each user behavior data. In the embodiment of the invention, the user performance period is defined as three months. A user is defined as a positive sample if, within the performance period, the time data in the user's credit record data exceeds a preset first threshold, and as a negative sample if that time data does not exceed a preset second threshold.
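The labeling rule can be sketched as below. The text does not say what the time data measures or what the thresholds are, so the example assumes it is a count of days (for instance, days overdue within the performance period) and uses hypothetical thresholds of 30 and 3. Users who fall between the two thresholds are left unlabeled, which assumes the second threshold does not exceed the first.

```python
FIRST_THRESHOLD = 30   # hypothetical: above this, the user is a positive sample
SECOND_THRESHOLD = 3   # hypothetical: at or below this, the user is a negative sample


def label_sample(time_value, first_threshold=FIRST_THRESHOLD,
                 second_threshold=SECOND_THRESHOLD):
    """Return 1 (positive), 0 (negative), or None (neither clearly applies)."""
    if time_value > first_threshold:
        return 1
    if time_value <= second_threshold:
        return 0
    return None


def build_labeled_samples(time_by_user):
    """Keep only the users that receive a definite positive or negative label."""
    labeled = {}
    for user, time_value in time_by_user.items():
        label = label_sample(time_value)
        if label is not None:
            labeled[user] = label
    return labeled


# Hypothetical users and values, for illustration only.
labeled = build_labeled_samples({"u1": 45, "u2": 0, "u3": 10})
```

Here `u3` is dropped because its value lies between the two thresholds, so it is neither a positive nor a negative sample.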

In a specific embodiment, the process of executing step S103 may specifically include the following steps:

(1) Performing outlier analysis on the user data set, determining a target outlier, and performing missing value analysis on the user data set through the outlier to determine a target missing value;

(2) And based on the target missing value, performing data filling processing on the user data set to obtain the data set to be analyzed.

In this step, it should be noted that data quality directly determines the prediction and generalization capabilities of the system, and data preprocessing is a precondition for guaranteeing data quality, so data preprocessing is crucial to the modeling work. Because the data was collected over a long time span and in a complex manner, a high degree of missingness is unavoidable, and feature data with an extremely high degree of missingness affects the modeling effect. The invention therefore directly deletes feature columns in which more than 80% of the values are missing. This operation deletes 18 feature columns in total and yields a data set containing 76 features. Since the remaining data is all numerical, box plots are first drawn to detect outliers, and the outliers are then confirmed in combination with expert experience. Because users are not correlated with one another, it is quite possible for the number of Apps of a certain type that a user downloads and installs in a given period to be very large or very small, so the outliers flagged by the box plots need to be confirmed according to expert experience. Data confirmed as outliers is treated as missing values, for which the invention tries seven filling methods: fixed value filling, mean filling, median filling, mode filling, interpolation filling, previous-value filling, and next-value filling.
According to the model effect, fixed value, mean, median, and mode filling are clearly better than interpolation, previous-value, and next-value filling, by about 5% to 10%. This is because there is almost no correlation between the user records in the invention, so filling a user's data from the previous or next user's data is not appropriate. Among the four better methods, mean filling gives a model effect about 1% to 2% better than the other three. The invention therefore selects column mean filling as its missing value filling method.

Finally, the abnormal value analysis is carried out on the user data set, the target abnormal value is determined, the missing value analysis is carried out on the user data set through the abnormal value, the target missing value is determined, and the data filling processing is carried out on the user data set based on the target missing value, so that the data set to be analyzed is obtained.
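Comparing filling methods by their effect on the model can be sketched as follows. Six of the seven methods named above are implemented (interpolation is omitted for brevity), and `evaluate` is a hypothetical caller-supplied function that stands in for training the model on the filled column and returning its score. The toy scoring function in the usage lines is purely illustrative and says nothing about the reported results.

```python
from statistics import mean, median, mode


def _present(col):
    """Values of a column that are not missing (None)."""
    return [v for v in col if v is not None]


def fill_fixed(col, value=0):
    """Replace each missing value with a fixed constant."""
    return [value if v is None else v for v in col]


def fill_stat(col, stat):
    """Replace each missing value with a statistic of the present values."""
    fill_value = stat(_present(col))
    return [fill_value if v is None else v for v in col]


def fill_previous(col):
    """Carry the last seen value forward; leading gaps stay None."""
    filled, last = [], None
    for v in col:
        if v is not None:
            last = v
        filled.append(last)
    return filled


def fill_next(col):
    """Carry the next seen value backward; trailing gaps stay None."""
    return fill_previous(col[::-1])[::-1]


STRATEGIES = {
    "fixed": fill_fixed,
    "mean": lambda col: fill_stat(col, mean),
    "median": lambda col: fill_stat(col, median),
    "mode": lambda col: fill_stat(col, mode),
    "previous": fill_previous,
    "next": fill_next,
}


def choose_strategy(col, evaluate):
    """Return the name of the strategy whose filled column scores highest."""
    return max(STRATEGIES, key=lambda name: evaluate(STRATEGIES[name](col)))


column = [1, None, 3, 3]
# Toy stand-in for the model effect: prefer a filled column whose sum is close to 9.
best = choose_strategy(column, evaluate=lambda filled: -abs(sum(filled) - 9))
```

With a real model, `evaluate` would return a validation metric, and the comparison would be repeated per column as the text describes.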

In a specific embodiment, the process of executing step S104 may specifically include the following steps:

(1) Redundant feature elimination is performed on the data set to be analyzed through a filtering type feature extraction algorithm to obtain a feature set to be processed;

(2) Feature correlation analysis is performed on the feature set to be processed to obtain a feature correlation analysis result;

(3) Feature extraction is performed on the feature set to be processed based on the feature correlation analysis result to obtain a first candidate feature set.

It should be noted that feature extraction selects the features that are helpful to modeling, and the invention adopts two methods, filtering and wrapping, to screen out features irrelevant to modeling. Redundant features increase the computation of the model, slow down training, and may even cause over-fitting. Screening out these features reduces unnecessary resource consumption and improves the prediction performance of the system model. Three statistics-based feature selection methods, namely the decile distribution, the rank sum test and the standard method, are aimed at screening redundant features. In addition, the degree of correlation between two variables can also be used as a basis for feature screening: the stronger the correlation between two variables, the more information they share, so very strongly correlated features are candidates for screening. The feature set is therefore updated by the Pearson correlation coefficient method and the maximal information coefficient method, which consider the linear and nonlinear relationships between features respectively. Table 3 shows the relationship between the value of the correlation coefficient and the strength of correlation between features. In the invention, redundant feature elimination is performed on the data set to be analyzed through a filtering type feature extraction algorithm to obtain a feature set to be processed, feature correlation analysis is performed on the feature set to be processed to obtain a feature correlation analysis result, and feature extraction is performed on the feature set to be processed based on the feature correlation analysis result to obtain a first candidate feature set.

TABLE 3 Relationship between correlation coefficient values and the strength of correlation between features

The decile distribution is a decile-based feature selection method that intuitively reflects how well each feature distinguishes negative samples from positive samples. FIGS. 2 and 3 show the decile distribution diagrams of a redundant feature and a non-redundant feature, respectively. In FIG. 2, naw represents the number of installations of Apps of the express logistics or other express class in the last 7 days, and Deciles represents the deciles of the redundant feature; in FIG. 3, ECA08 represents the number of times the user opened Apps of the e-commerce, online e-commerce behavior, or vertical e-commerce (digital 3C) class in the last 7 days, and Deciles represents the deciles of the non-redundant feature. It can be observed from the figures that if the cumulative distributions of the negative and positive samples of a feature are the same, the feature has no effect on distinguishing negative samples from positive samples, indicating that it is a redundant feature; in contrast, if the cumulative distributions are inconsistent, the feature can distinguish negative samples from positive samples, indicating that it is not redundant, and the feature is retained. The rank sum test and the standard method, like the decile distribution, are non-parametric statistical methods and can likewise intuitively reflect the distinguishing effect of each feature on negative and positive samples. Combining the three statistical methods, 29 redundant features are eliminated, giving a candidate feature set of 47 features.

The Pearson correlation coefficient measures the linear correlation between different features through the covariance and standard deviations of two feature variables. The invention defines the value range of the Pearson correlation coefficient as [0,1], and the relationship between the value of the correlation coefficient and the strength of correlation between features is shown in Table 3. By counting the very strongly correlated feature pairs, multiple such pairs can be found and very strongly correlated feature sets can be constructed, that is, sets in which any two features are very strongly correlated. In some feature sets, the correlation coefficients of individual feature pairs are less than 0.8 but higher than 0.7; the invention still treats such a set as a very strongly correlated feature set. Further, correlation coefficients among the features in the data to be analyzed are calculated with the maximal information coefficient (MIC), which measures the nonlinear correlation between different features by computing the mutual information between two variables in combination with their probabilities. The value range of MIC is defined as [0,1], and the relationship between the MIC value and the strength of correlation between features is similar to that of the Pearson correlation coefficient method. The statistics of the very strongly linearly and nonlinearly correlated feature pairs and feature sets under the Pearson correlation coefficient method and the maximal information coefficient method are shown in Table 4. Finally, 22 very strongly correlated features are removed using the Pearson correlation coefficient method and the maximal information coefficient method, covering both linear and nonlinear relationships, and the updated feature set contains 25 features.

TABLE 4 extremely strong linear and nonlinear correlation feature pairs and feature set statistics
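The linear part of this correlation screening can be sketched as below. It is an illustration under stated assumptions rather than the procedure of the invention: the feature names and values are hypothetical, the absolute value of the coefficient is used so that it falls in [0,1], the 0.8 cut-off for "very strong" is assumed from the discussion above, and a simple greedy rule (drop the later feature of each very strongly correlated pair) stands in for the feature-set construction summarized in Table 4. MIC is not shown.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def prune_correlated(features, threshold=0.8):
    """Greedily drop the later feature of every pair with |r| >= threshold."""
    dropped = set()
    for a, b in combinations(features, 2):
        if a in dropped or b in dropped:
            continue
        if abs(pearson(features[a], features[b])) >= threshold:
            dropped.add(b)
    return [name for name in features if name not in dropped]


# Hypothetical features: the second is almost a multiple of the first.
features = {
    "opens_7d": [1, 2, 3, 4, 5, 6],
    "opens_14d": [2, 4, 6, 8, 10, 13],
    "device_price": [5, 1, 4, 2, 6, 3],
}
kept = prune_correlated(features)
```

Here opens_14d is removed because it carries almost the same information as opens_7d, while device_price is kept.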

In a specific embodiment, as shown in fig. 4, the process of performing step S105 may specifically include the following steps:

S201, carrying out importance analysis on each first candidate feature in the first candidate feature set through a wrapped feature extraction algorithm, and determining importance data of each first candidate feature;

S202, performing second feature extraction processing on the first candidate feature set based on the importance data of each first candidate feature to obtain a second candidate feature set.

Specifically, the importance of each feature to the system decision can be intuitively reflected by counting the number of times the feature is used for splitting across all the trees. FIG. 5 shows the importance distribution of the features remaining after the filtering type feature selection, where Feature Importance in the figure denotes the feature importance value of the classification tree model. According to the effect of the CatBoost model, the top 20 features by feature importance are finally selected as the preferred feature set. Finally, second feature extraction processing is performed on the first candidate feature set based on the importance data of each first candidate feature to obtain a second candidate feature set; Table 5 lists a description of each feature in the second candidate feature set.

TABLE 5 list of descriptions of each feature in the second candidate feature set

It should be noted that the second candidate features, the category each feature belongs to, and their specific meanings are shown in Table 5, with the features listed in descending order of importance in the classification tree model. It can first be seen from Table 5 that the market price of the device used by a user is important for predicting whether the user has good credit.

Statistics show that about 73% of users have good credit when the price of the device is higher than 2500 yuan, whereas only about 21% of users have good credit when the price of the device is lower than 2500 yuan. Likewise, the statistics show that about 67% of users have good credit when the number of the user's recent visits to convenience store type shopping places exceeds 10, and about 26% of users have good credit when that number is less than 10. In summary, when evaluating whether a user has good credit, a comprehensive evaluation can take into account the price of the device used by the user or the number of the user's recent visits to convenience store type shopping places.

In a specific embodiment, the process of executing step S106 may specifically include the following steps:

(1) Carrying out data drift detection on the second candidate feature set through a preset adversarial classifier to generate a data drift detection result;

(2) Performing feature screening processing on the second candidate feature set through the data drift detection result to obtain a target feature set.
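The two steps above can be illustrated with a deliberately simple sketch. The text does not specify the classifier, so this example assumes the most basic one possible: each feature's own value is used as the score for telling "training" rows from "recent" rows, and the separability is measured by AUC. A value near 0.5 means the two periods are indistinguishable on that feature; a value near 1.0 indicates drift. The feature names, the toy values, and the 0.7 cut-off are all hypothetical.

```python
def auc(scores_neg, scores_pos):
    """Probability that a random positive scores above a random negative (ties count 0.5)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            wins += 1.0 if p > n else 0.5 if p == n else 0.0
    return wins / (len(scores_pos) * len(scores_neg))


def drift_score(train_values, recent_values):
    """Separability of 'train' vs 'recent' rows using the feature itself as the score."""
    a = auc(train_values, recent_values)
    return max(a, 1.0 - a)  # 0.5 = indistinguishable, 1.0 = fully separable


def screen_features(train, recent, max_score=0.7):
    """Keep features whose train/recent separability stays at or below max_score."""
    return [f for f in train if drift_score(train[f], recent[f]) <= max_score]


# Hypothetical data: one feature keeps its distribution, the other shifts upward.
train = {"stable": [1, 2, 3, 4], "drifting": [1, 2, 3, 4]}
recent = {"stable": [2, 1, 4, 3], "drifting": [7, 8, 9, 10]}
target_features = screen_features(train, recent)
```

A production version would more likely train a multivariate model on all features at once and inspect which features drive its ability to separate the two periods; the per-feature AUC here only shows the idea.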

In a specific embodiment, after performing step S106, before performing step S107, the method further includes the following steps:

(1) Performing initial hyper-parameter analysis on the initial credit data analysis model to determine an initial hyper-parameter combination;

(2) Carrying out prior probability distribution analysis on the initial hyper-parameter combination to determine prior probability distribution data;

(3) Carrying out model training on the initial credit data analysis model through the second candidate feature set to generate a training set and a test set;

(4) Carrying out posterior probability distribution analysis on the initial hyper-parameter combination through the training set and the test set to determine posterior probability distribution data;

(5) Performing iterative analysis on the initial hyper-parameter combination based on the posterior probability distribution data to determine an optimal hyper-parameter combination;

(6) Carrying out parameter configuration on the initial credit data analysis model based on the optimal hyper-parameter combination to obtain a target credit data analysis model.

Specifically, in this step, parameter tuning uses Bayesian optimization: Bayes' theorem is used to estimate the posterior distribution of the objective function from the data observed so far, and the hyper-parameter combination to sample next is then selected according to that distribution. Because it makes full use of the information from previous sampling points, the method can adjust the current parameters better and quickly find the parameters that maximize the objective function globally. Compared with grid search, Bayesian optimization needs fewer iterations and runs faster. Given a specific range for each parameter, multiple parameters can be tuned at a time, so Bayesian optimization does not suffer from dimensional explosion when there are many parameters.
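In symbols, one iteration of this procedure can be written generically as follows. This is a textbook-style sketch, not the specific formulation of the invention: f denotes the objective (for example, the KS obtained with hyper-parameter combination θ), D_t is the set of combinations evaluated so far together with their results, Θ is the given parameter range, and α is an acquisition function, which the text does not name and is therefore an assumption here.

```latex
p(f \mid D_t) \;\propto\; p(D_t \mid f)\, p(f), \qquad
\theta_{t+1} = \arg\max_{\theta \in \Theta} \alpha\bigl(\theta \mid p(f \mid D_t)\bigr), \qquad
D_{t+1} = D_t \cup \{(\theta_{t+1},\, y_{t+1})\}, \quad y_{t+1} = f(\theta_{t+1})
```

The prior p(f) corresponds to the prior probability distribution data, p(f | D_t) to the posterior probability distribution data, and repeating the update until the evaluation budget is used up corresponds to the iterative analysis that yields the optimal hyper-parameter combination.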

In the embodiment of the invention, initial hyper-parameter analysis is performed on the initial credit data analysis model to determine an initial hyper-parameter combination; prior probability distribution analysis is performed on the initial hyper-parameter combination to determine prior probability distribution data; model training is performed on the initial credit data analysis model through the second candidate feature set to generate a training set and a test set; posterior probability distribution analysis is performed on the initial hyper-parameter combination through the training set and the test set to determine posterior probability distribution data; iterative analysis is performed on the initial hyper-parameter combination based on the posterior probability distribution data to determine an optimal hyper-parameter combination; and parameter configuration is performed on the initial credit data analysis model based on the optimal hyper-parameter combination to obtain a target credit data analysis model. The important parameters of the model, such as learning rate, tree depth, sample sampling rate and column sampling rate, are tuned, and the parameter ranges and final tuning results are shown in Table 6.

TABLE 6 Parameter ranges and tuning results of Bayesian optimization

Further, as shown in Table 7, the initial system model has a training set KS of 0.1925 and a test set KS of 0.1523. After parameter tuning by Bayesian optimization, the training set KS is 0.1728 and the test set KS is 0.1638. Parameter tuning effectively reduces over-fitting of the system, and the test set KS is improved by about 7% relative to the initial system model.


TABLE 7 comparison of Credit data analysis model Effect before and after parameter adjustment
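The figures quoted for Table 7 can be checked with a few lines of arithmetic. The sketch below uses only the four KS values stated in the text; treating the difference between training and test KS as a rough over-fitting indicator is an interpretation made for this example, not a definition from the source.

```python
# KS values quoted in the text for the model before and after tuning.
initial = {"train": 0.1925, "test": 0.1523}
tuned = {"train": 0.1728, "test": 0.1638}

# Relative change of the test-set KS after tuning.
test_gain = (tuned["test"] - initial["test"]) / initial["test"]

# Train/test gap, used here as a rough indicator of over-fitting.
gap_before = initial["train"] - initial["test"]
gap_after = tuned["train"] - tuned["test"]

print(f"test KS gain: {test_gain:.1%}")
print(f"train-test gap: {gap_before:.4f} -> {gap_after:.4f}")
```

The relative gain works out to roughly 7.5%, consistent with the "about 7%" stated above, and the train/test gap shrinks from about 0.040 to about 0.009.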

The system evaluation index is KS (Kolmogorov-Smirnov), which is the maximum, over all bins, of the absolute difference between the cumulative positive sample ratio and the cumulative negative sample ratio. In a risk control system, the magnitude of the KS value represents the discrimination of the system: the larger the KS value, the stronger the risk ranking capability of the system. The prediction capability of the system refers to its prediction accuracy, and the better the prediction capability, the stronger the discrimination of the system; the generalization capability of the system refers to its prediction capability on new data sets that follow the same regularity; the stability of the system refers to the fluctuation of its prediction results under different random sampling results.
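The KS definition above can be made concrete with a short sketch. It assumes that samples are ordered by model score and cut into equal-sized bins, and that label 1 marks a positive sample and 0 a negative one; the scores and labels are made-up toy values, and the function name is hypothetical.

```python
def ks_statistic(scores, labels, n_bins=10):
    """Max |cumulative positive ratio - cumulative negative ratio| over score-ordered bins."""
    ranked = sorted(zip(scores, labels))
    n = len(ranked)
    total_pos = sum(labels)
    total_neg = n - total_pos
    # Row counts at which each of the n_bins equal-sized bins ends.
    boundaries = {round(n * k / n_bins) for k in range(1, n_bins + 1)}
    cum_pos = cum_neg = 0
    ks = 0.0
    for i, (_, label) in enumerate(ranked, start=1):
        cum_pos += label
        cum_neg += 1 - label
        if i in boundaries:
            ks = max(ks, abs(cum_pos / total_pos - cum_neg / total_neg))
    return ks


# Toy example: ten samples, five positives (1) and five negatives (0).
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]
ks = ks_statistic(scores, labels)
```

For this toy data the largest gap between the two cumulative ratios is 0.6. A feature or score whose positive and negative cumulative distributions coincide would give a KS near 0, which is the same reasoning used for the decile distribution diagrams.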

In the embodiment of the invention, the prediction performance of five algorithms is compared on the user behavior data set; Table 8 compares the performance of the different machine learning models.

First, the performance of the ensemble models is significantly better than that of the single models. This is because an ensemble model aims to reduce the prediction bias or variance of the system by combining several models according to a certain strategy, thereby improving the prediction performance of the system. Second, among the ensemble models, the KS of the CatBoost and LightGBM models, which adopt the boosting concept, is clearly higher than that of Random Forest, which adopts the bagging concept. On the one hand, the GBDT algorithm combines multiple base learners to improve the generalization and robustness of the system model and thereby the prediction accuracy of the model, whereas Random Forest focuses only on improving the generalization and robustness of the model. On the other hand, bagging builds different models in parallel, whereas boosting works serially and focuses on improving accuracy, with each subsequent model taking full account of the training result of the previous one. Finally, the CatBoost model shows significant advantages over LightGBM in both the training and testing phases and generalizes better on the OOT set. This is because, compared with LightGBM, the CatBoost algorithm can process categorical features quickly and efficiently, and it adopts the Ordered boosting method to obtain an unbiased estimate of the gradient, which addresses the problems of gradient bias and prediction shift and effectively improves the prediction performance and generalization capability of the system model.

In terms of training time, Table 8 shows the average training time of the five models over 50 training runs. First, Table 8 shows that the Random Forest and CatBoost models have the longest training times. Because each tree in Random Forest considers all features at each split, a longer training time is required. The advantage of CatBoost lies in its fast processing of categorical features, so if the data contain more categorical features, the training time of CatBoost is greatly shortened. Second, for the LightGBM and CatBoost models, which have good prediction performance, although the training time of the CatBoost model is about 10 times that of the LightGBM model, the test set prediction performance of the CatBoost model is 36.98% higher than that of LightGBM, and its OOT performance is also better. Finally, according to discussion with experts, a training time of about 7 s for the CatBoost model is acceptable in an actual production environment.

In terms of test time, Table 8 shows the average test time of the five models over 50 test runs. Table 8 shows that the test times of the four models CatBoost, LightGBM, Gaussian Naive Bayes and Gaussian Mixture Model are an order of magnitude shorter than that of Random Forest, with the test time advantage of the CatBoost model being particularly pronounced.

In summary, by comparing the model effects of different algorithms, the credit data analysis system based on user behavior data is established using CatBoost, which not only shows excellent prediction capability but also has better generalization capability and remarkable stability. As shown in Table 8, the KS of the test set reaches 0.1638, indicating better generalization capability. The stability of the system refers to the fluctuation of its prediction results under different random sampling results. Because the invention sets two parameters in the CatBoost model, the sample sampling ratio and the feature sampling ratio, the training objects differ in each iteration of the training process, depending on the random seed that is set. The stability of the system can therefore be observed through the change in KS under different random seeds.

TABLE 8 comparison of different machine learning model performances

The method of the present invention has been described in detail above. The description of the embodiments is only intended to help in understanding the method of the present invention and its core concept; meanwhile, since those skilled in the art may make changes to the specific embodiments and the scope of application in accordance with the ideas of the present invention, the content of this description should not be construed as limiting the present invention.

The embodiment of the invention also provides a credit data analysis system based on the user behavior data, as shown in fig. 6, the credit data analysis system based on the user behavior data specifically comprises:

the data acquisition module 3001 is configured to acquire a plurality of user behavior data, perform tag matching on the plurality of user behavior data, and determine tag data corresponding to each user behavior data;

the data integration module 3002 is configured to integrate the plurality of user behavior data and tag data corresponding to each piece of user behavior data to obtain a user data set;

the data processing module 3003 is configured to perform data preprocessing on the user data set to obtain a data set to be analyzed;

the first extraction module 3004 is configured to perform a first feature extraction process on the data set to be analyzed through a filtering type feature extraction algorithm, so as to obtain a first candidate feature set;

the second extraction module 3005 is configured to perform a second feature extraction process on the first candidate feature set by using a wrapped feature extraction algorithm, so as to obtain a second candidate feature set;

the feature screening module 3006 is configured to perform data drift detection and feature screening on the second candidate feature set to obtain a target feature set;

the credit analysis module 3007 is configured to perform credit analysis on the target feature set through a preset target credit analysis model, obtain a credit analysis result, and transmit the credit analysis result to a preset data processing terminal.

Optionally, the data acquisition module 3001 is specifically configured to: collecting a plurality of user behavior data, extracting time data from each user behavior data, and determining time data corresponding to each user behavior data; and carrying out tag matching on a plurality of user behavior data based on the time data corresponding to each user behavior data, and determining the tag data corresponding to each user behavior data.

Optionally, the data processing module 3003 is specifically configured to: performing outlier analysis on the user data set, determining a target outlier, and performing missing value analysis on the user data set through the outlier to determine a target missing value; and based on the target missing value, performing data filling processing on the user data set to obtain a data set to be analyzed.

Optionally, the first extraction module 3004 is specifically configured to: perform redundant feature elimination on the data set to be analyzed through the filtering type feature extraction algorithm to obtain a feature set to be processed; perform feature correlation analysis on the feature set to be processed to obtain a feature correlation analysis result; and perform feature extraction on the feature set to be processed based on the feature correlation analysis result to obtain a first candidate feature set.

Optionally, the second extraction module 3005 is specifically configured to: carry out importance analysis on each first candidate feature in the first candidate feature set through a wrapped feature extraction algorithm, and determine importance data of each first candidate feature; and carry out second feature extraction processing on the first candidate feature set based on the importance data of each first candidate feature to obtain a second candidate feature set.

Optionally, the feature screening module 3006 is specifically configured to: carry out data drift detection on the second candidate feature set through a preset adversarial classifier to generate a data drift detection result; and perform feature screening processing on the second candidate feature set through the data drift detection result to obtain a target feature set.

Optionally, the credit data analysis system based on the user behavior data further comprises:

the parameter analysis module 3008 is configured to perform initial hyper-parameter analysis on the initial credit data analysis model, and determine an initial hyper-parameter combination;

the distribution analysis module 3009 is configured to perform prior probability distribution analysis on the initial hyper-parameter combination, and determine prior probability distribution data;

the model training module 3010 is configured to perform model training on the initial credit data analysis model through the second candidate feature set, so as to generate a training set and a test set;

the probability analysis module 3011 is configured to perform posterior probability distribution analysis on the initial hyper-parameter combination through the training set and the test set, and determine posterior probability distribution data;

the iteration analysis module 3012 is configured to perform iterative analysis on the initial hyper-parameter combination based on the posterior probability distribution data, and determine an optimal hyper-parameter combination;

the parameter configuration module 3013 is configured to perform parameter configuration on the initial credit data analysis model based on the optimal hyper-parameter combination to obtain the target credit data analysis model.

Through the cooperation of the above modules, a plurality of user behavior data are collected and tag matching is performed to determine the corresponding tag data; the plurality of user behavior data and the tag data are integrated to obtain a user data set; data preprocessing is performed on the user data set to obtain a data set to be analyzed; first feature extraction processing is performed on the data set to be analyzed through a filtering type feature extraction algorithm to obtain a first candidate feature set; second feature extraction processing is performed on the first candidate feature set through a wrapped feature extraction algorithm to obtain a second candidate feature set; data drift detection and feature screening processing are performed on the second candidate feature set to obtain a target feature set; and credit data analysis is performed on the target feature set to obtain a credit data analysis result, which is transmitted to a preset data processing terminal. The embodiment of the invention focuses on the influence of user behavior data on credit status, and establishes a credit data analysis system based on user behavior data with higher accuracy without having to acquire financial attribute data, which is costly and difficult to obtain. On the one hand, in the embodiment of the invention, categorical features are converted directly into numerical features without operations such as one-hot encoding, which avoids increasing the data dimension and is fast and efficient. On the other hand, compared with the traditional gradient estimation method, the method performs unbiased estimation of the gradient, which reduces the influence of estimation bias and addresses the problems of gradient bias and prediction shift, thereby effectively improving the generalization capability of the system model. Therefore, the credit status of a user can be predicted at a higher training speed, with more accurate prediction capability and better generalization performance, so that the accuracy of credit data analysis based on user behavior data is further improved.

The above embodiments are only for illustrating the technical aspects of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the embodiments, it should be understood by those of ordinary skill in the art that: modifications and equivalents may be made to the specific embodiments of the invention without departing from the spirit and scope of the invention, which is intended to be covered by the scope of the claims.

Claims

1. A credit data analysis method based on user behavior data, comprising:

collecting a plurality of user behavior data, performing tag matching on the plurality of user behavior data, and determining tag data corresponding to each user behavior data;

performing data integration on the plurality of user behavior data and the tag data corresponding to each user behavior data to obtain a user data set;

performing data preprocessing on the user data set to obtain a data set to be analyzed;

performing first feature extraction processing on the data set to be analyzed through a filtering type feature extraction algorithm to obtain a first candidate feature set;

performing second feature extraction processing on the first candidate feature set through a wrapped feature extraction algorithm to obtain a second candidate feature set;

performing data drift detection and feature screening processing on the second candidate feature set to obtain a target feature set;

and carrying out credit data analysis on the target feature set through a preset target credit data analysis model to obtain a credit data analysis result, and transmitting the credit data analysis result to a preset data processing terminal.

2. The credit data analysis method based on user behavior data according to claim 1, wherein the step of collecting a plurality of user behavior data and performing tag matching on a plurality of user behavior data to determine tag data corresponding to each of the user behavior data comprises:

collecting a plurality of user behavior data, extracting time data from each user behavior data, and determining time data corresponding to each user behavior data;

and carrying out tag matching on a plurality of user behavior data based on the time data corresponding to each user behavior data, and determining the tag data corresponding to each user behavior data.

3. The credit data analysis method based on user behavior data according to claim 1, wherein the step of performing data preprocessing on the user data set to obtain a data set to be analyzed includes:

performing outlier analysis on the user data set, determining a target outlier, and performing missing value analysis on the user data set through the outlier to determine a target missing value;

and based on the target missing value, performing data filling processing on the user data set to obtain a data set to be analyzed.

4. The credit data analysis method based on user behavior data according to claim 1, wherein the step of performing a first feature extraction process on the data set to be analyzed by a filter feature extraction algorithm to obtain a first candidate feature set includes:

redundant feature elimination is carried out on the data set to be analyzed through the filtering type feature extraction algorithm, and a feature set to be processed is obtained;

performing feature correlation analysis on the feature set to be processed to obtain feature correlation analysis results;

and extracting the characteristics of the to-be-processed characteristic set according to the characteristic correlation analysis result to obtain a first candidate characteristic set.

5. The credit data analysis method based on user behavior data according to claim 1, wherein the step of performing a second feature extraction process on the first candidate feature set by a wrapped feature extraction algorithm to obtain a second candidate feature set includes:

carrying out importance analysis on each first candidate feature in the first candidate feature set through a wrapped feature extraction algorithm, and determining importance data of each first candidate feature;

and carrying out second feature extraction processing on the first candidate feature set based on the importance data of each first candidate feature to obtain a second candidate feature set.

6. The credit data analysis method based on user behavior data according to claim 1, wherein the step of performing data drift detection and feature screening processing on the second candidate feature set to obtain a target feature set includes:

carrying out data drift detection on the second candidate feature set through a preset adversarial classifier to generate a data drift detection result;

and performing feature screening processing on the second candidate feature set through the data drift detection result to obtain a target feature set.

7. The credit data analysis method based on user behavior data according to claim 1, wherein after the step of performing data drift detection and feature screening processing on the second candidate feature set to obtain a target feature set, and before the step of performing credit data analysis on the target feature set through a preset target credit data analysis model to obtain a credit data analysis result and transmitting the credit data analysis result to a preset data processing terminal, the method further comprises:

performing initial hyper-parameter analysis on the initial credit data analysis model to determine an initial hyper-parameter combination;

carrying out prior probability distribution analysis on the initial hyper-parameter combination to determine prior probability distribution data;

carrying out model training on the initial credit data analysis model through the second candidate feature set to generate a training set and a test set;

carrying out posterior probability distribution analysis on the initial hyper-parameter combination through the training set and the test set to determine posterior probability distribution data;

performing iterative analysis on the initial hyper-parameter combination based on the posterior probability distribution data to determine an optimal hyper-parameter combination;

and carrying out parameter configuration on the initial credit data analysis model based on the optimal hyper-parameter combination to obtain the target credit data analysis model.

8. A credit data analysis system based on user behavior data for performing the credit data analysis method based on user behavior data according to any one of claims 1 to 7, comprising: