CN115222177A - Service data processing method and device, computer equipment and storage medium - Google Patents

Service data processing method and device, computer equipment and storage medium

Info

Publication number
CN115222177A
Authority
CN
China
Prior art keywords
characteristic
variable
variables
feature
combination
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110441885.9A
Other languages
Chinese (zh)
Inventor
路遥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Soxinda Beijing Data Technology Co ltd
Original Assignee
Soxinda Beijing Data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Soxinda Beijing Data Technology Co ltd filed Critical Soxinda Beijing Data Technology Co ltd
Priority to CN202110441885.9A priority Critical patent/CN115222177A/en
Publication of CN115222177A publication Critical patent/CN115222177A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393Score-carding, benchmarking or key performance indicator [KPI] analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"

Abstract

The application relates to a business data processing method, a business data processing apparatus, a computer device and a storage medium. The method comprises the following steps: acquiring characteristic variables of the business data from a database; obtaining discrete characteristic variables according to the characteristic variables; performing information gain rate ranking, principal component analysis processing and information quantity calculation on the discrete characteristic variables to obtain one-dimensional characteristic variables; taking the one-dimensional characteristic variables as characteristic variables to be processed, and performing information gain rate ranking, principal component analysis processing and information quantity calculation on them to obtain screened characteristic variables; cross-combining the screened characteristic variables to obtain feature combination variables; if the information gain rate of the feature combination variables does not meet a preset condition, taking the feature combination variables and the one-dimensional characteristic variables as new characteristic variables to be processed, and obtaining feature combination variables again through information gain rate ranking, principal component analysis processing, information quantity calculation and cross combination; and if the preset condition is met, taking the feature combination variables as target combination variables of the business data. With this method, the prediction capability and accuracy of a business model can be improved.

Description

Service data processing method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technology application, and in particular, to a method and an apparatus for processing service data, a computer device, and a storage medium.
Background
With the development of financial business modeling technology, an actual business typically has anywhere from a few to several dozen basic variables. Most of these variables, such as user addresses, have no direct practical meaning: they are categorical variables with many attribute values and are not suitable for direct use in modeling.
In the related art, this problem is addressed with the feature derivation technique of feature engineering. After variables without practical meaning are transformed or combined in certain ways, they can carry high information value and can then be applied to business modeling.
However, when feature derivation technology of feature engineering is adopted to perform feature derivation on feature variables of a certain amount of business data, a plurality of feature variables are cross-combined, resulting in explosion of the number of combinations and low derivation efficiency.
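The scale of this combinatorial explosion can be illustrated with a short calculation. The sketch below counts how many unordered cross-combinations exist when 2 up to `max_order` variables are combined; the function name and the choice of 100 variables (the figure used in the worked example later in the description) are illustrative assumptions, not part of the claimed method.

```python
from math import comb


def count_cross_combinations(n_variables: int, max_order: int) -> int:
    """Count the ways to combine 2..max_order variables out of n_variables."""
    return sum(comb(n_variables, order) for order in range(2, max_order + 1))


# Hypothetical setting: 100 basic variables, combined up to order 2, 3 and 4.
for max_order in (2, 3, 4):
    print(max_order, count_cross_combinations(100, max_order))
```

Even at order 4 the count already exceeds four million candidate combinations, which is why screening before each cross-combination matters.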
Disclosure of Invention
In view of the foregoing, it is necessary to provide a business data processing method, apparatus, computer device and storage medium.
A method for processing service data comprises the following steps:
acquiring a characteristic variable of the business data from a database of the business data; obtaining discrete characteristic variables according to the characteristic variables; sorting, principal component analysis processing and information quantity calculation are carried out on the information gain rate of the discrete characteristic variable to obtain a one-dimensional characteristic variable; taking the one-dimensional characteristic variable as a characteristic variable to be processed, and performing sorting, principal component analysis processing and information quantity calculation on the information gain rate of the characteristic variable to be processed to obtain a screening characteristic variable; cross-combining the screened characteristic variables to obtain characteristic combination variables; if the information gain rate of the feature combination variable does not meet the preset condition, taking the feature combination variable and the one-dimensional feature variable as new feature variables to be processed, and performing information gain rate sequencing, principal component analysis processing, information quantity calculation and cross combination on the new feature variables to be processed to obtain the feature combination variable; and if the information gain rate of the characteristic combination variable meets a preset condition, taking the characteristic combination variable as a target combination variable of the service data.
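The application does not specify how a cross-combined variable is encoded. As a minimal sketch, assuming each screened variable is a column of discrete values and that a combination is encoded by joining the values row by row, the cross-combination step might look as follows. The function name `cross_combine`, the `*` and `|` separators and the sample columns are all hypothetical.

```python
from itertools import combinations


def cross_combine(columns, order=2):
    """Build one combined column for every group of `order` discrete columns.

    columns: dict mapping a variable name to its list of discrete values.
    """
    names = sorted(columns)
    n_rows = len(columns[names[0]])
    combined = {}
    for group in combinations(names, order):
        # Join the row values of the grouped variables into one new category.
        combined["*".join(group)] = [
            "|".join(str(columns[name][row]) for name in group)
            for row in range(n_rows)
        ]
    return combined


screened = {
    "housing_loan": ["yes", "no", "yes"],
    "marital_status": ["single", "married", "married"],
}
print(cross_combine(screened))
```

Each combined column can then be fed back into the ranking and screening steps like any other discrete variable.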
In one embodiment, deriving the discrete characteristic variable from the characteristic variable includes:
and when the characteristic variable is a continuous characteristic variable, converting the continuous characteristic variable into a discrete characteristic variable.
In one embodiment, the obtaining a one-dimensional characteristic variable by sorting, principal component analysis processing, and information amount calculation of the information gain rate of the discrete characteristic variable includes:
inputting the discrete characteristic variable into a first decision tree to calculate the information gain rate of the discrete characteristic variable, and sequencing the information gain rate to obtain a first sequencing characteristic variable; performing principal component analysis processing on the discrete characteristic variable to obtain a first key characteristic variable; and screening the first sequencing feature variable according to the first key feature variable and calculating the information quantity to obtain a one-dimensional feature variable, wherein the dimension of the one-dimensional feature variable is consistent with the depth of the first decision tree.
In one embodiment, the screening and information amount calculation of the first ranking feature variable according to the first key feature variable to obtain a one-dimensional feature variable includes:
obtaining feature variables ranked before the first key feature variable from the first ranking feature variables to obtain first feature variables, wherein the first feature variables comprise the first key feature variables; obtaining the remaining characteristic variables except the first characteristic variable in the first sequence characteristic variables, wherein the ranking of the remaining characteristic variables is behind the first key characteristic variable; obtaining a second characteristic variable from the remaining characteristic variables; and taking the first characteristic variable and the second characteristic variable as one-dimensional characteristic variables.
In one embodiment, obtaining the second feature variable from the remaining feature variables includes:
and comparing the information quantity of the residual characteristic variable with an information quantity threshold value, and taking the residual characteristic variable with the information quantity larger than the information quantity threshold value as a second characteristic variable.
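The screening rule described in the preceding embodiments can be sketched as follows: keep every variable ranked at or before the key variable, then keep any later-ranked variable whose information quantity exceeds the threshold. This is a sketch under stated assumptions; the function name, the variable names and the threshold of 0.1 are hypothetical, and the ranking and key variable are taken as given inputs.

```python
def screen_features(ranked, key_feature, information_value, iv_threshold):
    """Select variables from a gain-rate ranking.

    ranked: variable names sorted by descending information gain rate.
    key_feature: the key variable identified by principal component analysis.
    information_value: dict mapping a variable name to its information quantity.
    """
    cut = ranked.index(key_feature) + 1
    first = ranked[:cut]        # ranked before the key variable, key included
    remaining = ranked[cut:]    # ranked after the key variable
    second = [name for name in remaining if information_value[name] > iv_threshold]
    return first + second


ranked = ["call_duration", "age", "housing_loan", "contact_month", "marital_status"]
iv = {"call_duration": 0.60, "age": 0.30, "housing_loan": 0.05,
      "contact_month": 0.25, "marital_status": 0.02}
print(screen_features(ranked, "age", iv, iv_threshold=0.1))
```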
In one embodiment, the step of obtaining the screening feature variable by using the one-dimensional feature variable as the feature variable to be processed and performing sorting, principal component analysis processing and information amount calculation on the information gain rate of the feature variable to be processed includes:
taking the one-dimensional characteristic variable as a characteristic variable to be processed, inputting the characteristic variable to be processed into a second decision tree to calculate the information gain rate of the characteristic variable to be processed, and sequencing the information gain rate to obtain a second sequencing characteristic variable; performing principal component analysis processing on the characteristic variable to be processed to obtain a second key characteristic variable; and screening the second sorting characteristic variables and calculating the information quantity according to the second key characteristic variables to obtain screened characteristic variables, wherein the depth of the second decision tree is consistent with the dimension of the screened characteristic variables.
In one embodiment, the predetermined condition is that the information gain ratio is zero; if the information gain rate of the feature combination variable does not meet the preset condition, the feature combination variable and the one-dimensional feature variable are used as new feature variables to be processed, and the new feature variables to be processed are subjected to information gain rate sequencing, principal component analysis processing, information quantity calculation and cross combination to obtain the feature combination variable, wherein the method comprises the following steps of:
if the information gain rate of the feature combination variable is not zero, taking the feature combination variable and the one-dimensional feature variable as new feature variables to be processed; increasing the depth of the last decision tree by one depth unit to obtain the depth of the current decision tree, and inputting the new feature variable to be processed into the current decision tree to calculate the information gain rate of the new feature variable to be processed; and carrying out information gain rate sequencing, principal component analysis processing, information quantity calculation and cross combination on the new feature variable to be processed to obtain a feature combined variable, wherein the dimension of the feature combined variable is consistent with the depth of the current decision tree.
A service data processing apparatus, the apparatus comprising:
the first characteristic acquisition module is used for acquiring characteristic variables of the business data from a database where the business data are located;
the second characteristic acquisition module is used for obtaining discrete characteristic variables according to the characteristic variables;
the first processing module is used for sequencing the information gain rate of the discrete characteristic variable, analyzing and processing the principal component and calculating the information quantity to obtain a one-dimensional characteristic variable;
the second processing module is used for taking the one-dimensional characteristic variable as a characteristic variable to be processed, and performing sequencing, principal component analysis processing and information quantity calculation on the information gain rate of the characteristic variable to be processed to obtain a screening characteristic variable;
the combination module is used for carrying out cross combination on the screened characteristic variables to obtain characteristic combination variables;
the second processing module is further configured to, if the information gain rate of the feature combination variable does not satisfy the preset condition, use the feature combination variable and the one-dimensional feature variable as a new feature variable to be processed, and perform information gain rate sorting, principal component analysis processing, information amount calculation, and cross-combination on the new feature variable to be processed to obtain a feature combination variable;
and the judging module is used for taking the characteristic combination variable as a target combination variable of the service data if the information gain rate of the characteristic combination variable meets a preset condition.
In one embodiment, the discrete characteristic variable is obtained according to the characteristic variable, and the second characteristic obtaining module is specifically configured to:
and when the characteristic variable is a continuous characteristic variable, converting the continuous characteristic variable into a discrete characteristic variable.
In one embodiment, the information gain ratios of the discrete characteristic variables are sorted, subjected to principal component analysis processing, and subjected to information amount calculation to obtain one-dimensional characteristic variables, and the first processing module is specifically configured to:
inputting the discrete characteristic variable into a first decision tree to calculate the information gain rate of the discrete characteristic variable, and sequencing the information gain rate to obtain a first sequencing characteristic variable; performing principal component analysis processing on the discrete characteristic variable to obtain a first key characteristic variable; and screening the first sequencing feature variable according to the first key feature variable and calculating the information quantity to obtain a one-dimensional feature variable, wherein the dimension of the one-dimensional feature variable is consistent with the depth of the first decision tree.
In one embodiment, the first processing module is specifically configured to perform screening and information amount calculation on the first sorted characteristic variables according to the first key characteristic variable to obtain one-dimensional characteristic variables, and is specifically configured to:
obtaining a feature variable ranked before the first key feature variable from the first ranking feature variable to obtain a first feature variable, wherein the first feature variable comprises the first key feature variable; acquiring the remaining characteristic variables except the first characteristic variable in the first sequence characteristic variables, wherein the ranking of the remaining characteristic variables is behind the first key characteristic variable; obtaining a second characteristic variable from the remaining characteristic variables; and taking the first characteristic variable and the second characteristic variable as one-dimensional characteristic variables.
In one embodiment, the first processing module is specifically configured to obtain a second feature variable from the remaining feature variables:
and comparing the information quantity of the residual characteristic variable with an information quantity threshold value, and taking the residual characteristic variable with the information quantity larger than the information quantity threshold value as a second characteristic variable.
In one embodiment, the one-dimensional characteristic variable is used as a characteristic variable to be processed, the information gain rates of the characteristic variables to be processed are sorted, principal component analysis processing is performed, and information amount calculation is performed to obtain a screening characteristic variable, and the second processing module is specifically configured to:
taking the one-dimensional characteristic variable as a characteristic variable to be processed, inputting the characteristic variable to be processed into a second decision tree to calculate the information gain rate of the characteristic variable to be processed, and sequencing the information gain rate to obtain a second sequencing characteristic variable; performing principal component analysis processing on the characteristic variable to be processed to obtain a second key characteristic variable; and screening the second sorting characteristic variable according to the second key characteristic variable and calculating the information quantity to obtain a screened characteristic variable, wherein the depth of the second decision tree is consistent with the dimension of the screened characteristic variable.
In one embodiment, the predetermined condition is that the information gain rate is zero; if the information gain rate of the feature combination variable does not meet the preset condition, taking the feature combination variable and the one-dimensional feature variable as new feature variables to be processed, and performing information gain rate sorting, principal component analysis processing, information quantity calculation and cross combination on the new feature variables to be processed to obtain the feature combination variable, wherein the second processing module is specifically configured to:
if the information gain rate of the feature combination variable is not zero, taking the feature combination variable and the one-dimensional feature variable as new feature variables to be processed; increasing the depth of the last decision tree by one depth unit to obtain the depth of the current decision tree, and inputting the new feature variable to be processed into the current decision tree to calculate the information gain rate of the new feature variable to be processed; and carrying out information gain rate sequencing, principal component analysis processing, information quantity calculation and cross combination on the new characteristic variable to be processed to obtain a characteristic combination variable, wherein the dimension of the characteristic combination variable is consistent with the depth of the current decision tree.
A computer device comprising a memory storing a computer program and a processor implementing a business data processing method as described in any one of the above when executed.
A computer-readable storage medium, on which a computer program is stored, which computer program, when being executed by a processor, realizes the service data processing method as set forth in any one of the above.
According to the business data processing method, the business data processing device, the computer equipment and the storage medium, the characteristic variable of the business data is obtained from the database where the business data is located; obtaining discrete characteristic variables according to the characteristic variables; sorting, principal component analysis processing and information quantity calculation are carried out on the information gain rate of the discrete characteristic variable to obtain a one-dimensional characteristic variable; taking the one-dimensional characteristic variable as a characteristic variable to be processed, and performing sequencing, principal component analysis processing and information quantity calculation on the information gain rate of the characteristic variable to be processed to obtain a screening characteristic variable; cross combination is carried out through the screened characteristic variables to obtain characteristic combination variables; if the information gain rate of the feature combination variable does not meet the preset condition, taking the feature combination variable and the one-dimensional feature variable as new feature variables to be processed, and performing information gain rate sequencing, principal component analysis processing, information quantity calculation and cross combination on the new feature variables to be processed to obtain the feature combination variable; and if the information gain rate of the characteristic combination variable meets a preset condition, taking the characteristic combination variable as a target combination variable of the service data. Therefore, derivative variables with low prediction value can be greatly reduced, the derivative efficiency is improved, and the function and effect of cross-derivative of features can be increased, so that the prediction capability and accuracy of a business model are improved.
Drawings
FIG. 1 is a schematic flow chart of a service data processing method in one embodiment;
FIG. 2 is a schematic flow chart of the step of obtaining one-dimensional feature variables in one embodiment;
FIG. 3 is a schematic flow chart of the step of obtaining one-dimensional feature variables in another embodiment;
FIG. 4 is a schematic flow chart of the step of obtaining filtered characteristic variables in one embodiment;
FIG. 5 is a flowchart illustrating the step of obtaining feature combination variables in one embodiment;
FIG. 6 is a block diagram of a business data processing apparatus in one embodiment;
FIG. 7 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein merely illustrate the present application and do not limit it.
The business data processing method provided by the application can be applied to an application environment in which a financial product marketing response model is built. The model is a binary classification model whose output is whether a customer will purchase the product: a response, generally denoted by 1, indicates that the product is purchased, and a non-response, generally denoted by 0, indicates that it is not. By building the financial product marketing model, the probability that a customer purchases the financial product can be predicted, and targeted marketing activities or methods can be directed at customers according to that probability so that more customers purchase the product. In the modeling process, the feature part comprises collecting data, describing data, exploring data, verifying data quality and preparing data; the modeling part comprises selecting a modeling technique, generating a test design, building the model and evaluating the model; and the model verification part comprises offline business verification, feedback and iteration.
The business data falls into four main feature groups: basic customer information, marketing feedback information, macro data and past marketing features. The basic customer information features comprise age, occupation, education level, marital status, default status, housing loan and personal loan; the marketing feedback information features comprise contact method, month of contact, day of the week of contact and call duration; the macro data features comprise employment change rate, consumer price index, consumer confidence index, interbank offered rate and number of employed persons; and the past marketing features comprise number of marketing contacts, interval since the previous marketing, number of past marketing campaigns, success or failure of past marketing, and the like.
Data are preprocessed before modeling. For example, categorical variables are one-hot encoded to meet algorithm requirements. Where the samples are imbalanced, the training set and the validation set can be built by stratified sampling, and the imbalance can be alleviated by oversampling the positive samples with the SMOTENC method, which improves the model's ability to distinguish positive samples from negative samples. The business model adopts an LR (Logistic Regression) model, and the evaluation result obtained from the model is recorded; other models, such as linear regression or an SVM (Support Vector Machine), may also be used. The collected business data can be evaluated with the LR model using evaluation indexes that include the AUC (Area Under the Curve, the area under the ROC curve) and the KS (Kolmogorov-Smirnov) value; for example, a KS value of 0.32 indicates that the model has good prediction accuracy. The evaluation indexes are also used to judge whether over-fitting or under-fitting exists.
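The two evaluation indexes named above can be computed directly from model scores. The sketch below is a plain-Python illustration rather than the evaluation code used in the application: `ks_statistic` takes the largest gap between the cumulative response rate and the cumulative non-response rate when samples are ranked by score, and `auc` uses the pairwise definition (the probability that a responder is scored above a non-responder). The function names and the six-sample data are hypothetical, and both classes are assumed to be present.

```python
def ks_statistic(labels, scores):
    """Largest gap between the cumulative response and non-response rates."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    ranked = sorted(zip(scores, labels), reverse=True)
    hit_pos = hit_neg = 0
    best = 0.0
    for index, (score, label) in enumerate(ranked):
        if label == 1:
            hit_pos += 1
        else:
            hit_neg += 1
        # Evaluate the gap only after all samples sharing this score are counted.
        last_of_tie = index + 1 == len(ranked) or ranked[index + 1][0] != score
        if last_of_tie:
            best = max(best, abs(hit_pos / n_pos - hit_neg / n_neg))
    return best


def auc(labels, scores):
    """Probability that a random responder outscores a random non-responder."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


labels = [1, 1, 0, 1, 0, 0]              # 1 = response, 0 = non-response
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # predicted purchase probabilities
print(ks_statistic(labels, scores), auc(labels, scores))
```

The pairwise AUC is quadratic in the sample count, which is acceptable for an illustration but not for large data sets.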
In one embodiment, as shown in fig. 1, a business data processing method is provided, and the method is applied to a computer device, and includes the following steps:
Step 102, acquiring the characteristic variables of the business data from the database in which the business data is located.
Specifically, the computer device processes the business data in the database to obtain the data wide table required for modeling, and obtains the characteristic variables of the business data from the data wide table. The data wide table is a database table in which the indexes, dimensions and attributes related to a business topic are joined together. For example, when a financial product marketing response model is built, business data is obtained from the database of the financial product and a data wide table is produced, and characteristic variables related to the financial product, such as basic customer information features, marketing feedback information features, macro data features and past marketing features, are obtained from the data wide table.
Step 104, obtaining discrete characteristic variables according to the characteristic variables.
A discrete characteristic variable takes its values from an enumerable discrete attribute; that is, its values are finite in number and can be listed one by one.
Specifically, the computer device obtains, from the data wide table, the characteristic variables of the business data that are discrete characteristic variables. For example, when a financial product marketing model is built, the default status in the basic customer information is a discrete characteristic variable with two attribute values: the customer has defaulted, or the customer has not defaulted.
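The embodiments state that a continuous characteristic variable is converted into a discrete one but do not fix a conversion method. As one possible illustration, assuming equal-width binning (a choice made here, not one named in the application), a continuous variable such as age could be discretized as follows. The function name and sample ages are hypothetical.

```python
def equal_width_bins(values, n_bins):
    """Map each continuous value to a bin index in 0..n_bins-1."""
    low, high = min(values), max(values)
    if high == low:
        return [0] * len(values)  # a constant variable falls into a single bin
    width = (high - low) / n_bins
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((value - low) / width), n_bins - 1) for value in values]


ages = [22, 35, 47, 58, 63, 71]
print(equal_width_bins(ages, n_bins=3))
```

Other schemes, such as equal-frequency or supervised binning, would slot into the same place.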
Step 106, performing information gain rate ranking, principal component analysis processing and information quantity calculation on the discrete characteristic variables to obtain one-dimensional characteristic variables.
The information gain ratio (rendered elsewhere in this application as the information gain rate) of a feature A with respect to a training data set D is defined as the ratio of the information gain of A to the entropy of D with respect to the values of A. The information gain ratio is used in the C4.5 algorithm, which extends and optimizes the ID3 algorithm. In decision tree classification, the information gain of a candidate split is the difference between the information entropy of the original data and the information entropy after the split is introduced. The information entropy of the data set is:
$$H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|} \log_2 \frac{|C_k|}{|D|}$$
The empirical conditional entropy H(D|A) of the data set D given feature A is:
$$H(D|A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \sum_{k=1}^{K} \frac{|D_{ik}|}{|D_i|} \log_2 \frac{|D_{ik}|}{|D_i|}$$
The information gain is:
$$\mathrm{Gain}(D,A) = H(D) - H(D|A)$$
The information gain ratio is:
$$\mathrm{GainRatio}(D,A) = \frac{\mathrm{Gain}(D,A)}{H_A(D)}$$
where $H_A(D)$, the entropy of the training data set D with respect to the values of feature A, is:
$$H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}$$
In the above formulas, D is the training data set and |D| is its sample size. There are K classes $C_k$ (k = 1, 2, ..., K), and $|C_k|$ is the number of samples in class $C_k$. Feature A takes n different values $\{a_1, a_2, \ldots, a_n\}$, which divide the training data set D into n subsets $D_1, D_2, \ldots, D_n$, where $|D_i|$ is the number of samples in $D_i$. The set of samples in $D_i$ that belong to class $C_k$ is denoted $D_{ik}$, and $|D_{ik}|$ is the number of samples in $D_{ik}$. $H_A(D)$ can also be called the intrinsic value of feature A, and n is the number of distinct values of feature A.
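The entropy, conditional entropy, gain and gain ratio defined above translate directly into code. The sketch below is a plain-Python illustration of those definitions, not the decision-tree implementation used in the application; the function names and the four-customer sample are hypothetical, and a zero split entropy (a feature with a single value) is assumed to give a ratio of 0.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Empirical entropy of a list of discrete values, in bits."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def gain_ratio(feature_values, labels):
    """Information gain of the feature divided by its split entropy H_A(D)."""
    total = len(labels)
    subsets = {}
    for value, label in zip(feature_values, labels):
        subsets.setdefault(value, []).append(label)
    # H(D|A): entropy of each subset D_i weighted by |D_i| / |D|.
    conditional = sum(len(s) / total * entropy(s) for s in subsets.values())
    gain = entropy(labels) - conditional
    split_entropy = entropy(feature_values)
    return gain / split_entropy if split_entropy > 0 else 0.0


responded = [1, 1, 0, 0]
features = {
    "has_default": ["no", "no", "yes", "yes"],      # separates the classes
    "contact_method": ["cell", "phone", "cell", "phone"],  # uninformative
}
ranking = sorted(features, key=lambda name: gain_ratio(features[name], responded),
                 reverse=True)
print(ranking)
```

Sorting by the returned ratio, as in the last lines, gives the kind of ranking that the screening steps consume.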
Principal Component Analysis (PCA) is used to reduce the number of characteristic variables to be analyzed while losing as little as possible of the information contained in the original indexes, so that the collected data can be analyzed comprehensively. Because the variables are correlated to some degree, closely related variables can be replaced by as few new variables as possible; that is, n characteristic variables are mapped to k characteristic variables, where the k variables are new, mutually orthogonal features (principal components) reconstructed from the original n features. For example, given an input data set X = {x_1, x_2, x_3, ..., x_n}, k characteristic variables can be obtained with the PCA algorithm as follows: the mean of each feature dimension is subtracted from that dimension (mean-centering); the covariance matrix is computed; the eigenvalues and eigenvectors of the covariance matrix are computed through SVD (Singular Value Decomposition); the eigenvalues are sorted in descending order and the largest k are selected; the k corresponding eigenvectors are used as column vectors to form an eigenvector matrix; and the data are then transformed into the new space spanned by the k eigenvectors.
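The PCA steps can be sketched for the simplest case k = 1. Note the substitution: to stay dependency-free, the sketch below finds the leading eigenvector of the covariance matrix by power iteration rather than by SVD as the description states; the two approaches agree on the first principal component when it is unique. The function names and sample rows are hypothetical, and the data are assumed to have non-zero variance.

```python
def mean_center(rows):
    """Subtract each column's mean from that column."""
    n = len(rows)
    means = [sum(column) / n for column in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def covariance_matrix(centered):
    """Sample covariance matrix of mean-centered rows."""
    n, d = len(centered), len(centered[0])
    return [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def top_eigenvector(matrix, iterations=200):
    """Leading eigenvector by power iteration (stand-in for the SVD step)."""
    d = len(matrix)
    vector = [1.0] * d  # assumed not orthogonal to the leading eigenvector
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in product) ** 0.5
        vector = [x / norm for x in product]
    return vector


def project(centered, vector):
    """Coordinates of each row along the given direction."""
    return [sum(x * c for x, c in zip(row, vector)) for row in centered]


rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
centered = mean_center(rows)
component = top_eigenvector(covariance_matrix(centered))
print(component, project(centered, component))
```

For k greater than 1, a library routine for eigendecomposition or SVD would normally replace the power iteration.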
For the Information Value (IV) calculation, the WOE (Weight of Evidence) is calculated first. The WOE of the $i$-th group is calculated as follows:
Figure BDA0003035343350000091
In the above formula,
Figure BDA0003035343350000092
is the proportion of responding customers in the group,
Figure BDA0003035343350000093
is the proportion of non-responding customers in the group, $y_i$ is the number of responding customers in the group, $n_i$ is the number of non-responding customers in the group, $y_T$ is the total number of responding customers in the whole sample, and $n_T$ is the total number of non-responding customers in the whole sample. After the WOE of the $i$-th group is obtained, the IV is calculated by the following formula:
Figure BDA0003035343350000094
When feature combinations with high predictive ability are screened out of the feature variables using the information gain ratio calculation, the principal component analysis calculation, and the information value calculation, the result of the principal component analysis processing carries the largest weight in the screening.
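The WOE and IV calculation can be sketched as follows. The formulas themselves appear only as figures in this text, so the sketch assumes the commonly used definitions $WOE_i = \ln\left(\frac{y_i / y_T}{n_i / n_T}\right)$ and $IV = \sum_i \left(\frac{y_i}{y_T} - \frac{n_i}{n_T}\right) WOE_i$; the function name `woe_iv` and its input layout are hypothetical.

```python
from math import log


def woe_iv(groups):
    """groups: list of (y_i, n_i) = (responding, non-responding) counts per group.

    Returns (list of WOE_i, IV) under the assumed standard definitions.
    """
    y_total = sum(y for y, _ in groups)
    n_total = sum(n for _, n in groups)
    woes, iv = [], 0.0
    for y_i, n_i in groups:
        p_y = y_i / y_total  # proportion of responding customers in the group
        p_n = n_i / n_total  # proportion of non-responding customers in the group
        # Undefined when a group has no responders or no non-responders;
        # a real pipeline would need smoothing or bin merging for that case.
        woe = log(p_y / p_n)
        woes.append(woe)
        iv += (p_y - p_n) * woe
    return woes, iv
```

A variable whose groups all have the same response proportion gets an IV of 0, and the IV grows as the groups separate responders from non-responders more sharply.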
Specifically, the information gain ratio is calculated for the discrete feature variables obtained from the data wide table, and the variables are ranked by information gain ratio; PCA processing is performed on the discrete feature variables; IV calculation is then performed according to the information gain ratio ranking result and the PCA result; and the one-dimensional feature variables are obtained from the information gain ratio ranking result, the PCA result, and the IV calculation result. For example, based on service data for marketing a financial product, the sample size is determined to be 10000 and the service data contain 100 feature variables. The feature variables include basic customer information (such as age, occupation, education level, marital status, default status, housing loan, and personal loan), marketing feedback information (such as contact method, month of contact, day of the week of contact, and call duration), macroeconomic data (employment change rate, consumer price index, consumer confidence index, and interbank offered rate), and past marketing characteristics (number of marketing contacts, interval since the previous marketing, number of past marketing contacts, and whether past marketing succeeded or failed). The number of responses (product purchased) and the number of non-responses (product not purchased) are counted for each feature. The information gain ratio of each feature variable is then calculated and the results are ranked; a PCA result is obtained for the feature variables through the PCA algorithm; IV calculation is performed by combining the information gain ratio ranking result with the PCA result; and finally the information gain ratio ranking result, the PCA result, and the IV calculation result are combined to obtain the one-dimensional feature variables, which may be the consumer price index, whether past marketing succeeded, housing loan, personal loan, default status, marital status, ..., customer age, and education level.
Step 108, taking the one-dimensional feature variables as feature variables to be processed, and performing information gain ratio ranking, principal component analysis processing, and information value calculation on the feature variables to be processed to obtain screened feature variables.
Specifically, the one-dimensional feature variables are taken as the feature variables to be processed; the information gain ratio is calculated for the feature variables to be processed and the variables are ranked by the results; PCA processing is performed on the feature variables to be processed; IV calculation is performed according to the information gain ratio ranking result and the PCA result; and the screened feature variables are obtained from the information gain ratio ranking result, the PCA result, and the IV calculation result. For example, the one-dimensional feature variables obtained are the consumer price index, whether past marketing succeeded, housing loan, personal loan, default status, marital status, ..., customer age, and education level (50 in total). The information gain ratios of these one-dimensional feature variables are calculated and ranked; PCA is performed on the 50 one-dimensional feature variables to obtain a PCA result; IV calculation is performed by combining the information gain ratio ranking result with the PCA result; and finally the information gain ratio ranking result, the PCA result, and the IV calculation result are combined to obtain the screened feature variables, which may be the consumer price index, whether past marketing succeeded, housing loan, ..., and the consumer confidence index.
Step 110, cross-combining the screened feature variables to obtain feature combination variables.
Cross combination may use methods such as intersection, union, complement, Cartesian product, and brute-force crossing among the feature variables.
Specifically, the feature combination variables are obtained by cross-combining the screened feature variables with one another. For example, there are 10 screened feature variables $X_1, X_2, X_3, \ldots, X_{10}$, corresponding respectively to the consumer price index, whether past marketing succeeded, housing loan, ..., and the consumer confidence index. The feature combination variables are then $[X_1, X_2]$, $[X_1, X_3]$, ..., $[X_9, X_{10}]$; that is, the consumer price index and whether past marketing succeeded are cross-combined into one feature combination variable, the consumer price index and housing loan are cross-combined into another, and so on. The feature combination variables obtained at this point are two-dimensional combined feature variables.
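The pairwise cross combination in the example above can be sketched as follows. The sketch only enumerates which variables are paired and builds one crossed value per sample as a tuple (a Cartesian-product style cross); the helper names `cross_combine` and `crossed_value` and the sample field names are hypothetical, and the other combination methods named above are not shown.

```python
from itertools import combinations


def cross_combine(variables, order=2):
    """Enumerate every combination of `order` distinct feature variables."""
    return [list(combo) for combo in combinations(variables, order)]


def crossed_value(sample, combo):
    """Value of a combined variable for one sample: the tuple of member values."""
    return tuple(sample[name] for name in combo)


# Ten screened feature variables X1 ... X10 give 45 two-dimensional combinations.
screened = [f"X{i}" for i in range(1, 11)]
pairs = cross_combine(screened)

# Hypothetical discretized sample with two of the variables filled in.
sample = {"X1": "high", "X2": "success"}
value = crossed_value(sample, pairs[0])
```

With 10 screened variables the number of two-dimensional combinations is $\binom{10}{2} = 45$, which is why screening before crossing keeps the number of derived variables manageable.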
Step 112, if the information gain ratio of the feature combination variables does not meet a preset condition, taking the feature combination variables and the one-dimensional feature variables as new feature variables to be processed, and performing information gain ratio ranking, principal component analysis processing, information value calculation, and cross combination on the new feature variables to be processed to obtain feature combination variables.
Specifically, the information gain ratio of the obtained feature combination variables is calculated. When the calculated information gain ratio does not meet the preset condition, the feature combination variables and the one-dimensional feature variables are taken as new feature variables to be processed; the new feature variables to be processed are ranked by information gain ratio; PCA processing is performed; IV calculation is performed according to the information gain ratio ranking result and the PCA result; and the feature combination variables are obtained from the information gain ratio ranking result, the PCA result, and the IV calculation result. For example, the feature combination variables are $[X_1, X_2]$, $[X_1, X_3]$, ..., $[X_9, X_{10}]$, where, for instance, $[X_1, X_2]$ is obtained by cross-combining the consumer price index and whether past marketing succeeded. The information gain ratio is calculated for each feature combination variable. When the calculated information gain ratio does not meet the preset condition, the feature combination variables and the one-dimensional feature variables are taken as new feature variables to be processed, the new feature variables to be processed are ranked by information gain ratio, PCA processing is performed, IV calculation is performed according to the information gain ratio ranking result and the PCA result, and the feature combination variables are obtained from the information gain ratio ranking result, the PCA result, and the IV calculation result. At this point the feature combination variables are $[X_1, X_2, X_3]$, $[X_2, X_3, X_4]$, ... (for example, $[X_1, X_2, X_3]$ is obtained by cross-combining the consumer price index, whether past marketing succeeded, and housing loan).
Step 114, if the information gain ratio of the feature combination variables meets the preset condition, taking the feature combination variables as target combination variables of the service data.
Specifically, the computer device calculates the information gain ratio of the obtained feature combination variables and, when the calculated information gain ratio meets the preset condition, takes the obtained feature combination variables as the target combination variables of the service data. For example, the information gain ratios of the feature combination variables $[X_1, X_2]$, $[X_1, X_3]$, ..., $[X_9, X_{10}]$ are calculated; when they meet the preset condition, these feature combination variables are taken as the target combination variables, namely the target derived variables, of the financial product marketing model being constructed. Here each target derived variable is obtained by cross-combining two one-dimensional feature variables. The target combination variables are then input into an LR model to obtain a model evaluation result, and the evaluation result is greatly improved.
In the above service data processing method, the feature variables of the service data are obtained from the database where the service data are located; discrete feature variables are obtained from the feature variables; information gain ratio ranking, principal component analysis processing, and information value calculation are performed on the discrete feature variables to obtain one-dimensional feature variables; the one-dimensional feature variables are taken as feature variables to be processed, and information gain ratio ranking, principal component analysis processing, and information value calculation are performed on them to obtain screened feature variables; the screened feature variables are cross-combined to obtain feature combination variables; if the information gain ratio of the feature combination variables does not meet the preset condition, the feature combination variables and the one-dimensional feature variables are taken as new feature variables to be processed, and information gain ratio ranking, principal component analysis processing, information value calculation, and cross combination are performed on the new feature variables to be processed to obtain feature combination variables; and if the information gain ratio of the feature combination variables meets the preset condition, the feature combination variables are taken as the target combination variables of the service data. In this way, derived variables with low predictive value are greatly reduced, derivation efficiency is improved, and the effect of feature cross-derivation is enhanced, which improves the predictive ability and accuracy of the business model.
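The iterative control flow summarized above (steps 108 to 114) can be outlined with a rough sketch. Everything here is an assumption made for illustration: the screening step (ranking, PCA, and IV) is passed in as a hypothetical `screen` callable, the scoring as a hypothetical `gain_ratio` callable, the preset condition is taken to be "every combination reaches a threshold", and `max_order` is an invented safety limit. The text does not specify any of these.

```python
from itertools import combinations


def derive_target_combinations(one_dim, screen, gain_ratio, threshold, max_order=4):
    """Repeat screening and cross combination until the assumed condition holds.

    one_dim:    names of the one-dimensional feature variables
    screen:     hypothetical callable, variables -> variables kept after screening
    gain_ratio: hypothetical callable, combination -> information gain ratio
    """
    # Every variable is represented as a tuple of base feature names.
    pool = [(name,) for name in one_dim]
    combos = []
    for order in range(2, max_order + 1):
        kept = screen(pool)
        # Cross combination: merge pairs whose union has exactly `order` base features.
        combos = sorted({
            tuple(sorted(set(a) | set(b)))
            for a, b in combinations(kept, 2)
            if len(set(a) | set(b)) == order
        })
        # Assumed preset condition: every combination reaches the threshold.
        if combos and all(gain_ratio(c) >= threshold for c in combos):
            return combos
        # Otherwise the combinations plus the one-dimensional variables are reprocessed.
        pool = [(name,) for name in one_dim] + combos
    return combos


# Toy usage with stand-in callables (not a real screening or scoring function).
result = derive_target_combinations(
    ["A", "B", "C"],
    screen=lambda variables: variables,
    gain_ratio=lambda combo: 0.1 * len(combo),
    threshold=0.25,
)
```

In the toy run the two-dimensional combinations score 0.2 and fail the threshold, so the loop proceeds to three-dimensional combinations, mirroring the move from $[X_1, X_2]$ to $[X_1, X_2, X_3]$ in the example of step 112.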
In one embodiment, when constructing a model with which an e-commerce platform recommends products according to user preferences, the feature variables of the service data are obtained from the database of the e-commerce platform, for example user gender, user age, user occupation, the time at which the user clicks into the e-commerce platform, the time spent browsing the e-commerce platform, click rate, page sharing rate, basic information on the commodities on a page, customer purchase history, and the consumer price index. The continuous feature variables obtained are then discretized into discrete feature variables. For example, the continuous feature variable click rate is discretized into three classes of discrete feature variables: a first click rate feature variable, a second click rate feature variable, and a third click rate feature variable. The first click rate feature variable is the feature variable for which the click rate is greater than or equal to a first click rate threshold, namely the high-frequency click rate feature variable; the second click rate feature variable is the feature variable for which the click rate is greater than a second click rate threshold and less than or equal to the first click rate threshold, namely the medium-frequency click rate feature variable; and the third click rate feature variable is the feature variable for which the click rate is greater than a third click rate threshold and less than or equal to the second click rate threshold, namely the low-frequency click rate feature variable. Information gain ratio ranking, principal component analysis processing, and information value calculation are then performed on all discrete feature variables through a decision tree of depth 1 to obtain one-dimensional feature variables. Here 100 one-dimensional feature variables are obtained, namely the customer's purchase history for class-A products, the click rate of class-A products, the class-B page sharing rate, ..., user gender, and the time at which the user clicks into the e-commerce platform. The one-dimensional feature variables are then taken as feature variables to be processed, and information gain ratio ranking, principal component analysis processing, and information value calculation are performed through a decision tree of depth 2 to obtain screened feature variables. Here 20 screened feature variables are obtained, which may be the customer's purchase history for class-A products, the click rate of class-A products, ..., and the consumer price index. The screened feature variables are then cross-combined with one another to obtain feature combination variables; for example, the purchase history for class-A products and the click rate of class-A products are cross-combined into a feature combination variable. If the information gain ratio of the feature combination variables does not meet the preset condition, the feature combination variables and the one-dimensional feature variables are taken as new feature variables to be processed, the depth of the previous decision tree is increased by one depth unit to obtain the depth of the current decision tree, and the new feature variables to be processed are input into the current decision tree for information gain ratio calculation, principal component analysis processing, and information value calculation to obtain feature combination variables. When the information gain ratio of the feature combination variables meets the preset condition, the feature combination variables are the target combination variables of the service data on users' use of the e-commerce platform. After the target combination variables are input into the constructed model, the effect of feature cross-derivation is enhanced, which improves the predictive ability and accuracy of the model with which the e-commerce platform recommends products according to user preferences.
In one embodiment, when constructing a model with which a video APP (Application) predicts customer preferences in order to recommend videos, the feature variables of video attributes may be obtained from the database of the APP. The video attributes may include user gender, user age, user occupation, user preference, the online time each time the user uses the APP, the number of times the user uses video each week, the number of times a video is commented on, the number of times a video is played, the number of times a video is clicked, the number of times a video is shared, and the like, where the videos are of different types, such as news and sports. The continuous feature variables are then converted into discrete feature variables. For example, the number of plays is divided into three classes of discrete feature variables: a first play feature variable, a second play feature variable, and a third play feature variable. The first play feature variable is the feature variable for which the number of plays is greater than or equal to a first play threshold, namely the high-frequency play feature variable; the second play feature variable is the feature variable for which the number of plays is greater than a second play threshold and less than or equal to the first play threshold, namely the medium-frequency play feature variable; and the third play feature variable is the feature variable for which the number of plays is greater than a third play threshold and less than or equal to the second play threshold, namely the low-frequency play feature variable. Information gain ratio calculation and ranking, principal component analysis processing, and information value calculation are then performed on all discrete feature variables through a decision tree of depth 1 to obtain one-dimensional feature variables, namely 30 one-dimensional feature variables such as the first play feature variable, the sports-sharing feature variable, the number of times the user uses video each week, the number of times the user comments on events, gender, and occupation. The one-dimensional feature variables are then taken as feature variables to be processed, and information gain ratio ranking, principal component analysis processing, and information value calculation are performed through a decision tree of depth 2 to obtain screened feature variables. Here 15 screened feature variables are obtained, which may be the first play feature variable, the sports-sharing feature variable, ..., the news-sharing feature variable, and user occupation. The screened feature variables are then cross-combined to obtain feature combination variables; for example, the first play feature variable and the sports-sharing feature variable are cross-combined into a feature combination variable. If the information gain ratio of the feature combination variables does not meet the preset condition, the feature combination variables and the one-dimensional feature variables are taken as new feature variables to be processed, the depth of the previous decision tree is increased by one depth unit to obtain the depth of the current decision tree, and the new feature variables to be processed are input into the current decision tree for information gain ratio calculation, principal component analysis processing, and information value calculation to obtain feature combination variables. When the information gain ratio of the feature combination variables meets the preset condition, the feature combination variables are the target combination variables of the service data on customers watching videos in the video APP. After the target combination variables are input into the constructed model, the effect of feature cross-derivation is enhanced, which improves the predictive ability and accuracy of the model with which the video APP predicts customer preferences in order to recommend videos.
In one embodiment, obtaining discrete feature variables from the feature variables comprises: when a feature variable is a continuous feature variable, converting the continuous feature variable into a discrete feature variable. A continuous feature variable can take infinitely many values. When a financial product marketing response model is established, an LR model is used. When the service data in the database are processed, the continuous feature variables (age, call duration, number of marketing contacts, interval since the previous marketing, number of past marketing contacts, employment change rate, consumer price index, consumer confidence index, interbank offered rate, and number of employed persons, all of which are numerical) can be converted into discrete feature variables by binning. Discretization can be carried out by a supervised method that uses target information to create bins or intervals. In supervised decision tree discretization, the feature variable to be discretized is fed as a single variable into a decision tree model and fitted against the target variable; information entropy is used as the criterion for selecting the optimal split, and the probability value finally returned is used as the bin category. For example, the continuous feature variable employment change rate is divided into three intervals by decision tree binning, namely a first employment change rate feature variable, a second employment change rate feature variable, and a third employment change rate feature variable. The first employment change rate feature variable is the feature variable for which the employment change rate is greater than or equal to a first employment change rate threshold, namely the high employment change rate feature variable; the second employment change rate feature variable is the feature variable for which the employment change rate is greater than a second employment change rate threshold and less than or equal to the first employment change rate threshold, namely the medium employment change rate feature variable; and the third employment change rate feature variable is the feature variable for which the employment change rate is greater than a third employment change rate threshold and less than or equal to the second employment change rate threshold, namely the low employment change rate feature variable. The continuous feature variable is thereby converted into a discrete feature variable. Discretizing with a decision tree consists of using the decision tree to identify the optimal split points that determine the bins or contiguous intervals: a decision tree of limited depth is trained on the variable to be discretized to predict the target, and the original variable values are then replaced with the probabilities returned by the tree. The probability is the same for all observations within a single interval, so replacing values with probabilities is equivalent to grouping the observations within the cut points determined by the decision tree.
In this embodiment, converting the continuous feature variables into discrete feature variables makes feature combination and calculation easier and makes the model easier to iterate.
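The idea of entropy-based supervised binning described above can be sketched with a depth-1 split search. This is a simplified stand-in for a fitted decision tree: it finds a single cut point that minimizes the weighted entropy of the target, whereas three intervals as in the example would need two cut points (a deeper tree). The function names `best_split` and `to_bin` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Empirical entropy of a list of target labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def best_split(values, targets):
    """Cut point minimizing the weighted target entropy of the two sides."""
    pairs = sorted(zip(values, targets))
    best_cut, best_score = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no cut point between equal values
        cut = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if score < best_score:
            best_cut, best_score = cut, score
    return best_cut


def to_bin(value, cuts):
    """Index of the interval a value falls into, given sorted cut points."""
    return sum(value > cut for cut in cuts)
```

Values equal to a cut point fall into the lower interval, which matches the "greater than ... and less than or equal to ..." wording used for the medium and low intervals.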
In one embodiment, as shown in Fig. 2, performing information gain ratio ranking, principal component analysis processing, and information value calculation on the discrete feature variables to obtain the one-dimensional feature variables includes:
Step 202, inputting the discrete feature variables into a first decision tree to calculate the information gain ratios of the discrete feature variables, and ranking by information gain ratio to obtain first ranked feature variables.
A decision tree is a tree structure in which each internal node represents a test on an attribute, each branch represents an outcome of the test, and each leaf node represents a class. High-quality features can be screened out by letting a decision tree of a given depth split the corresponding number of times; in building the tree model, the features on the sub-paths above the leaf nodes are selected for calculation to obtain the corresponding feature variables. A decision tree can also be used to evaluate the risk of a project and judge its feasibility.
Specifically, the discrete feature variables are input into a decision tree of depth 1 to calculate their information gain ratios, one feature split is made according to the information gain ratio, and the first ranked feature variables are obtained by ranking by information gain ratio. For example, after the continuous feature variables among the 100 feature variables (such as age, occupation, and education level) in the marketing response model of a financial product are converted into discrete feature variables, the information gain ratio is calculated for the 100 discrete feature variables and the variables are ranked by information gain ratio to obtain the first ranked feature variables $X_1, X_2, X_3, \ldots, X_{100}$, which correspond to the consumer price index, whether past marketing succeeded, housing loan, ..., and contact method.
Step 204, performing principal component analysis processing on the discrete feature variables to obtain first key feature variables.
Specifically, principal component analysis processing is performed on the discrete feature variables in the business model to obtain the first key feature variables. For example, principal component analysis processing is performed on the 100 discrete feature variables in the marketing response model of a financial product to obtain 9 first key feature variables, namely $X_1, X_3, X_4, X_7, X_8, X_{13}, X_{25}, X_{28}, X_{47}$, which correspond to the consumer price index, housing loan, personal loan, employment change rate, and the like. Principal component analysis amounts to reducing the number of feature variables, but it yields features that cover most of the feature information and avoids the problem of sparse features.
Step 206, screening the first ranked feature variables according to the first key feature variables and performing information value calculation to obtain the one-dimensional feature variables, wherein the dimension of the one-dimensional feature variables is consistent with the depth of the first decision tree.
Specifically, the first ranked feature variables are screened according to the first key feature variables obtained by principal component analysis, and information value calculation is performed on the screened feature variables to obtain the one-dimensional feature variables, whose dimension is consistent with the depth of the first decision tree. For example, based on the service data of a financial product, the above steps give the first ranked feature variables (namely $X_1, X_2, X_3, \ldots, X_{100}$) and the first key feature variables (namely $X_1, X_3, X_4, X_7, X_8, X_{13}, X_{25}, X_{28}, X_{47}$). As described above, the first ranked feature variables are the consumer price index, whether past marketing succeeded, housing loan, ..., and contact method, and the first key feature variables are the consumer price index, housing loan, personal loan, employment change rate, occupation, and the like. The first ranked feature variables are screened according to the first key feature variables and information value calculation is performed to obtain 50 one-dimensional feature variables, namely $X_1, X_2, X_3, \ldots, X_{47}, X_{53}, X_{49}, X_{70}$. Screening and information value calculation based on the first key feature variables and the first ranked feature variables thus reduce the number of feature variables, which greatly shortens the running time of the subsequent cross combination and improves combination efficiency.
In this embodiment, the information gain ratios of the discrete feature variables are calculated and ranked through the first decision tree to obtain the first ranked feature variables, and the first key feature variables are obtained through principal component analysis processing. Because the number of first key feature variables after principal component analysis processing is greatly reduced, the problem of sparse features is avoided. The first ranked feature variables are then screened according to the first key feature variables, which are fewer in number but cover most of the feature information, and information value calculation is performed to obtain the one-dimensional feature variables, which shortens the running time of the subsequent cross combination and improves its efficiency.
In one embodiment, as shown in Fig. 3, screening the first ranked feature variables according to the first key feature variables and performing information value calculation to obtain the one-dimensional feature variables includes:
Step 302, obtaining, from the first ranked feature variables, the feature variables ranked before the first key feature variables to obtain first feature variables, wherein the first feature variables include the first key feature variables.
Specifically, following the order of the first ranked feature variables, the feature variables ranked before the first key feature variables are retained from the first ranked feature variables to obtain the first feature variables, which include the first key feature variables. For example, based on the service data of a financial product, the information gain ratio is calculated for the discrete feature variables and the variables are ranked to obtain the first ranked feature variables $X_1, X_2, X_3, \ldots, X_{100}$, and the first key feature variables obtained by principal component analysis are the consumer price index, housing loan, personal loan, employment change rate, occupation, and the like (namely $X_1, X_3, X_4, X_7, X_8, X_{13}, X_{25}, X_{28}, X_{47}$). Based on the first key feature variables, the 9 feature variables in the first ranked feature variables that coincide with the first key feature variables are retained, together with the feature variables ranked before them; that is, the last-ranked first key feature variable is the marketing count feature variable $X_{47}$, and all feature variables ranked before $X_{47}$ are retained. The first feature variables finally obtained are $X_1, X_2, X_3, X_4, X_5, X_6, X_7, \ldots, X_{47}$.
Step 304, obtaining the remaining feature variables other than the first feature variables from the first ranking feature variables, where the remaining feature variables are ranked after the first key feature variable.
Specifically, based on the first ranking feature variables and the obtained first feature variables, the remaining feature variables other than the first feature variables are obtained from the first ranking feature variables; these remaining feature variables are ranked after the first key feature variable. For example, from the first ranking feature variables obtained from the business data of a financial product (namely X1, X2, X3, …, X100) and the obtained first feature variables (namely X1, X2, X3, X4, X5, X6, X7, …, X47), the remaining feature variables are obtained, namely X48, X49, X50, …, X100.
Step 306, obtaining second feature variables from the remaining feature variables.
Specifically, the second feature variables are obtained from the obtained remaining feature variables. For example, from the remaining feature variables obtained from the business data of a financial product (namely X48, X49, X50, …, X100), the second feature variables are obtained; they are the number of past marketing contacts, customer age and education level, namely X53, X49, X70.
Step 308, taking the first feature variables and the second feature variables as the one-dimensional feature variables.
Specifically, the obtained first feature variables and second feature variables are merged to serve as the one-dimensional feature variables. For example, the first feature variables (namely X1, X2, X3, X4, X5, X6, X7, …, X47) and the second feature variables (namely X53, X49, X70) obtained from the business data of a financial product are taken together as the one-dimensional feature variables. Using the feature variables named above, the one-dimensional feature variables are the consumer price index, whether past marketing was successful, housing loan, personal loan, default, marital status, …, customer age and education level, namely X1, X2, X3, …, X47, X53, X49, X70.
In this embodiment, the feature variables ranked before the first key feature variable are obtained from the first ranking feature variables to obtain the first feature variables; the remaining feature variables other than the first feature variables, which are ranked after the first key feature variable, are then obtained from the first ranking feature variables; second feature variables are obtained from the remaining feature variables; and finally the first feature variables and the second feature variables are taken as the one-dimensional feature variables. High-quality one-dimensional feature variables can thus be obtained, which facilitates subsequent cross-combination into feature combinations with strong predictive ability and improves the predictive ability and accuracy of the business model.
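Steps 302 to 308 can be illustrated with a minimal sketch. The function name `select_one_dimensional` and its parameters are hypothetical and do not appear in the original text; the sketch assumes the ranking, the key feature variables and the second feature variables have already been computed elsewhere, and it only reproduces the splitting and merging logic of the example.

```python
def select_one_dimensional(ranked, key_features, second_features):
    """Split a ranked feature list as in steps 302-308 (illustrative only).

    ranked:          feature names ordered by information gain rate, best first
    key_features:    names selected by principal component analysis
    second_features: names picked from the remainder (e.g. by an IV threshold)
    """
    # Step 302: keep everything up to and including the lowest-ranked key feature.
    cutoff = max(ranked.index(name) for name in key_features)
    first = ranked[:cutoff + 1]
    # Step 304: the remaining features are ranked after that key feature.
    remaining = ranked[cutoff + 1:]
    # Step 306: second features must come from the remaining features.
    second = [name for name in second_features if name in remaining]
    # Step 308: first and second features together are the one-dimensional features.
    return first, remaining, first + second


ranked = [f"X{i}" for i in range(1, 101)]
key = {"X1", "X3", "X4", "X7", "X8", "X13", "X25", "X28", "X47"}
first, remaining, one_dim = select_one_dimensional(ranked, key, ["X53", "X49", "X70"])
print(len(first), len(remaining), len(one_dim))  # 47 53 50
```

With the values from the example, this yields the 47 first feature variables, the 53 remaining feature variables and the 50 one-dimensional feature variables.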
In one embodiment, obtaining the second feature variables from the remaining feature variables includes: comparing the information amount of each remaining feature variable with an information amount threshold, and taking the remaining feature variables whose information amount is greater than the threshold as the second feature variables. The information amount (IV value) measures the predictive ability of a variable, as shown in Table 1:
Table 1. Relationship between information amount intervals and predictive ability
(Table 1 is provided as image BDA0003035343350000181 in the original publication.)
As the relationship between information amount and predictive ability in Table 1 shows, when the information amount is less than or equal to 0.02 the feature variable has no predictive ability and needs to be discarded; when a feature variable ranks low by information gain rate but its information amount is between 0.3 and 0.5, it is retained, because feature variables whose information amount lies between 0.3 and 0.5 are emphasized. For example, for the remaining feature variables obtained from the business data of a financial product (namely X48, X49, X50, …, X100), the information amount of each remaining feature variable is calculated according to the information amount formula and compared with the information amount threshold of 0.02. Suppose the calculated information amounts of the number of past marketing contacts, customer age and education level (namely X53, X49, X70) are greater than 0.02. Although X53, X49 and X70 rank low by information gain rate, their information amounts are all between 0.3 and 0.5, and the IV value of X53 is greater than that of X49. Because feature variables with an information amount between 0.3 and 0.5 are emphasized in the actual screening process, X53, X49 and X70 are retained. The information amounts of the other remaining feature variables are less than 0.02, that is, they have no predictive ability. Finally, X53, X49 and X70 are taken as the second feature variables.
In this embodiment, feature variables with predictive ability are screened out by comparing the information amount of the remaining feature variables with the information amount threshold, so that the predictive ability and accuracy of the model can be improved.
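The threshold screening can be sketched as follows. The text does not reproduce the information amount formula, so this sketch assumes the commonly used information value definition, IV = Σ (p_i − q_i) · ln(p_i / q_i), where p_i and q_i are the shares of positive and negative samples falling in category i. The function names, the `eps` smoothing constant and the sample IV values are hypothetical, and both classes are assumed to be present in the labels.

```python
import math
from collections import Counter


def information_value(values, labels, eps=1e-6):
    """IV of one discrete feature against binary labels (1 = positive, 0 = negative)."""
    pos_total = sum(1 for y in labels if y == 1)
    neg_total = len(labels) - pos_total
    pos = Counter(v for v, y in zip(values, labels) if y == 1)
    neg = Counter(v for v, y in zip(values, labels) if y == 0)
    iv = 0.0
    for category in set(values):
        # Clip shares away from zero so the logarithm stays defined.
        p = max(pos[category] / pos_total, eps)
        q = max(neg[category] / neg_total, eps)
        iv += (p - q) * math.log(p / q)
    return iv


def screen_by_iv(iv_by_feature, threshold=0.02):
    """Keep the features whose IV is strictly greater than the threshold."""
    return [name for name, iv in iv_by_feature.items() if iv > threshold]


# Hypothetical IV values chosen only to mirror the example in the text.
remaining_iv = {"X48": 0.01, "X49": 0.35, "X53": 0.40, "X70": 0.31}
print(screen_by_iv(remaining_iv))  # ['X49', 'X53', 'X70']
```

A feature whose category distribution is identical for positive and negative samples gets an IV of zero under this definition and is therefore dropped by the 0.02 threshold.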
In one embodiment, as shown in fig. 4, taking the one-dimensional feature variable as a feature variable to be processed, sorting the information gain rate of the feature variable to be processed, performing principal component analysis processing, and calculating the information amount to obtain a filtered feature variable, includes:
Step 402, taking the one-dimensional feature variables as the feature variables to be processed, inputting the feature variables to be processed into a second decision tree to calculate their information gain rates, and sorting the information gain rates to obtain second ranking feature variables.
Specifically, the one-dimensional feature variables are taken as the feature variables to be processed and input into a second decision tree with a depth of 2 to calculate their information gain rates; two splits are performed according to the information gain rate, and the information gain rates are then sorted to obtain the second ranking feature variables. For example, the business data of a financial product is processed to obtain the one-dimensional feature variables, namely X1, X2, X3, …, X47, X53, X49, X70, which comprise 50 feature variables. The one-dimensional feature variables are input, as the feature variables to be processed, into a decision tree with a depth of 2 to calculate the information gain rate, and the second ranking feature variables are obtained by sorting according to the information gain rate. Using the feature variables named above, the second ranking feature variables may be the consumer price index, whether past marketing was successful, housing loan, …, number of marketing contacts, customer age, number of past marketing contacts and education level, namely X1, X2, X3, …, X47, X49, X53, X70.
Step 404, performing principal component analysis on the feature variables to be processed to obtain second key feature variables.
Specifically, principal component analysis is performed on the one-dimensional feature variables, taken as the feature variables to be processed, to obtain the second key feature variables. For example, principal component analysis is performed on the one-dimensional feature variables obtained from the business data of a financial product (namely X1, X2, X3, …, X47, X53, X49, X70) to obtain 2 second key feature variables. That is, the number of feature variables is reduced by principal component analysis, and the second key feature variables obtained are the employment change rate and the interbank lending rate, namely X7 and X9. Although the number of feature variables is reduced again, the feature variables obtained by principal component analysis retain most of the feature information, and the problem of feature sparseness can be avoided.
Step 406, screening the second ranking feature variable according to the second key feature variable and calculating an information amount to obtain a screened feature variable, where the depth of the second decision tree is consistent with the dimension of the screened feature variable.
Specifically, the second ranking feature variables are screened according to the second key feature variables obtained by principal component analysis, and the information amount is calculated for the screened feature variables to obtain the screened feature variables, where the dimension of the two-dimensional combined feature variables is consistent with the depth of the second decision tree. For example, based on the second ranking feature variables (namely X1, X2, X3, …, X47, X49, X53, X70) and the second key feature variables (namely X7, X9) obtained from the business data of a financial product through the above steps, the two feature variables in the second ranking feature variables that coincide with the second key feature variables are retained, together with the feature variables ranked before them; that is, all feature variables ranked before X9 are retained. The information amount is then calculated for the feature variables ranked after X9 in the second ranking feature variables. If the information amount of X10 is between 0.3 and 0.5 and the information amounts of the feature variables after X10 are less than 0.02, X10 is retained, finally yielding 10 screened feature variables: the consumer price index, whether past marketing was successful, housing loan, …, consumer confidence index, namely X1, X2, X3, …, X10.
In this embodiment, a one-dimensional feature variable is used as a feature variable to be processed and input into a decision tree with a depth of 2 to calculate an information gain rate of the feature variable to be processed, the information gain rate is sorted to obtain a second sorted feature variable, then the feature variable to be processed is subjected to principal component analysis processing to obtain a second key feature variable, and finally the second sorted feature variable is screened according to the second key feature variable and the information amount is calculated to obtain a screened feature variable, so that a better feature variable can be obtained, a feature combination with strong prediction capability can be obtained through subsequent cross-combination, and the prediction capability and accuracy of a service model are improved.
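The ranking part of this embodiment can be illustrated with a short sketch. The text does not state how the decision tree computes the information gain rate, so the C4.5-style gain ratio (information gain divided by split information) is assumed here. The functions `entropy`, `gain_ratio` and `rank_by_gain_ratio` are hypothetical names, and the sketch scores each discrete feature on its own rather than building a full tree of depth 2.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy (base 2) of a sequence of discrete values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of splitting on `values`, divided by the split information."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - conditional
    split_info = entropy(values)
    # A constant feature carries no split information; treat its ratio as zero.
    return gain / split_info if split_info > 0 else 0.0


def rank_by_gain_ratio(columns, labels):
    """Return feature names sorted by gain ratio, highest first."""
    return sorted(columns, key=lambda name: gain_ratio(columns[name], labels), reverse=True)


labels = [0, 0, 1, 1]
columns = {"noise": [0, 1, 0, 1], "signal": [0, 0, 1, 1]}
print(rank_by_gain_ratio(columns, labels))  # ['signal', 'noise']
```

In this toy data the feature `signal` separates the labels perfectly and `noise` carries no information, so `signal` is ranked first.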
In one embodiment, as shown in fig. 5, the preset condition is that the information gain ratio is zero; if the information gain rate of the feature combination variable does not meet the preset condition, the feature combination variable and the one-dimensional feature variable are used as new feature variables to be processed, and the new feature variables to be processed are subjected to information gain rate sequencing, principal component analysis processing, information quantity calculation and cross combination to obtain the feature combination variable, wherein the method comprises the following steps of:
Step 502, if the information gain rate of the feature combination variables is not zero, taking the feature combination variables and the one-dimensional feature variables as new feature variables to be processed.
Specifically, the information gain rate of the feature combination variables is calculated; if it is not zero, that is, it does not meet the preset condition, the feature combination variables and the one-dimensional feature variables are taken as new feature variables to be processed. For example, based on the business data of a financial product, pairwise cross-combination is performed (for instance, the consumer price index X1 and whether past marketing was successful X2 are cross-combined to obtain the combined variable [X1, X2]), giving the feature combination variables [X1, X2], [X1, X3], …, [X9, X10]. The one-dimensional feature variables are X1, X2, X3, …, X47, X53, X49, X70. The feature combination variables and the one-dimensional feature variables are taken together as the new feature variables to be processed, namely [X1, X2], [X1, X3], …, [X9, X10], X1, X2, X3, …, X47, X53, X49, X70.
Step 504, adding a depth unit to the depth of the last decision tree to obtain the depth of the current decision tree, and inputting the new feature variable to be processed into the current decision tree to calculate the information gain rate of the new feature variable to be processed.
Specifically, the depth of the current decision tree is obtained by increasing the depth of the previous decision tree by one depth unit; that is, the depth of the current decision tree exceeds that of the previous decision tree by 1. The new feature variables to be processed are input into the current decision tree to calculate their information gain rates. For example, the new feature variables to be processed obtained from the business data of a financial product (namely [X1, X2], [X1, X3], …, [X9, X10], X1, X2, X3, …, X47, X53, X49, X70) are input into the third decision tree to calculate the information gain rate.
Step 506, performing information gain rate sorting, principal component analysis processing, information quantity calculation and cross combination on the new feature variable to be processed to obtain a feature combined variable, wherein the feature combined variable dimension is consistent with the depth of the current decision tree.
Specifically, the new feature variables to be processed are input into a decision tree of the corresponding depth to calculate the information gain rate, and information gain rate sorting, principal component analysis, information amount calculation and cross-combination are performed to obtain feature combination variables of the corresponding dimension; that is, the dimension of the feature combination variables is consistent with the depth of the current decision tree. For example, the new feature variables to be processed, [X1, X2], [X1, X3], …, [X9, X10], X1, X2, X3, …, X47, X53, X49, X70, are subjected to information gain rate sorting, principal component analysis, information amount calculation and cross-combination to obtain three-dimensional combined feature variables, namely [X1, X2, X3], [X2, X3, X4], …. Using the feature variables named above, X1, X2 and X3 represent the consumer price index, whether past marketing was successful, and housing loan respectively, and the combined feature variable [X1, X2, X3] represents the cross-combination of the feature variables X1, X2 and X3.
In this embodiment, when the information gain rate of the feature combination variables is not zero, the feature combination variables and the one-dimensional feature variables are first taken as new feature variables to be processed; the depth of the previous decision tree is then increased by one depth unit to obtain the depth of the current decision tree, and the new feature variables to be processed are input into the current decision tree to calculate their information gain rates; finally, information gain rate sorting, principal component analysis, information amount calculation and cross-combination are performed on the new feature variables to be processed to obtain feature combination variables. Feature cross-combinations are iterated in this way to derive all high-quality feature combinations. The KS value obtained in the LR model with all derived high-quality feature combinations is 0.43. Compared with the KS value obtained when the financial product response model is built with the LR model without derived feature combinations, the derived high-quality feature combinations give a better fit and raise the value of the model evaluation index, thereby improving the predictive ability and accuracy of the business model.
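The control flow of steps 502 to 506 can be sketched as a loop. This is only an outline under several assumptions that are not in the original text: the callables `screen` (standing in for ranking, principal component analysis and information amount screening at a given tree depth) and `gain_rate` are hypothetical placeholders supplied by the caller, the stopping rule is read as "every combination has an information gain rate of zero", and `max_depth` is an added safety bound.

```python
from itertools import combinations


def as_tuple(item):
    """Treat a single feature name as a 1-tuple so it can be merged with combinations."""
    return item if isinstance(item, tuple) else (item,)


def cross_combine(candidates, depth):
    """Merge candidates pairwise and keep the merged sets that have exactly `depth` features."""
    combos = set()
    for a, b in combinations(candidates, 2):
        merged = tuple(sorted(set(as_tuple(a)) | set(as_tuple(b))))
        if len(merged) == depth:
            combos.add(merged)
    return sorted(combos)


def derive_combinations(one_dim, screen, gain_rate, max_depth=5):
    """Iterate screening and cross-combination, adding one unit of depth per round."""
    to_process = list(one_dim)
    depth = 2
    combos = []
    while depth <= max_depth:
        screened = screen(to_process, depth)
        combos = cross_combine(screened, depth)
        # Assumed preset condition: stop once no combination has a nonzero gain rate.
        if all(gain_rate(c) == 0 for c in combos):
            break
        # Step 502: combinations plus one-dimensional features are processed again.
        to_process = combos + list(one_dim)
        # Step 504: the next decision tree is one depth unit deeper.
        depth += 1
    return combos, depth


# Toy placeholders: keep every candidate, and pretend three-way combinations add nothing.
result, depth = derive_combinations(
    ["X1", "X2", "X3"],
    screen=lambda candidates, d: candidates,
    gain_rate=lambda combo: 0.0 if len(combo) >= 3 else 0.1,
)
print(result, depth)  # [('X1', 'X2', 'X3')] 3
```

In the toy run, the pairwise combinations have a nonzero gain rate, so they are fed back together with the one-dimensional features; the three-dimensional combination then meets the assumed stopping condition and is returned.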
It should be understood that although the various steps in the flow charts of fig. 1-5 are shown in order as indicated by the arrows, the steps are not necessarily performed in order as indicated by the arrows. The steps are not limited to being performed in the exact order illustrated and, unless explicitly stated herein, may be performed in other orders. Moreover, at least some of the steps in fig. 1-5 may include multiple steps or multiple stages, which are not necessarily performed at the same time, but may be performed at different times, which are not necessarily performed in sequence, but may be performed in turn or alternately with other steps or at least some of the other steps.
In one embodiment, as shown in fig. 6, there is provided a service data processing apparatus 600, including: a first feature acquisition module 602, a second feature acquisition module 604, a first processing module 606, a second processing module 608, a combining module 610, and a decision module 612, wherein:
The first feature obtaining module 602 is configured to obtain the feature variables of the service data from the database in which the service data is located.
A second feature obtaining module 604, configured to obtain a discrete feature variable according to the feature variable.
The first processing module 606 is configured to sequence the information gain ratios of the discrete characteristic variables, analyze principal components, and calculate information amounts to obtain one-dimensional characteristic variables.
The second processing module 608 is configured to use the one-dimensional feature variable as a feature variable to be processed, and perform sorting, principal component analysis processing, and information amount calculation on the information gain rate of the feature variable to be processed to obtain a filtered feature variable.
The combining module 610 is configured to cross-combine the screened feature variables to obtain feature combination variables.
The second processing module 608 is further configured to, if the information gain rate of the feature combination variable does not satisfy the preset condition, use the feature combination variable and the one-dimensional feature variable as a new feature variable to be processed, and perform information gain rate sorting, principal component analysis processing, information amount calculation, and cross-combining on the new feature variable to be processed to obtain the feature combination variable.
The determining module 612 is configured to, if the information gain ratio of the feature combination variable satisfies a preset condition, use the feature combination variable as a target combination variable of the service data.
In an embodiment, the second feature obtaining module 604 is specifically configured to convert the continuous feature variable into a discrete feature variable when the feature variable is the continuous feature variable.
In an embodiment, the first processing module 606 is specifically configured to input the discrete feature variables into a first decision tree to calculate information gain rates of the discrete feature variables, and sort the information gain rates to obtain first sorted feature variables; performing principal component analysis processing on the discrete characteristic variable to obtain a first key characteristic variable; and screening the first sequencing feature variable according to the first key feature variable and calculating the information quantity to obtain a one-dimensional feature variable, wherein the dimension of the one-dimensional feature variable is consistent with the depth of the first decision tree.
In an embodiment, the first processing module 606 is specifically configured to obtain, from the first ranking feature variables, feature variables ranked before the first key feature variable, to obtain first feature variables, where the first feature variables include the first key feature variable; obtain the remaining feature variables other than the first feature variables from the first ranking feature variables, where the remaining feature variables are ranked after the first key feature variable; obtain second feature variables from the remaining feature variables; and take the first feature variables and the second feature variables as the one-dimensional feature variables.
In an embodiment, the first processing module 606 is specifically configured to compare the information amount of the remaining characteristic variable with an information amount threshold, and use the remaining characteristic variable with the information amount greater than the information amount threshold as the second characteristic variable.
In an embodiment, the second processing module 608 is specifically configured to use the one-dimensional feature variable as a feature variable to be processed, input the feature variable to be processed into a second decision tree, calculate an information gain rate of the feature variable to be processed, and rank the information gain rate to obtain a second ranked feature variable; performing principal component analysis processing on the characteristic variable to be processed to obtain a second key characteristic variable; and screening the second sorting characteristic variables and calculating the information quantity according to the second key characteristic variables to obtain screened characteristic variables, wherein the depth of the second decision tree is consistent with the dimension of the screened characteristic variables.
In an embodiment, the second processing module 608 is specifically configured to, if the information gain ratio of the feature combination variable is not zero, take the feature combination variable and the one-dimensional feature variable as new feature variables to be processed; increasing the depth of the last decision tree by one depth unit to obtain the depth of the current decision tree, and inputting the new feature variable to be processed into the current decision tree to calculate the information gain rate of the new feature variable to be processed; and carrying out information gain rate sequencing, principal component analysis processing, information quantity calculation and cross combination on the new characteristic variable to be processed to obtain a characteristic combination variable, wherein the dimension of the characteristic combination variable is consistent with the depth of the current decision tree.
For specific limitations of the service data processing apparatus, reference may be made to the above limitations on the service data processing method, which are not repeated here. The modules in the service data processing apparatus may be implemented wholly or partially by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 7. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing the business data processing data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a business data processing method.
It will be appreciated by those skilled in the art that the configuration shown in fig. 7 is a block diagram of only a portion of the configuration associated with the present application, and is not intended to limit the computing device to which the present application may be applied, and that a particular computing device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the above-described method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware. The computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database or other medium used in the embodiments provided herein can include at least one of non-volatile and volatile memory. Non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, or the like. Volatile memory can include Random Access Memory (RAM) or external cache memory. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others.
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above embodiments express only several implementations of the present application, and their description is specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A method for processing service data, the method comprising:
acquiring characteristic variables of the business data from a database of the business data;
obtaining discrete characteristic variables according to the characteristic variables;
sequencing the information gain rate of the discrete characteristic variables, analyzing and processing the principal components and calculating the information quantity to obtain one-dimensional characteristic variables;
taking the one-dimensional characteristic variable as a characteristic variable to be processed, and performing sequencing, principal component analysis processing and information quantity calculation on the information gain rate of the characteristic variable to be processed to obtain a screening characteristic variable;
cross-combining the screened characteristic variables to obtain characteristic combination variables;
if the information gain rate of the feature combination variables does not meet the preset condition, taking the feature combination variables and the one-dimensional feature variables as new feature variables to be processed, and performing information gain rate sequencing, principal component analysis processing, information quantity calculation and cross combination on the new feature variables to be processed to obtain feature combination variables;
and if the information gain rate of the characteristic combination variable meets a preset condition, taking the characteristic combination variable as a target combination variable of the service data.
2. The method according to claim 1, wherein the deriving a discrete feature variable from the feature variable comprises:
and when the characteristic variable is a continuous characteristic variable, converting the continuous characteristic variable into a discrete characteristic variable.
3. The method according to claim 1, wherein the sorting, principal component analysis processing and information quantity calculation of the information gain rate of the discrete characteristic variables to obtain one-dimensional characteristic variables comprises:
inputting the discrete characteristic variables into a first decision tree to calculate the information gain rate of the discrete characteristic variables, and sequencing the information gain rate to obtain first sequencing characteristic variables;
performing principal component analysis processing on the discrete characteristic variables to obtain first key characteristic variables;
and screening the first sequencing feature variables and calculating the information quantity according to the first key feature variables to obtain one-dimensional feature variables, wherein the dimensionality of the one-dimensional feature variables is consistent with the depth of the first decision tree.
4. The method according to claim 3, wherein the screening and information amount calculation of the first ranking feature variables according to the first key feature variable to obtain one-dimensional feature variables comprises:
acquiring feature variables ranked before the first key feature variable from the first ranking feature variables to obtain first feature variables, wherein the first feature variables comprise the first key feature variables;
acquiring the remaining characteristic variables except the first characteristic variable in the first sequencing characteristic variables, wherein the ranking of the remaining characteristic variables is behind the first key characteristic variable;
obtaining a second characteristic variable from the residual characteristic variables;
and taking the first characteristic variable and the second characteristic variable as one-dimensional characteristic variables.
5. The method according to claim 4, wherein obtaining second feature variables from the remaining feature variables comprises:
comparing the information quantity of each remaining feature variable with an information quantity threshold, and taking the remaining feature variables whose information quantity is greater than the information quantity threshold as the second feature variables.
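The screening described in claims 4 and 5 can be sketched as a split of the ranked list at the key feature variable, followed by a threshold filter on the remainder. This section does not define how the information quantity is computed, so the sketch takes it as a precomputed mapping. The function name `screen_features`, the feature names and the numeric values are all hypothetical.

```python
def screen_features(ranked, key_feature, info_quantity, threshold):
    """Keep everything ranked up to the key feature, plus any later
    feature whose information quantity exceeds the threshold.

    ranked:        feature names, best information gain rate first
    key_feature:   the name selected by principal component analysis
    info_quantity: mapping from feature name to a precomputed score
    """
    cut = ranked.index(key_feature) + 1
    first = ranked[:cut]        # includes the key feature itself
    remaining = ranked[cut:]    # everything ranked after the key feature
    second = [f for f in remaining if info_quantity[f] > threshold]
    return first + second


# Hypothetical ranked features and information quantities.
ranked_names = ["income", "age", "region", "tenure", "channel"]
quantities = {"region": 0.02, "tenure": 0.15, "channel": 0.30}
one_dimensional = screen_features(ranked_names, "age", quantities, 0.1)
```

Here `region` is dropped because its information quantity does not exceed the threshold, while `tenure` and `channel` are retained alongside the first feature variables.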
6. The method according to claim 1, wherein taking the one-dimensional feature variables as feature variables to be processed and performing information gain rate ranking, principal component analysis processing and information quantity calculation on the feature variables to be processed to obtain screened feature variables comprises:
taking the one-dimensional feature variables as the feature variables to be processed, inputting the feature variables to be processed into a second decision tree to calculate the information gain rate of the feature variables to be processed, and ranking the feature variables to be processed by information gain rate to obtain second ranked feature variables;
performing principal component analysis processing on the feature variables to be processed to obtain a second key feature variable;
and screening the second ranked feature variables and calculating the information quantity according to the second key feature variable to obtain the screened feature variables, wherein the depth of the second decision tree is consistent with the dimension of the screened feature variables.
7. The method according to claim 1, wherein the preset condition is that the information gain rate is zero; and wherein, if the information gain rate of the feature combination variable does not meet the preset condition, taking the feature combination variable and the one-dimensional feature variables as new feature variables to be processed, and performing information gain rate ranking, principal component analysis processing, information quantity calculation and cross combination on the new feature variables to be processed to obtain a feature combination variable, comprises:
if the information gain rate of the feature combination variable is not zero, taking the feature combination variable and the one-dimensional feature variables as new feature variables to be processed;
increasing the depth of the previous decision tree by one depth unit to obtain the depth of the current decision tree, and inputting the new feature variables to be processed into the current decision tree to calculate the information gain rate of the new feature variables to be processed;
and performing information gain rate ranking, principal component analysis processing, information quantity calculation and cross combination on the new feature variables to be processed to obtain a feature combination variable, wherein the dimension of the feature combination variable is consistent with the depth of the current decision tree.
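The loop implied by claim 7 can be sketched as a driver that repeats screening and cross combination, deepening the tree by one unit each round, until the combination's information gain rate is zero. The screening, combination and gain rate steps are passed in as callables, because their details are set out in the other claims. The function name `iterate_until_zero_gain`, the `start_depth`, `max_depth` and `tol` parameters, and the toy stand-in functions are assumptions made for illustration; the claims do not mention an iteration cap or a numeric tolerance.

```python
def iterate_until_zero_gain(one_dim, screen, combine, gain_rate,
                            start_depth=2, max_depth=10, tol=1e-12):
    """Repeat screen -> combine -> check, adding one depth unit per round.

    Returns the target combination and the tree depth at which its
    information gain rate reached zero (within `tol`).
    """
    pending = list(one_dim)
    depth = start_depth
    while depth <= max_depth:
        screened = screen(pending, depth)
        combo = combine(screened)
        if abs(gain_rate(combo, depth)) <= tol:
            return combo, depth  # preset condition met
        # Otherwise feed the combination back in with the one-dimensional
        # feature variables and go one level deeper.
        pending = list(one_dim) + [combo]
        depth += 1
    raise RuntimeError("gain rate never reached zero within max_depth")


# Toy stand-ins (hypothetical) so the driver can be run end to end.
def demo_screen(pending, depth):
    return pending[:depth]


def demo_combine(screened):
    return "*".join(screened)


def demo_gain_rate(combo, depth):
    return 0.0 if depth >= 3 else 0.5


target, final_depth = iterate_until_zero_gain(
    ["a", "b", "c"], demo_screen, demo_combine, demo_gain_rate)
```

With these stand-ins the first round produces a two-way combination with a nonzero gain rate, and the second round at depth 3 produces a three-way combination that meets the condition, so the combination's dimension matches the tree depth as the claim requires.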
8. A service data processing apparatus, characterized in that the apparatus comprises:
a first feature acquisition module, configured to acquire feature variables of the service data from the database in which the service data is stored;
a second feature acquisition module, configured to obtain discrete feature variables from the feature variables;
a first processing module, configured to perform information gain rate ranking, principal component analysis processing and information quantity calculation on the discrete feature variables to obtain one-dimensional feature variables;
a second processing module, configured to take the one-dimensional feature variables as feature variables to be processed, and to perform information gain rate ranking, principal component analysis processing and information quantity calculation on the feature variables to be processed to obtain screened feature variables;
a combination module, configured to cross-combine the screened feature variables to obtain a feature combination variable;
wherein the second processing module is further configured to, if the information gain rate of the feature combination variable does not satisfy a preset condition, take the feature combination variable and the one-dimensional feature variables as new feature variables to be processed, and perform information gain rate ranking, principal component analysis processing, information quantity calculation and cross combination on the new feature variables to be processed to obtain a feature combination variable;
and a judging module, configured to take the feature combination variable as a target combination variable of the service data if the information gain rate of the feature combination variable satisfies the preset condition.
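The cross combination performed by the combination module can be sketched as pairing discrete columns row by row. The claims do not specify how two variables are crossed, so representing each combined value as a tuple of the original values is an assumption. The function name `cross_combine`, the `order` parameter and the sample columns are hypothetical.

```python
from itertools import combinations


def cross_combine(columns, order=2):
    """Build one combined column for every `order`-sized subset of columns.

    columns: mapping from feature name to a list of discrete values
    Each combined value is the tuple of the source values in that row.
    """
    combined = {}
    for names in combinations(sorted(columns), order):
        rows = zip(*(columns[n] for n in names))
        combined["*".join(names)] = [tuple(r) for r in rows]
    return combined


# Hypothetical screened feature variables.
screened = {"age_bin": [0, 1, 1], "region": ["N", "S", "N"]}
crossed = cross_combine(screened)
```

Each crossed column can then be scored with the same information gain rate calculation applied to single variables, since it is itself a discrete column.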
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 7.
10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202110441885.9A 2021-04-23 2021-04-23 Service data processing method and device, computer equipment and storage medium Pending CN115222177A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110441885.9A CN115222177A (en) 2021-04-23 2021-04-23 Service data processing method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110441885.9A CN115222177A (en) 2021-04-23 2021-04-23 Service data processing method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115222177A true CN115222177A (en) 2022-10-21

Family

ID=83606129

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110441885.9A Pending CN115222177A (en) 2021-04-23 2021-04-23 Service data processing method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115222177A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115544902A (en) * 2022-11-29 2022-12-30 四川骏逸富顿科技有限公司 Pharmacy risk level identification model generation method and pharmacy risk level identification method

Similar Documents

Publication Publication Date Title
Kant et al. Merging user and item based collaborative filtering to alleviate data sparsity
CN111815415B (en) Commodity recommendation method, system and equipment
CN110503531B (en) Dynamic social scene recommendation method based on time sequence perception
Li et al. News recommendation via hypergraph learning: encapsulation of user behavior and news content
US11243992B2 (en) System and method for information recommendation
CN106251174A (en) Information recommendation method and device
Maldonado et al. Advanced conjoint analysis using feature selection via support vector machines
Kim et al. Recommendation system for sharing economy based on multidimensional trust model
CN113610240A (en) Method and system for performing predictions using nested machine learning models
Wang et al. A personalized electronic movie recommendation system based on support vector machine and improved particle swarm optimization
Li et al. From reputation perspective: A hybrid matrix factorization for QoS prediction in location-aware mobile service recommendation system
Wang et al. Modeling uncertainty to improve personalized recommendations via Bayesian deep learning
Choi et al. Quality evaluation and best service choice for cloud computing based on user preference and weights of attributes using the analytic network process
Chen et al. DPM-IEDA: dual probabilistic model assisted interactive estimation of distribution algorithm for personalized search
CN111178986A (en) User-commodity preference prediction method and system
CN115222177A (en) Service data processing method and device, computer equipment and storage medium
CN111475744B (en) Personalized position recommendation method based on ensemble learning
Lilhore et al. Hybrid weighted random forests method for prediction & classification of online buying customers
Guan et al. Enhanced SVD for collaborative filtering
CN116861070A (en) Recommendation model processing method, device, computer equipment and storage medium
Kilani et al. Using artificial intelligence techniques in collaborative filtering recommender systems: Survey
CN114693409A (en) Product matching method, device, computer equipment, storage medium and program product
Saha et al. A modified Brown and Gibson model for cloud service selection
Ghaznavi et al. Assessing usage of negative similarity and distrust information in CF-based recommender system
Mendikowski et al. Creating Customers That Never Existed: Synthesis of E-commerce Data Using CTGAN

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination