CN112308705A

CN112308705A - Method, equipment and medium for identifying out-of-service worker based on bank data

Info

Publication number: CN112308705A
Application number: CN202011237043.3A
Authority: CN
Inventors: 尹卓英; 龙军
Original assignee: Central South University
Current assignee: Central South University
Priority date: 2020-11-09
Filing date: 2020-11-09
Publication date: 2021-02-02

Abstract

The invention discloses a method, a device and a medium for identifying out-of-service workers based on bank data. The method comprises the following steps: acquiring a large amount of customer data from a bank, extracting customer features related to outgoing labels from the customer data as the feature vectors of samples, using the outgoing labels registered at account opening as sample labels, and constructing a weakly supervised sample set; selecting a part of the samples from the weakly supervised sample set and using the results of manual verification as sample labels to construct a strongly supervised sample set; constructing a classification model and training it with the weakly supervised sample set and the strongly supervised sample set to obtain an out-of-service worker identification model; and extracting the customer features from the customer data of a customer to be identified, inputting them into the out-of-service worker identification model, and outputting whether the customer to be identified is an out-of-service worker. By dividing the samples into strongly and weakly supervised sets, the method reduces the difficulty of acquiring training samples, preserves sample diversity, improves the generalization performance of the identification model, and achieves high identification accuracy.

Description

Method, equipment and medium for identifying out-of-service worker based on bank data

Technical Field

The invention relates to the technical field of identifying specific customer groups of banks, and in particular to a method, equipment and medium for identifying out-of-service workers based on bank data.

Background

With the development of related technologies, the financial industry is undergoing tremendous change. Commercial banks face serious challenges from the rise of emerging financial companies in areas such as wealth management and internet finance. Meanwhile, as the operating models of commercial banks mature, the differences in service level among banks are shrinking and the banks are becoming homogeneous. Tailoring marketing strategies to the characteristics of a bank's own customers is therefore an important link in winning the increasingly fierce competition.

Guizhou is a labor-exporting province, and Guizhou agriculture trust has made great efforts and contributions in serving out-of-service workers, providing rural migrant workers with financial knowledge, income-generating information and rights-protection assistance, and giving this group the greatest care. Accurately and effectively identifying rural migrant workers greatly helps subsequent service and precision marketing.

Traditionally, out-of-service workers are identified from the outgoing label that the account opening registration system records when the customer first contacts the bank. Because of population mobility and the uncertain timeliness of the label, its accuracy declines sharply over time. Some labels can be corrected through telephone inquiries and return visits by service staff, but the cost is too high for large-scale, periodic operation. Identification by expert rules can also achieve some results, but the rules are usually limited by subjective cognition and some of them are difficult to make computable, so only a limited part of the target group can be identified; moreover, the expert rules must be updated and maintained manually over time to remain effective.

Neither the efficiency nor the accuracy of these methods can be guaranteed when identifying out-of-service workers. At present, machine learning methods extract customer feature data, combine it with sample data of guaranteed accuracy, and automatically learn identification rules through a supervised learning algorithm to identify the target group. However, whether identification is accurate and effective, and whether the learned rules generalize, depends on how customer features are extracted and whether accurate training samples can be obtained. Although most banks manage customers by dividing them according to attributes such as assets, preferences and liabilities, commercial bank customer data is often high-dimensional and complex, which makes feature selection harder. Meanwhile, the accuracy of the labels recorded at account opening cannot be guaranteed, and obtaining a large number of accurate sample labels by manual investigation is costly; if too few samples are obtained, the learned model has high variance and poor generalization. Obtaining samples that meet the requirements of the learning algorithm at the lowest possible cost is therefore a challenge.

Disclosure of Invention

The invention aims to provide a method, equipment and a medium for identifying out-of-service workers based on bank data, which can obtain an identification model of the out-of-service workers at the lowest cost and have higher accuracy and generalization.

In order to achieve the technical purpose, the invention adopts the following technical scheme:

a method for identifying out-of-service workers based on belief learning comprises the following steps:

step 1, constructing a supervision sample set;

acquiring a large amount of customer data of a bank, extracting customer features related to outgoing labels from the customer data, using the customer features as feature vectors of samples, using the outgoing labels registered at account opening as sample labels, and constructing a weakly supervised sample set $W_D$;

selecting a part of the samples from the weakly supervised sample set, manually verifying whether the corresponding customers have gone out to work, and constructing a strongly supervised sample set $S_D$ by using the verification results as sample labels;

step 2, constructing a classification model, and training the classification model with the weakly supervised sample set $W_D$ and the strongly supervised sample set $S_D$ to obtain an out-of-service worker identification model;

step 3, extracting the customer features from the customer data of the customer to be identified, inputting the customer features into the out-of-service worker identification model, and outputting whether the customer to be identified is an out-of-service worker.

In a more preferred technical scheme, the customer features comprise four types of features: customer basic attributes, transaction data, location-related transaction data, and asset and liability information. The customer basic attributes comprise the sex, age and family age of the customer; the transaction data comprises income and consumption data and offline deposit and withdrawal data within a preset time period; the location-related transaction data refers to non-local income and consumption data; the asset and liability information includes time deposit, demand deposit and loan information.

In a more preferred technical scheme, among the customer basic attribute features, if the gender in the customer data is unknown, or the age or the family age exceeds its corresponding preset value, the corresponding feature value in the customer features is set as missing; among the transaction data features and the location-related transaction data features, the value obtained by applying a logarithmic conversion to the real amount data is used as the corresponding feature value.
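The cleaning rules in this scheme can be illustrated with a minimal Python sketch. The field names (`gender`, `age`, `family_age`), the preset limits and the use of `log(1 + amt)` are all assumptions made for illustration; the text only states that a preset value exists and that a logarithmic conversion is applied, without giving either.

```python
import math
from typing import Optional

# Hypothetical preset limits; the source only refers to "a corresponding preset value".
MAX_AGE = 100
MAX_FAMILY_AGE = 100


def clean_basic_attributes(record: dict) -> dict:
    """Return a copy of the record with implausible basic attributes set to None (missing)."""
    cleaned = dict(record)
    # Gender codes "M"/"F" are assumed; anything else is treated as unknown.
    if cleaned.get("gender") not in ("M", "F"):
        cleaned["gender"] = None
    for field, limit in (("age", MAX_AGE), ("family_age", MAX_FAMILY_AGE)):
        value = cleaned.get(field)
        if value is None or value > limit:
            cleaned[field] = None
    return cleaned


def log_amount(amt: Optional[float]) -> Optional[float]:
    """Compress a non-negative amount with log(1 + amt); the exact formula is an assumption."""
    if amt is None:
        return None  # keep missing values missing
    return math.log1p(amt)
```

Missing values are left as `None` rather than imputed, which matches the later remark that the chosen model tolerates missing values.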

In a more preferable technical scheme, in the step 1, when a large amount of customer data of a bank is acquired, customer data with account opening time within a preset range is selected.
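A minimal sketch of this selection, assuming each customer record carries a hypothetical `open_date` field; the bounds of the preset range are not given in the text and are passed in as parameters.

```python
from datetime import date
from typing import Dict, List


def select_by_opening_date(customers: List[Dict], start: date, end: date) -> List[Dict]:
    """Keep only customers whose account opening date lies in [start, end]."""
    return [c for c in customers if start <= c["open_date"] <= end]
```

Restricting the range keeps the registered outgoing labels relatively fresh, at the price of a smaller weakly supervised sample set.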

In a more preferred embodiment, the specific steps of training the classification model with the weakly supervised sample set $W_D$ and the strongly supervised sample set $S_D$ are as follows:

step 2.1, assigning the samples in the strongly supervised sample set $S_D$ and the weakly supervised sample set $W_D$ the weights $w_s$ and $w_w$ respectively, with $w_s > w_w$, and then merging them into a training set;

step 2.2, selecting the XGBoost algorithm, determining the hyperparameters of the XGBoost algorithm by cross-validation and grid search on the training set, and constructing a classification model $xgb_0$;

step 2.3, for each of multiple groups of weight combinations of the strongly supervised sample set $S_D$ and the weakly supervised sample set $W_D$, identifying the noise samples in the weakly supervised sample set through a belief learning algorithm by using the prediction results of the classification model and the sample labels, then updating the weights of the noise samples, and obtaining the optimal weighted sample set and classification model under each group of weight combinations over multiple iterations;

step 2.4, determining the optimal weight combination of the strongly supervised sample set $S_D$ and the weakly supervised sample set $W_D$ by calculating the evaluation index of the identification model under all weight combinations, and training a classification model with the XGBoost algorithm on the training set of that weight combination to obtain the final out-of-service worker identification model.
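The weighted merge of step 2.1 can be sketched as follows. The `Sample` type and the function name are hypothetical, and the sketch does not remove the strongly supervised samples from the weak set, since the text does not say whether they are deduplicated.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class Sample:
    features: Sequence[float]
    label: int      # 1 = out-of-service worker, 0 = otherwise
    weight: float   # per-sample training weight
    strong: bool    # True if the label was manually verified


def merge_with_weights(
    strong_set: List[Tuple[Sequence[float], int]],
    weak_set: List[Tuple[Sequence[float], int]],
    w_s: float,
    w_w: float,
) -> List[Sample]:
    """Merge S_D and W_D into one training set, giving verified samples the larger weight."""
    if not w_s > w_w:
        raise ValueError("expected w_s > w_w")
    merged = [Sample(x, y, w_s, True) for x, y in strong_set]
    merged += [Sample(x, y, w_w, False) for x, y in weak_set]
    return merged
```

With XGBoost, such weights would typically be supplied as per-sample weights at training time, for example through the `sample_weight` argument of the scikit-learn style `fit` method.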

In a more preferred technical scheme, the specific steps of step 2.3 are as follows:

step 2.3.1: setting multiple groups of weight combinations $W = \{(w_{s1}, w_{w1}), (w_{s2}, w_{w2}), \ldots\}$;

step 2.3.2: selecting one group of weight combinations $(w_{si}, w_{wi})$ as the initial weights $(w_{si}^{(0)}, w_{wi}^{(0)})$ of the respective samples;

step 2.3.3: training the classification model $xgb_0$ on the sample-weighted training set to obtain the classification model of the current iteration;

step 2.3.4: performing belief learning on the prediction results of the classification model and the original labels $y$ to compute the noise sample set, which consists of the noise sample subset of the weakly supervised sample set and the noise sample subset of the strongly supervised sample set, each obtained at the $t$-th weight iteration under the $i$-th group of weight combinations;

step 2.3.5: updating the weights of the noise sample subset: for the $j$-th noise sample in the noise sample subset, its weight $w_{ij}^{(t)}$ is updated at the $t$-th iteration as $w_{ij}^{(t)} \leftarrow w_{ij}^{(0)} \times \alpha^{t} = w_{ij}^{(t-1)} \times \alpha$, where $t$ is the current iteration number and $\alpha$ is the weight attenuation coefficient with $0 < \alpha < 1$;

step 2.3.6: repeating steps 2.3.3 to 2.3.5 for the current weight combination until the evaluation index of the classification model converges, that is, until the evaluation index no longer changes with the weight update, thereby obtaining the optimal classification model under the current weight combination and the corresponding evaluation index value, where $T$ is the number of iterations at which the evaluation index converges;

step 2.3.7: selecting different weight combinations and carrying out steps 2.3.3 to 2.3.6 for each; and selecting, from all weight combinations, the classification model with the highest evaluation index as the final out-of-service worker identification model.
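The control flow of steps 2.3.2 to 2.3.7 can be sketched in Python as below. This is only an outline under stated assumptions: `train_eval` (train a model on the weighted set and return it with its evaluation index) and `find_noise` (return the indices of samples flagged as noisy) are hypothetical callbacks standing in for XGBoost training and the belief learning step, and the tolerance `tol` and iteration cap `max_iter` are illustrative convergence settings not given in the text.

```python
from typing import Any, Callable, List, Optional, Sequence, Set, Tuple

TrainEval = Callable[[List[float]], Tuple[Any, float]]
FindNoise = Callable[[Any], Set[int]]


def decay_noise_weights(weights: Sequence[float], noisy: Set[int], alpha: float) -> List[float]:
    """Step 2.3.5: multiply the weight of every flagged sample by alpha (0 < alpha < 1)."""
    return [w * alpha if j in noisy else w for j, w in enumerate(weights)]


def fit_one_combination(init_weights: Sequence[float], train_eval: TrainEval,
                        find_noise: FindNoise, alpha: float = 0.5,
                        tol: float = 1e-6, max_iter: int = 20):
    """Steps 2.3.3 to 2.3.6 for one weight combination."""
    weights = list(init_weights)
    model, score = train_eval(weights)                 # step 2.3.3
    for _ in range(max_iter):
        noisy = find_noise(model)                      # step 2.3.4
        weights = decay_noise_weights(weights, noisy, alpha)
        new_model, new_score = train_eval(weights)     # retrain with decayed weights
        converged = abs(new_score - score) <= tol      # step 2.3.6
        model, score = new_model, new_score
        if converged:
            break
    return model, score, weights


def best_over_combinations(combos: Sequence[Tuple[float, float]], n_strong: int,
                           n_weak: int, train_eval: TrainEval,
                           find_noise: FindNoise, **kwargs):
    """Step 2.3.7: repeat for every (w_s, w_w) pair and keep the best-scoring result."""
    best: Optional[tuple] = None
    for w_s, w_w in combos:
        init = [w_s] * n_strong + [w_w] * n_weak       # step 2.3.2
        result = fit_one_combination(init, train_eval, find_noise, **kwargs)
        if best is None or result[1] > best[1]:
            best = result
    return best
```

A sample flagged in every one of $t$ iterations ends with weight $w^{(0)}\alpha^{t}$, which matches the update rule of step 2.3.5; for example, with $\alpha = 0.5$ an initial weight of 0.4 becomes 0.05 after three iterations.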

In a more preferred technical scheme, the classification accuracy is used as the evaluation index.

In a more preferred technical scheme, the belief learning algorithm used in step 2.3.4 is based on the classification noise process assumption, that is, label noise is class-conditional: it depends only on the "potentially correct" class and not on the data. The algorithm identifies noisy sample labels by estimating the conditional probability distribution between the predicted labels and the potentially correct sample labels.
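One common way to realize this kind of class-conditional noise detection is the per-class threshold rule used in what the literature calls confident learning. The sketch below is a simplified illustration of that idea and is not claimed to be the exact procedure of the invention; the function names are hypothetical, and `probs` is assumed to hold predicted class probabilities for each sample (ideally obtained out of sample).

```python
from typing import List, Sequence


def class_thresholds(probs: Sequence[Sequence[float]], labels: Sequence[int],
                     n_classes: int) -> List[float]:
    """t_k = mean predicted probability of class k over the samples labeled k."""
    thresholds = []
    for k in range(n_classes):
        col = [p[k] for p, y in zip(probs, labels) if y == k]
        # A class with no labeled samples can never be confidently predicted.
        thresholds.append(sum(col) / len(col) if col else float("inf"))
    return thresholds


def find_noisy_indices(probs: Sequence[Sequence[float]], labels: Sequence[int],
                       n_classes: int = 2) -> List[int]:
    """Flag samples that are confidently predicted as a class other than their given label."""
    t = class_thresholds(probs, labels, n_classes)
    noisy = []
    for idx, (p, y) in enumerate(zip(probs, labels)):
        confident = [k for k in range(n_classes) if p[k] >= t[k]]
        if not confident:
            continue  # no class clears its threshold: leave the sample alone
        guess = max(confident, key=lambda k: p[k])
        if guess != y:
            noisy.append(idx)
    return noisy
```

Because each threshold is computed per class, the rule adapts to a classifier that is systematically more or less confident about one class, which is the practical consequence of assuming that noise depends only on the class.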

The present invention also provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the processor implements any of the above-mentioned method technical solutions.

The present invention also provides a computer-readable storage medium in which a computer program is stored, wherein the computer program, when executed by a processor, implements any of the above-mentioned method technical solutions.

Advantageous effects

The invention has the beneficial effects that:

1) there is no need to spend excessive human resources obtaining a large number of labeled samples through investigation and telephone visits; only a small number of verified samples are needed to construct the strongly supervised sample set, and model training is completed together with the weakly supervised sample set, which reduces cost;

2) by combining belief learning with weight updating, noisy samples (the weakly supervised sample set) can be added to the training set; although these samples contain noise, the method reduces the negative influence of noise samples on the prediction model by lowering their weights, which is equivalent to purifying the noisy data to some extent; adding the purified data to the training set increases sample diversity and improves the generalization ability of the model;

3) the XGBoost algorithm from the gradient boosting family is used; the algorithm is insensitive to missing values, so no imputation of missing values is needed; model complexity is constrained by explicit regularization, which helps avoid overfitting; meanwhile, the second-order approximation of the objective function, parallel feature sorting and split gain calculation accelerate model convergence and the construction of the subtrees in the model;

4) the feature extraction method incorporates expert rules; selecting customer features in this way reduces feature exploration and selection while capturing the feature indexes that describe the bank's target customer group, decomposes qualitative indexes into combinations of quantitative indexes, and allows the identification model to be built and trained simply and effectively;

5) for subsequent model updates, because the classification model is trained by dividing noisy data and accurate data into a weakly supervised sample set and a strongly supervised sample set and merging them into a mixed sample set with suitable weights, later iterations only require appropriate adjustment, additions to the strongly supervised sample set, and retraining of the model.
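As background to item 3), the regularization and the second-derivative approximation it mentions correspond to the standard XGBoost objective, summarized here for reference rather than as part of the invention. The symbols are conventional ones that do not appear in the original text; $J$ (number of leaves) and $v$ (leaf scores) are renamed here to avoid clashing with the iteration count $T$ and the sample weights $w$ used elsewhere.

```latex
% Objective minimized when adding the t-th tree f_t, after a second-order Taylor expansion
\mathcal{L}^{(t)} \approx \sum_{i=1}^{n} \left[ g_i\, f_t(x_i) + \tfrac{1}{2}\, h_i\, f_t(x_i)^2 \right] + \Omega(f_t),
\qquad
\Omega(f) = \gamma J + \tfrac{1}{2}\, \lambda \sum_{m=1}^{J} v_m^2
% g_i and h_i are the first and second derivatives of the loss with respect to the
% previous prediction \hat{y}_i^{(t-1)}; \gamma and \lambda are regularization strengths.
```

The term $\Omega$ penalizes trees with many leaves or large leaf scores, which is how model complexity is constrained.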

Drawings

FIG. 1 is a general flow diagram of the process of the present invention;

FIG. 2 is a flow chart of a method of constructing training sample data;

FIG. 3 is a flow chart of a method of constructing a classification model of the present invention;

FIG. 4 is a flow chart of out-of-service worker identification according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

This embodiment provides a method for identifying out-of-service workers based on bank data, which, as shown in fig. 1, comprises the following steps:

step 1, constructing a supervision sample set with reference to fig. 2;

step 1.1, constructing a weak supervision sample set

Because of population mobility, the outgoing label registered when a customer opens an account becomes uncertain over time, which reduces the accuracy of estimating the customer's actual outgoing status. Within a short time range, however, this disturbance is relatively small, so data within that range retains a certain accuracy. Therefore, this embodiment first obtains a large amount of customer data of the bank within a suitable time range, extracts the customer features associated with outgoing labels from the customer data as the feature vectors of samples, uses the outgoing labels registered at account opening as sample labels, and constructs the weakly supervised sample set $W_D$. This avoids spending excessive human resources on obtaining a large number of labeled strongly supervised samples through investigation and telephone visits; the weakly supervised samples balance timeliness-related accuracy against the diversity of sample types, and improve the generalization ability of the out-of-service worker identification model.

The customer features are determined according to expert rules. Compared with blindly extracting features from every attribute dimension of the customer data, this makes it easier to capture the characteristics of the bank's target customer group, greatly reduces the feature extraction work, helps improve identification accuracy and generalization ability, and effectively shortens learning time.

Specifically, the customer features are extracted as follows. The expert rules that were studied involve three major categories: geographic location, transaction data and customer attributes. Some of these rules are difficult to express in a computable, mathematical form (for example, using a non-local bank card to regularly deposit money into a local bank card in a certain area), and missing data greatly reduces the validity of the rules. Therefore, this embodiment combines the expert rules with supplementary asset and liability information, and the customer features finally extracted comprise four types: customer basic attributes, transaction data, location-related transaction data, and asset and liability information. The customer basic attributes comprise the sex, age and family age of the customer; the transaction data comprises income and consumption data and offline deposit and withdrawal data within a preset time period; the location-related transaction data refers to non-local income and consumption data; the asset and liability information includes time deposit, demand deposit and loan information. The customer features are shown in Table 1:

TABLE 1 characterization details

In this embodiment, 62 customer features are extracted, which directly or indirectly contain the judgment indexes involved in the expert rules. Because of data quality problems, the features are cleaned as follows: (1) among the customer basic attribute features, if the gender is unknown, or the age or family age exceeds its corresponding preset value, the corresponding feature value in the customer features is set as missing; (2) among the transaction data features and the location-related transaction data features, the values of amount-type fields differ too widely, so to reduce the influence of the excessive scale, the value obtained by applying a logarithmic conversion to the real amount data is used as the corresponding feature value. In the conversion, amt denotes the real amount data and the converted value is the resulting feature value.

Step 1.2, constructing a strong supervision sample set

Selecting a suitable number of samples from the weakly supervised sample set, manually verifying whether the corresponding customers have gone out to work, and constructing the strongly supervised sample set $S_D$ using the verification results as sample labels. Because the strongly supervised samples are manually verified, their labels are highly accurate.
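Steps 1.1 and 1.2 together can be sketched as follows. The record fields (`id`, `opening_label`), the `extract_features` and `verify` callbacks, and the use of uniform random sampling are assumptions for illustration; the text only says that a suitable number of samples is selected and verified manually.

```python
import random
from typing import Callable, Dict, List, Tuple

Row = Tuple[int, List[float], int]  # (customer id, feature vector, label)


def build_weak_set(customers: List[Dict],
                   extract_features: Callable[[Dict], List[float]]) -> List[Row]:
    """Weak set W_D: features plus the outgoing label recorded at account opening."""
    return [(c["id"], extract_features(c), c["opening_label"]) for c in customers]


def build_strong_set(weak_set: List[Row], verify: Callable[[int], int],
                     k: int, seed: int = 0) -> List[Row]:
    """Strong set S_D: k samples drawn from W_D, relabeled with the verified result."""
    rng = random.Random(seed)          # fixed seed so the selection is reproducible
    chosen = rng.sample(weak_set, k)   # uniform sampling is an assumption
    return [(cid, x, verify(cid)) for cid, x, _ in chosen]
```

Here `verify` stands in for the manual check (for example a telephone return visit), so the size `k` directly controls the labeling cost.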

Step 2, constructing a classification model, and training the classification model with the weakly supervised sample set $W_D$ and the strongly supervised sample set $S_D$ to obtain an out-of-service worker identification model; referring to fig. 3, this specifically includes:

step 2.1, assigning the samples in the strongly supervised sample set $S_D$ and the weakly supervised sample set $W_D$ the weights $w_s$ and $w_w$ respectively, with $w_s > w_w$, and then merging them into a training set;

step 2.2, selecting the XGBoost algorithm, determining the hyperparameters of the XGBoost algorithm by cross-validation and grid search on the training set, and constructing a classification model $xgb_0$; the hyperparameters are the experimentally optimal solution of $\arg\min_{\theta} \mathrm{loss}(y, xgb(x; \theta))$. The overall accuracy at this stage is generally low, which is related to the quality of the weakly supervised data set, but this does not prevent well-behaved hyperparameters from being determined.

Step 2.3, for each of multiple groups of weight combinations of the strongly supervised sample set $S_D$ and the weakly supervised sample set $W_D$, identifying the noise samples in the weakly supervised sample set through a belief learning algorithm by using the prediction results of the classification model and the sample labels, then updating the weights of the noise samples, and obtaining the optimal weighted sample set and identification model under each group of weight combinations over multiple iterations; the weighted sample set is the training set formed by all samples together with their weights; this specifically comprises the following steps:

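step 2.3.1: setting multiple groups of weight combinations $W = \{(w_{s1}, w_{w1}), (w_{s2}, w_{w2}), \ldots\}$;

step 2.3.2: selecting one group of weight combinations $(w_{si}, w_{wi})$ as the initial weights $(w_{si}^{(0)}, w_{wi}^{(0)})$ of the respective samples;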

step 2.3.3: training the classification model $xgb_0$ with the XGBoost algorithm on the sample-weighted training set to obtain the classification model of the current iteration;

step 2.3.4: using the prediction results of the classification model to compute the noise sample set, which includes the noise sample subset of the weakly supervised sample set obtained at the $t$-th weight iteration under the $i$-th group of weight combinations;

the belief learning algorithm used in step 2.3.4 is based on the classification noise process assumption, that is, label noise is class-conditional: it depends only on the "potentially correct" class and not on the data. The algorithm identifies noisy sample labels by estimating the conditional probability distribution between the predicted labels and the potentially correct labels.

step 2.3.5: updating the weights of the noise sample subset: for the $j$-th noise sample in the noise sample subset, its weight $w_{ij}^{(t)}$ is updated at the $t$-th iteration as $w_{ij}^{(t)} \leftarrow w_{ij}^{(0)} \times \alpha^{t} = w_{ij}^{(t-1)} \times \alpha$, where $t$ is the current iteration number and $\alpha$ is the weight attenuation coefficient with $0 < \alpha < 1$;

step 2.3.6: repeating steps 2.3.3 to 2.3.5 for the current weight combination until the evaluation index of the classification model converges, that is, until the evaluation index no longer changes with the weight update, thereby obtaining the optimal weighted sample set and classification model under the current weight combination and the corresponding evaluation index value, where $T$ is the number of iterations at which the evaluation index converges; this embodiment uses accuracy as the evaluation index;

step 2.3.7: selecting different weight combinations to carry out the steps 2.3.3-2.3.6;

step 2.4, determining the optimal weight combination of the strongly supervised sample set $S_D$ and the weakly supervised sample set $W_D$ by calculating the evaluation index of the identification model under all weight combinations, and training a classification model with the optimal weighted sample set under that weight combination and the number of iterations at which the evaluation index converged, to obtain the final out-of-service worker identification model.
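The cross-validated grid search of step 2.2 can be sketched generically as below. The `fit` and `predict` callbacks are hypothetical placeholders for training and applying a weighted XGBoost model, the folds are contiguous and unshuffled for brevity (real data would normally be shuffled or stratified first), and accuracy is used as the score because the embodiment names it as the evaluation index.

```python
from itertools import product
from typing import Any, Callable, Dict, List, Sequence, Tuple


def k_fold_indices(n: int, k: int) -> List[List[int]]:
    """Split range(n) into k contiguous folds of near-equal size (assumes n >= k)."""
    folds, start = [], 0
    for f in range(k):
        size = n // k + (1 if f < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def grid_search_cv(X: Sequence, y: Sequence[int], weights: Sequence[float],
                   param_grid: Dict[str, Sequence], fit: Callable[..., Any],
                   predict: Callable[[Any, Any], int],
                   k: int = 3) -> Tuple[Dict[str, Any], float]:
    """Return the grid point with the highest mean held-out accuracy, and that accuracy."""
    names = sorted(param_grid)
    folds = k_fold_indices(len(X), k)
    best_params, best_score = None, -1.0
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for held_out in folds:
            held = set(held_out)
            train = [i for i in range(len(X)) if i not in held]
            model = fit([X[i] for i in train], [y[i] for i in train],
                        [weights[i] for i in train], **params)
            hits = sum(predict(model, X[i]) == y[i] for i in held_out)
            scores.append(hits / len(held_out))
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best_params, best_score = params, mean
    return best_params, best_score
```

In practice the same search is available off the shelf, for example through scikit-learn's `GridSearchCV` wrapped around an XGBoost estimator; the hand-written loop only makes the procedure explicit.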

Step 3, using the trained out-of-service worker identification model to predict the customers to be identified and identify out-of-service workers:

referring to fig. 4, a list of customers to be identified is first determined; then the 62 customer features required by the out-of-service worker identification model are extracted from the customer data of each customer to be identified and input into the model, which outputs whether the customer to be identified is an out-of-service worker.
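The identification step can be sketched as below, assuming the trained model is exposed as a callable that returns the probability of the out-of-service class. The function name, the `id` field and the 0.5 decision threshold are illustrative assumptions; only the count of 62 features comes from the embodiment.

```python
from typing import Callable, Dict, List

N_FEATURES = 62  # number of customer features stated in the embodiment


def identify_customers(customers: List[Dict],
                       extract_features: Callable[[Dict], List[float]],
                       model: Callable[[List[float]], float],
                       threshold: float = 0.5) -> Dict[str, bool]:
    """Map each customer id to True if the model flags an out-of-service worker."""
    results = {}
    for customer in customers:
        x = extract_features(customer)
        if len(x) != N_FEATURES:
            raise ValueError(f"expected {N_FEATURES} features, got {len(x)}")
        # The 0.5 cut-off is an assumption; the source does not state a threshold.
        results[customer["id"]] = model(x) >= threshold
    return results
```

Checking the feature count before prediction guards against applying the model to records that were prepared with a different feature extraction than the one used in training.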

The present invention also provides a computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of the above embodiment when executing the computer program.

The present invention also provides a computer-readable storage medium, which stores a computer program, wherein the computer program, when executed by a processor, implements the method of the above embodiment.

The above embodiments are preferred embodiments of the present application, and those skilled in the art can make various changes or modifications without departing from the general concept of the present application, and such changes or modifications should fall within the scope of the claims of the present application.

Claims

1. A method for identifying out-of-service workers based on bank data, characterized by comprising the following steps:

step 1, constructing a supervision sample set;

2. The method of claim 1, wherein the customer features comprise four types of features: customer basic attributes, transaction data, location-related transaction data, and asset and liability information; the customer basic attributes comprise the sex, age and family age of the customer; the transaction data comprises income and consumption data and offline deposit and withdrawal data within a preset time period; the location-related transaction data refers to non-local income and consumption data; the asset and liability information includes time deposit, demand deposit and loan information.

3. The method of claim 2, wherein, among the customer basic attribute features, if the gender in the customer data is unknown, or the age or the family age exceeds its corresponding preset value, the corresponding feature value in the customer features is set as missing; among the transaction data features and the location-related transaction data features, the value obtained by applying a logarithmic conversion to the real amount data is used as the corresponding feature value.

4. The method according to claim 1, wherein step 1 selects customer data with an account opening time within a preset range when acquiring a large amount of customer data of a bank.

5. The method according to any of claims 1-4, characterized in that the specific steps of training the classification model with the weakly supervised sample set $W_D$ and the strongly supervised sample set $S_D$ are:

constructing a classification model $xgb_0$;

step 2.3, for each of multiple groups of weight combinations of the strongly supervised sample set $S_D$ and the weakly supervised sample set $W_D$, identifying the noise samples in the weakly supervised sample set through a belief learning algorithm by using the prediction results of the classification model and the sample labels, then updating the weights of the noise samples, and obtaining the optimal weighted sample set and classification model under each group of weight combinations over multiple iterations;

step 2.4, determining the optimal weight combination of the strongly supervised sample set $S_D$ and the weakly supervised sample set $W_D$ by calculating the evaluation index of the identification model under all weight combinations, and training a classification model with the XGBoost algorithm on the training set of that weight combination to obtain the final out-of-service worker identification model.

6. The method according to claim 5, characterized in that the specific steps of step 2.3 are:

step 2.3.4: using the prediction results of the classification model to compute the noise sample set;

step 2.3.5: updating the weights of the noise sample subset, the weight of the $j$-th noise sample in the subset being updated at each iteration;

7. The method of claim 6, wherein the assessment indicator is classification accuracy.

8. The method according to claim 6, characterized in that the belief learning algorithm used in step 2.3.4 is based on the classification noise process assumption, that is, label noise is class-conditional and depends only on the "potentially correct" class and not on the data, and identifies noisy sample labels by estimating the conditional probability distribution between the predicted labels and the potentially correct sample labels.

9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 8 when executing the computer program.

10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 8.