CN112308705A - Method, equipment and medium for identifying out-of-service worker based on bank data - Google Patents

Info

Publication number
CN112308705A
Authority
CN
China
Prior art keywords
sample set
data
customer
sample
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011237043.3A
Other languages
Chinese (zh)
Inventor
尹卓英
龙军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University
Priority to CN202011237043.3A
Publication of CN112308705A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0201 Market modelling; Market analysis; Collecting market data

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • General Physics & Mathematics (AREA)
  • Development Economics (AREA)
  • Data Mining & Analysis (AREA)
  • Strategic Management (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Business, Economics & Management (AREA)
  • Software Systems (AREA)
  • Marketing (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Artificial Intelligence (AREA)
  • Economics (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Game Theory and Decision Science (AREA)
  • Medical Informatics (AREA)
  • Technology Law (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a method, a device and a medium for identifying out-of-service workers based on bank data. The method comprises: acquiring a large amount of customer data of a bank, extracting customer features related to the outgoing label from the customer data as the feature vectors of samples, using the outgoing label registered at account opening as the sample label, and constructing a weakly supervised sample set; selecting some samples from the weakly supervised sample set and constructing a strongly supervised sample set using the result of manual verification as the sample label; constructing a classification model and training it with the weakly supervised sample set and the strongly supervised sample set to obtain an out-of-service worker identification model; and extracting the customer features from the customer data of a customer to be identified, inputting them into the out-of-service worker identification model, and outputting whether the customer to be identified is an out-of-service worker. By dividing the samples into strongly and weakly supervised sets, the method reduces the difficulty of acquiring training samples, ensures sample diversity, improves the generalization performance of the identification model, and achieves high identification accuracy.

Description

Method, equipment and medium for identifying out-of-service worker based on bank data
Technical Field
The invention relates to the technical field of identifying specific customer groups of a bank, and in particular to a method, equipment and a medium for identifying out-of-service workers based on bank data.
Background
With the development of related technologies, the financial industry is undergoing tremendous change. Commercial banks face serious challenges from emerging financial companies such as wealth-management and internet finance firms. Meanwhile, as the operating models of commercial banks mature, differences in service level among banks are shrinking and banks are becoming homogeneous. Tailoring marketing strategies to the characteristics of a bank's own customers is therefore an important part of winning in increasingly fierce competition.
Guizhou is a labor-exporting province, and Guizhou agriculture trust has made great efforts and contributions in serving people who go out to work, providing migrant workers with financial knowledge, wealth-building information and rights-protection help, and giving this group the greatest care. Accurately and effectively identifying migrant workers (the out-of-service workers of this document) greatly helps subsequent service and precision marketing.
In identifying out-of-service workers, the traditional method relies on the outgoing label recorded in the account-opening registration system when a customer first contacts the bank. Because people move and the label ages, its accuracy drops sharply over time. Some labels can be corrected through telephone inquiries and return visits by service staff, but the cost is too high for large-scale, periodic operation. Identification by expert rules can achieve some results, but the rules are usually limited by subjective cognition and some are difficult to make computable, so only a limited part of the target group can be identified; moreover, the rules must be updated and maintained manually over time to remain effective.
None of these methods can guarantee efficiency and accuracy in identifying out-of-service workers. A current alternative is machine learning: customer feature data are extracted and combined with sample data of assured accuracy, and a supervised learning algorithm automatically learns identification rules for the target group. However, whether the learned rules identify accurately and generalize depends on how customer features are extracted and whether accurate training samples can be obtained. Most banks manage customers by segmenting them according to attributes such as assets, preferences and liabilities, yet commercial bank customer data are high-dimensional and complex, which makes feature selection harder. Meanwhile, the accuracy of labels recorded at account opening cannot be guaranteed, and obtaining a large number of accurate sample labels by manual investigation is costly; if too few samples are obtained, the learned model has high variance and poor generalization. Obtaining samples that meet the requirements of the learning algorithm at the lowest possible cost is therefore a challenge.
Disclosure of Invention
The invention aims to provide a method, equipment and a medium for identifying out-of-service workers based on bank data, which obtain an out-of-service worker identification model at the lowest cost with high accuracy and good generalization.
To achieve this technical purpose, the invention adopts the following technical scheme:
A method for identifying out-of-service workers based on confident learning comprises the following steps:
Step 1, constructing supervised sample sets:
acquiring a large amount of customer data of a bank, extracting customer features related to the outgoing label from the customer data as the feature vectors of samples, using the outgoing label registered at account opening as the sample label, and constructing a weakly supervised sample set $W_D$;
selecting some samples from the weakly supervised sample set, manually verifying whether the corresponding customers have gone out to work, and constructing a strongly supervised sample set $S_D$ using the verification result as the sample label;
Step 2, constructing a classification model and training it with the weakly supervised sample set $W_D$ and the strongly supervised sample set $S_D$ to obtain an out-of-service worker identification model;
Step 3, extracting the customer features from the customer data of a customer to be identified, inputting them into the out-of-service worker identification model, and outputting whether the customer to be identified is an out-of-service worker.
In a more preferred technical scheme, the customer features comprise four types: customer basic attributes, transaction data, location-related transaction data, and asset and liability information. The customer basic attributes comprise the customer's gender, age and account age; the transaction data comprise income and consumption data and offline deposit and withdrawal data within a preset time period; the location-related transaction data refer to non-local income and consumption data; the asset and liability information comprises time deposit, demand deposit and loan information.
In a more preferred technical scheme, among the customer basic attribute features, if the gender in the customer data is unknown, or the age or the account age exceeds its corresponding preset value, the corresponding feature value is set as missing; among the transaction data features and the location-related transaction data features, the value obtained by logarithmic conversion of the real amount is used as the corresponding feature value.
In a more preferable technical scheme, in the step 1, when a large amount of customer data of a bank is acquired, customer data with account opening time within a preset range is selected.
In a more preferred technical scheme, the specific steps of training the classification model with the weakly supervised sample set $W_D$ and the strongly supervised sample set $S_D$ are as follows:
Step 2.1, assigning weights $w_s$ and $w_w$ to the samples in the strongly supervised sample set $S_D$ and the weakly supervised sample set $W_D$ respectively, with $w_s > w_w$, and then merging them into a training set;
Step 2.2, selecting the XGBoost algorithm, determining its hyperparameters $\hat{\theta}$ by cross validation and grid search on the training set, and constructing a classification model $xgb_0$;
Step 2.3, under multiple groups of weight combinations of the strongly supervised sample set $S_D$ and the weakly supervised sample set $W_D$, identifying noise samples in the weakly supervised sample set with the confident learning algorithm from the prediction results of the classification model and the sample labels, then updating the weights of the noise samples, and obtaining through multiple iterations the optimal weighted sample set and classification model under each group of weight combination;
Step 2.4, determining the optimal weight combination of the strongly supervised sample set $S_D$ and the weakly supervised sample set $W_D$ by calculating the evaluation index of the identification model under all weight combinations, and training a classification model with the XGBoost algorithm on the training set of that weight combination to obtain the final out-of-service worker identification model.
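Step 2.1 can be pictured with a minimal sketch. The function name `merge_weighted` and the `(feature vector, label)` tuple layout are assumptions made for illustration; the text only specifies that strongly supervised samples receive a larger weight than weakly supervised ones before the two sets are merged.

```python
from typing import List, Sequence, Tuple

# (feature vector, label); this layout is an assumption, not from the source.
Sample = Tuple[Sequence[float], int]


def merge_weighted(strong: List[Sample], weak: List[Sample],
                   w_s: float, w_w: float):
    """Merge S_D and W_D into one training set with per-sample weights."""
    if not w_s > w_w:
        raise ValueError("step 2.1 requires w_s > w_w")
    features, labels, weights = [], [], []
    for group, weight in ((strong, w_s), (weak, w_w)):
        for x, y in group:
            features.append(list(x))
            labels.append(y)
            weights.append(weight)
    return features, labels, weights


# Toy data: one manually verified sample and two account-opening labels.
X, y, w = merge_weighted(
    strong=[([1.0, 0.0], 1)],
    weak=[([0.5, 2.0], 0), ([0.1, 3.0], 1)],
    w_s=1.0,
    w_w=0.4,
)
```

The resulting weight list is the kind of per-sample weight that a learner supporting weighted training could consume, for example through the `sample_weight` argument of XGBoost's scikit-learn style `fit` method.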
In a more preferred technical scheme, the specific steps of step 2.3 are as follows:
Step 2.3.1: setting multiple groups of weight combinations $W = \{(w_{s1}, w_{w1}), (w_{s2}, w_{w2}), \ldots\}$;
Step 2.3.2: selecting one weight combination $(w_{si}, w_{wi})$ as the initial weights $(w_{si}^{(0)}, w_{wi}^{(0)})$ of the respective samples;
Step 2.3.3: training the classification model $xgb_0$ on the sample-weighted training set to obtain the classification model of the i-th weight combination at the t-th iteration;
Step 2.3.4: performing confident learning with the prediction results $\hat{y}$ of the classification model and the original labels $y$, and calculating the noise sample set, which consists of two subsets: the noise sample subset obtained from the weakly supervised sample set under the i-th weight combination at the t-th weight iteration, and the noise sample subset obtained from the strongly supervised sample set under the i-th weight combination at the t-th weight iteration;
Step 2.3.5: updating the weights of the noise sample subset: for the j-th noise sample in the subset, its weight $w_{ij}^{(t)}$ is updated at the t-th iteration as $w_{ij}^{(t)} \leftarrow w_{ij}^{(0)} \times \alpha^{t} = w_{ij}^{(t-1)} \times \alpha$, where $t$ is the current iteration number and $\alpha$ is the weight attenuation coefficient, with $0 < \alpha < 1$;
Step 2.3.6: repeating steps 2.3.3 to 2.3.5 for the current weight combination until the evaluation index of the classification model converges, that is, no longer changes as the weights are updated, thereby obtaining the optimal classification model under the current weight combination and the corresponding evaluation index value, where $T$ is the number of iterations at which the evaluation index converges;
Step 2.3.7: selecting different weight combinations and performing steps 2.3.3 to 2.3.6 for each; and selecting the classification model with the highest evaluation index among all weight combinations as the final out-of-service worker identification model.
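The control flow of steps 2.3.3 to 2.3.7 can be sketched as follows. This is only an illustration under stated assumptions: `train`, `find_noise` and `evaluate` are hypothetical callables standing in for XGBoost training, confident learning and the evaluation index, and the tolerance `tol` and cap `max_iter` are illustrative safeguards that do not appear in the text.

```python
from typing import Callable, List, Sequence, Set, Tuple


def reweight_until_converged(
    init_weights: Sequence[float],
    train: Callable[[List[float]], object],
    find_noise: Callable[[object], Set[int]],
    evaluate: Callable[[object], float],
    alpha: float = 0.5,
    tol: float = 1e-6,
    max_iter: int = 50,
) -> Tuple[object, List[float], float, int]:
    """Run steps 2.3.3-2.3.6 for one weight combination."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    weights = list(init_weights)
    prev_score = None
    for t in range(1, max_iter + 1):
        model = train(weights)                # step 2.3.3
        score = evaluate(model)
        if prev_score is not None and abs(score - prev_score) <= tol:
            return model, weights, score, t   # step 2.3.6: index converged
        for j in find_noise(model):           # step 2.3.4
            weights[j] *= alpha               # step 2.3.5: w <- w * alpha
        prev_score = score
    return model, weights, score, max_iter


def best_over_combinations(combos, run_one):
    """Step 2.3.7: keep the combination whose converged score is highest."""
    results = [(run_one(w_s, w_w), (w_s, w_w)) for w_s, w_w in combos]
    return max(results, key=lambda r: r[0][2])


# Toy stand-ins: the "model" is just the weight list, sample 2 is always
# flagged as noise, and the score rises once its weight drops below 0.15.
def run_one(w_s, w_w):
    return reweight_until_converged(
        init_weights=[w_s, w_w, w_w],
        train=lambda w: list(w),
        find_noise=lambda model: {2},
        evaluate=lambda model: 1.0 if model[2] < 0.15 else 0.5,
    )


result, combo = best_over_combinations([(1.0, 0.4), (1.0, 0.2)], run_one)
```

Because each flagged sample is multiplied by `alpha` once per iteration, a sample flagged in every one of `t` updates ends with weight `w0 * alpha ** t`, matching the closed form in step 2.3.5.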
In a more preferred technical scheme, classification accuracy is adopted as the evaluation index.
In a more preferred technical scheme, the confident learning algorithm used in step 2.3.4 is based on the classification noise process assumption: label noise is assumed to be class-conditional, depending only on the latent correct class and not on the data. The algorithm identifies noisy sample labels by estimating the conditional probability distribution between the predicted label probabilities and the latent correct sample labels.
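The noise identification of step 2.3.4 can be illustrated with a simplified confident-learning style rule. The sketch below is an assumption-laden reduction rather than the patent's exact procedure: it computes one threshold per class (the mean predicted probability of that class among samples carrying that label) and flags a sample when the class it confidently belongs to differs from its given label. The name `find_label_noise` is hypothetical.

```python
from typing import List, Sequence


def find_label_noise(probs: Sequence[Sequence[float]],
                     labels: Sequence[int]) -> List[int]:
    """Return indices whose given label disagrees with a confident prediction.

    probs[n][k] is the predicted probability that sample n belongs to class k.
    """
    n_classes = len(probs[0])
    # Per-class threshold: average predicted probability of class k over the
    # samples whose given label is k.
    thresholds = []
    for k in range(n_classes):
        members = [p[k] for p, y in zip(probs, labels) if y == k]
        thresholds.append(sum(members) / len(members) if members else 1.0)
    noisy = []
    for n, (p, y) in enumerate(zip(probs, labels)):
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if not confident:
            continue  # no class is confidently predicted; keep the label
        likely_true = max(confident, key=lambda k: p[k])
        if likely_true != y:
            noisy.append(n)  # given label and confident class disagree
    return noisy


# Toy binary example: sample 2 is labeled 1 but looks strongly like class 0.
toy_probs = [[0.10, 0.90], [0.20, 0.80], [0.90, 0.10],
             [0.85, 0.15], [0.90, 0.10], [0.70, 0.30]]
toy_labels = [1, 1, 1, 0, 0, 0]
flagged = find_label_noise(toy_probs, toy_labels)
```

Using class-specific thresholds instead of a fixed 0.5 cutoff is what lets the rule tolerate a model that is systematically more confident about one class than the other.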
The present invention also provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the processor implements any of the above-mentioned method technical solutions.
The present invention also provides a computer-readable storage medium in which a computer program is stored, wherein the computer program, when executed by a processor, implements any of the above method schemes.
Advantageous effects
The invention has the beneficial effects that:
1) no excessive human resources need to be spent on obtaining a large number of labeled samples through investigation and telephone visits; only a small number of verified samples are needed to construct the strongly supervised sample set, which completes model training together with the weakly supervised sample set, reducing cost;
2) by combining confident learning with weight updating, noisy samples (the weakly supervised sample set) can be added to the training set; although these samples contain noise, the method reduces the negative influence of noise samples on the prediction model by lowering their weights, which to some extent amounts to purifying the noisy data; adding the purified data to the training set increases sample diversity and improves the generalization ability of the model;
3) the XGBoost algorithm from the gradient boosting family is used; the algorithm is insensitive to missing values, so missing values need not be imputed; it constrains model complexity through explicit regularization, avoiding overfitting; and it accelerates model convergence and subtree construction through second-order approximation of the objective function, parallel feature sorting and feature split gain calculation;
4) the feature extraction method incorporates expert rules; selecting customer features in this way reduces feature exploration and selection while capturing the indexes that characterize the bank's target customer group, decomposes qualitative indexes into combinations of quantitative indexes, and allows the identification model to be built and trained simply and effectively;
5) for subsequent model updates, because the classification model is trained by dividing noisy data and accurate data into a weakly supervised sample set and a strongly supervised sample set and combining them with suitable weights into a mixed sample set, later iterations only require appropriate adjustment, additions to the strongly supervised data set, and retraining of the model.
Drawings
FIG. 1 is a general flow chart of the method of the present invention;
FIG. 2 is a flow chart of the method of constructing training sample data;
FIG. 3 is a flow chart of the method of constructing the classification model of the present invention;
FIG. 4 is a flow chart of the identification of out-of-service workers according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
This embodiment provides a method for identifying out-of-service workers based on bank data which, as shown in FIG. 1, comprises the following steps:
Step 1, constructing supervised sample sets, with reference to FIG. 2;
Step 1.1, constructing the weakly supervised sample set
Because people move, the outgoing label registered when a customer opens an account becomes a less accurate estimate of the customer's actual outgoing status as time passes. Within a short time range, however, the disturbance is relatively small, so data within that range retain a certain accuracy. This embodiment therefore first acquires a large amount of bank customer data within a suitable time range, extracts customer features associated with the outgoing label from the customer data as the feature vectors of samples, uses the outgoing label registered at account opening as the sample label, and constructs the weakly supervised sample set $W_D$. This avoids spending excessive human resources on obtaining a large number of labeled strongly supervised samples through investigation and telephone visits; the weakly supervised samples balance time-related label accuracy against the diversity of sample types, improving the generalization ability of the out-of-service worker identification model.
The customer features are determined according to expert rules. Compared with blindly extracting features from every attribute dimension of the customer data, this makes it easier to capture the characteristics of the bank's target customer group, greatly reduces the feature extraction work, helps improve identification accuracy and generalization ability, and effectively shortens learning time.
Specifically, the customer features are extracted as follows. The surveyed expert rules cover three major categories: geographic location, transaction data and customer attributes. Some rules are difficult to make computable or to characterize mathematically (for example, regularly remitting money from a non-local bank card to a local bank card in a given area), and missing data greatly reduces the effectiveness of the rules. This embodiment therefore combines the expert rules with additional asset and liability information, so that the finally extracted customer features comprise four types: customer basic attributes, transaction data, location-related transaction data, and asset and liability information. The customer basic attributes comprise the customer's gender, age and account age; the transaction data comprise income and consumption data and offline deposit and withdrawal data within a preset time period; the location-related transaction data refer to non-local income and consumption data; the asset and liability information comprises time deposit, demand deposit and loan information. The customer features are shown in Table 1:
Table 1: Feature details (provided as an image in the original publication)
In this embodiment, 62 customer features are extracted, which directly or indirectly contain the judgment indexes involved in the expert rules. Because of data quality problems, the customer features are cleaned as follows: (1) among the customer basic attribute features, if the gender is unknown, or the age or the account age exceeds its corresponding preset value, the corresponding feature value is set as missing; (2) among the transaction data features and the location-related transaction data features, the values of amount fields differ too widely, so to reduce the influence of the excessive scale, the value obtained by logarithmic conversion of the real amount is used as the corresponding feature value. In the logarithmic conversion formula (provided as an image in the original publication), amt is the real amount and the result is the feature value obtained after logarithmic conversion.
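The two cleaning rules can be sketched in a few lines. Several details here are assumptions because the text does not state them: the transform is taken to be a signed `log(1 + |amt|)`, the preset limits `max_age` and `max_account_age` are placeholder values, and gender is encoded as 0/1. Missing values are kept as NaN rather than imputed, in line with the later remark that XGBoost tolerates missing values.

```python
import math
from typing import List, Optional

NAN = float("nan")


def log_amount(amt: Optional[float]) -> float:
    """Compress a monetary amount; the log(1 + |amt|) form is assumed."""
    if amt is None:
        return NAN  # keep missing amounts missing
    return math.copysign(math.log1p(abs(amt)), amt)


def clean_basic_attributes(gender: Optional[str],
                           age: Optional[float],
                           account_age: Optional[float],
                           max_age: float = 100.0,
                           max_account_age: float = 60.0) -> List[float]:
    """Set unknown or out-of-range basic attributes to missing (NaN).

    The limits are hypothetical stand-ins for the "corresponding preset
    values" mentioned in the text.
    """
    gender_code = {"F": 0.0, "M": 1.0}.get(gender, NAN)
    age_value = NAN if age is None or age > max_age else float(age)
    if account_age is None or account_age > max_account_age:
        tenure = NAN
    else:
        tenure = float(account_age)
    return [gender_code, age_value, tenure]


row = clean_basic_attributes("unknown", 135, 12)
small, large = log_amount(100.0), log_amount(1_000_000.0)
```

On this assumed transform, amounts of 100 and 1,000,000 differ by a factor of ten thousand in raw form but only by about a factor of three after conversion, which is the scale reduction the cleaning step is after.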
Step 1.2, constructing a strong supervision sample set
Selecting a suitable number of samples from the weakly supervised sample set, manually verifying whether the corresponding customers have gone out to work, and constructing the strongly supervised sample set $S_D$ using the verification result as the sample label. Because the strongly supervised samples are verified manually, their labels are highly accurate.
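Step 1.2 can be sketched as drawing a subset of customers for manual verification and then pairing their feature vectors with the verified labels. The text does not say how the subset is chosen; uniform random sampling is assumed here, and the function names, customer ids and data layout are hypothetical.

```python
import random
from typing import Dict, List, Tuple


def pick_for_verification(weak_ids: List[str], n: int,
                          seed: int = 0) -> List[str]:
    """Draw n distinct customer ids from W_D (uniform sampling is assumed)."""
    rng = random.Random(seed)
    return rng.sample(weak_ids, n)


def build_strong_set(features: Dict[str, List[float]],
                     verified: Dict[str, int]) -> List[Tuple[List[float], int]]:
    """Pair each verified customer's feature vector with the verified label."""
    return [(features[cid], label) for cid, label in verified.items()]


# Toy data: four customers in W_D, two of them sent for manual verification.
features = {"c1": [0.2, 1.0], "c2": [0.9, 0.0],
            "c3": [0.4, 3.0], "c4": [0.7, 2.0]}
to_verify = pick_for_verification(sorted(features), n=2)
# The verified labels would come from staff follow-up; these are placeholders.
verified = {cid: 1 for cid in to_verify}
strong_set = build_strong_set(features, verified)
```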
Step 2, constructing a classification model and training it with the weakly supervised sample set $W_D$ and the strongly supervised sample set $S_D$ to obtain an out-of-service worker identification model; referring to FIG. 3, this specifically includes:
Step 2.1, assigning weights $w_s$ and $w_w$ to the samples in the strongly supervised sample set $S_D$ and the weakly supervised sample set $W_D$ respectively, with $w_s > w_w$, and then merging them into a training set;
Step 2.2, selecting the XGBoost algorithm, determining its hyperparameters $\hat{\theta}$ by cross validation and grid search on the training set, and constructing a classification model $xgb_0$; the hyperparameters $\hat{\theta}$ are the empirical optimal solution of $\arg\min_{\theta} \mathrm{loss}(y, \mathrm{xgb}(x; \theta))$. The overall accuracy at this stage is generally low, which is related to the quality of the weakly supervised data set, but this does not prevent well-performing hyperparameters $\hat{\theta}$ from being determined.
Step 2.3, under multiple groups of weight combinations of the strongly supervised sample set $S_D$ and the weakly supervised sample set $W_D$, identifying noise samples in the weakly supervised sample set with the confident learning algorithm from the prediction results of the classification model and the sample labels, then updating the weights of the noise samples, and obtaining through multiple iterations the optimal weighted sample set and identification model under each group of weight combination; the weighted sample set is the training set formed by all samples together with their weights. This specifically comprises:
Step 2.3.1: setting multiple groups of weight combinations $W = \{(w_{s1}, w_{w1}), (w_{s2}, w_{w2}), \ldots\}$;
Step 2.3.2: selecting one weight combination $(w_{si}, w_{wi})$ as the initial weights $(w_{si}^{(0)}, w_{wi}^{(0)})$ of the respective samples;
Step 2.3.3: training the classification model $xgb_0$ with the XGBoost algorithm on the sample-weighted training set to obtain the classification model of the i-th weight combination at the t-th iteration;
Step 2.3.4: performing confident learning with the prediction results $\hat{y}$ of the classification model and the original labels $y$, and calculating the noise sample set, which consists of two subsets: the noise sample subset obtained from the weakly supervised sample set under the i-th weight combination at the t-th weight iteration, and the noise sample subset obtained from the strongly supervised sample set under the i-th weight combination at the t-th weight iteration;
The confident learning algorithm used in step 2.3.4 is based on the classification noise process assumption: label noise is assumed to be class-conditional, depending only on the latent correct class and not on the data. The algorithm identifies noisy sample labels by estimating the conditional probability distribution between the predicted label probabilities and the latent correct labels.
Step 2.3.5: updating the weights of the noise sample subset: for the j-th noise sample in the subset, its weight $w_{ij}^{(t)}$ is updated at the t-th iteration as $w_{ij}^{(t)} \leftarrow w_{ij}^{(0)} \times \alpha^{t} = w_{ij}^{(t-1)} \times \alpha$, where $t$ is the current iteration number and $\alpha$ is the weight attenuation coefficient, with $0 < \alpha < 1$;
Step 2.3.6: repeating steps 2.3.3 to 2.3.5 for the current weight combination until the evaluation index of the classification model converges, that is, no longer changes as the weights are updated, thereby obtaining the optimal weighted sample set and classification model under the current weight combination together with the corresponding evaluation index value, where $T$ is the number of iterations at which the evaluation index converges; this embodiment adopts accuracy as the evaluation index;
Step 2.3.7: selecting different weight combinations and performing steps 2.3.3 to 2.3.6 for each;
Step 2.4, determining the optimal weight combination of the strongly supervised sample set $S_D$ and the weakly supervised sample set $W_D$ by calculating the evaluation index of the identification model under all weight combinations, and training a classification model with the optimal weighted sample set under that weight combination and the number of iterations at which the evaluation index converged, to obtain the final out-of-service worker identification model.
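The hyperparameter search of step 2.2 (grid search scored by cross validation) can be sketched generically. The callable `cv_loss` is a hypothetical stand-in for "train XGBoost with these parameters on the training folds and return the loss on the held-out fold"; the fold layout and the toy loss are assumptions. The parameter names `max_depth` and `eta` are ordinary XGBoost parameters, used here only as grid labels.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence


def k_fold_indices(n: int, k: int) -> List[List[int]]:
    """Split range(n) into k interleaved validation folds."""
    return [list(range(i, n, k)) for i in range(k)]


def grid_search(param_grid: Dict[str, Sequence],
                cv_loss: Callable[[Dict, List[int], List[int]], float],
                n_samples: int, k: int = 5) -> Dict:
    """Return the parameter setting with the lowest mean validation loss."""
    folds = k_fold_indices(n_samples, k)
    names = sorted(param_grid)
    best_params, best_loss = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        losses = []
        for val_idx in folds:
            held_out = set(val_idx)
            train_idx = [i for i in range(n_samples) if i not in held_out]
            losses.append(cv_loss(params, train_idx, val_idx))
        mean_loss = sum(losses) / len(losses)
        if mean_loss < best_loss:
            best_params, best_loss = params, mean_loss
    return best_params


# Toy loss with a known minimum at max_depth=4, eta=0.1; a real cv_loss
# would fit a model on train_idx and score it on val_idx.
def toy_loss(params, train_idx, val_idx):
    return (params["max_depth"] - 4) ** 2 + abs(params["eta"] - 0.1)


best = grid_search({"max_depth": [2, 4, 6], "eta": [0.05, 0.1, 0.3]},
                   toy_loss, n_samples=20, k=5)
```

The search is run once on the initial weighted training set; the selected setting plays the role of the fixed hyperparameters reused in every later reweighting iteration.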
Step 3, predicting on the customers to be identified with the trained out-of-service worker identification model, and identifying the out-of-service workers:
Referring to FIG. 4, a list of customers to be identified is first determined; then the 62 customer features required by the out-of-service worker identification model are extracted from the customer data of each customer to be identified and input into the model, which outputs whether the customer to be identified is an out-of-service worker.
The present invention also provides a computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of the above embodiment when executing the computer program.
The present invention also provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method of the above embodiment.
The above embodiments are preferred embodiments of the present application, and those skilled in the art can make various changes or modifications without departing from the general concept of the present application, and such changes or modifications should fall within the scope of the claims of the present application.

Claims (10)

1. A bank data-based method for identifying a person who goes out of service is characterized by comprising the following steps:
step 1, constructing a supervision sample set;
acquiring a large amount of customer data from a bank, extracting from the customer data the customer features related to the out-of-service label, using the customer features as the feature vectors of the samples and the out-of-service label registered at account opening as the sample label, and constructing a weakly supervised sample set $W_D$;
selecting some of the samples from the weakly supervised sample set, manually verifying whether the corresponding customers have actually gone out for work, and using the verification result as the sample label to construct a strongly supervised sample set $S_D$;
step 2, constructing a classification model and training it with the weakly supervised sample set $W_D$ and the strongly supervised sample set $S_D$ to obtain an out-of-service worker identification model;
step 3, extracting the customer features from the customer data of a customer to be identified, inputting them into the out-of-service worker identification model, and outputting whether the customer to be identified is an out-of-service worker.
2. The method of claim 1, wherein the customer features include four types of features: customer base attributes, transaction data, location-related transaction data, and asset liability information; the customer base attributes comprise the gender, age and account age of the customer; the transaction data comprises income and consumption data and offline deposit and withdrawal data within a preset time period; the location-related transaction data refers to income and consumption data from other places; the asset liability information includes time deposit, demand deposit and loan information.
3. The method of claim 2, wherein, among the customer base attribute features, if in the customer data the gender is unknown, the age exceeds its preset value, or the account age exceeds its preset value, the corresponding feature value in the customer features is set as missing; and among the transaction data features and the location-related transaction data features, the value obtained by applying a logarithmic transformation to the actual amount is used as the corresponding feature value.
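As a minimal sketch of the preprocessing described in claim 3, the gender and age checks and the logarithmic transformation of amounts might look as follows. The threshold `MAX_PLAUSIBLE_AGE`, the gender codes `"M"` and `"F"`, and the use of `log1p` (so that a zero amount maps to zero) are assumptions made for illustration; the claim does not specify them.

```python
import math

# Hypothetical preset value; the claims do not state the actual threshold.
MAX_PLAUSIBLE_AGE = 100


def clean_gender(gender):
    """Return the gender code, or None (missing) when it is unknown."""
    return gender if gender in ("M", "F") else None


def clean_age(age, max_age=MAX_PLAUSIBLE_AGE):
    """Return the age, or None (missing) when absent or above the preset value."""
    if age is None or age > max_age:
        return None
    return age


def log_amount(amount):
    """Log-transform a non-negative amount.

    log1p(x) = log(1 + x) is an assumption of this sketch; the claim only
    says that a logarithmic transformation is applied.
    """
    return math.log1p(amount)
```

Returning `None` for implausible values leaves the handling of missing data to the downstream learner; gradient-boosted tree libraries such as XGBoost can consume missing values directly once they are encoded as NaN.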
4. The method according to claim 1, wherein, when the customer data of the bank is acquired in step 1, customer data whose account opening time falls within a preset range is selected.
5. The method according to any one of claims 1-4, characterized in that the specific steps of training the classification model with the weakly supervised sample set $W_D$ and the strongly supervised sample set $S_D$ are as follows:
step 2.1, assigning the samples in the strongly supervised sample set $S_D$ and the weakly supervised sample set $W_D$ the weights $w_s$ and $w_w$ respectively, with $w_s > w_w$, and then merging them into a training set;
step 2.2, selecting the XGBoost algorithm, determining the hyperparameters of the XGBoost algorithm [Figure FDA0002767040420000011] by cross-validation and grid search on the training set, and constructing a classification model $xgb_0$;
step 2.3, for each of multiple weight combinations of the strongly supervised sample set $S_D$ and the weakly supervised sample set $W_D$, identifying the noise samples in the weakly supervised sample set by a confident learning algorithm from the prediction results of the classification model and the sample labels, then updating the weights of the noise samples, and obtaining through multiple iterations the optimal weighted sample set and classification model under each weight combination;
step 2.4, calculating the evaluation index of the identification model under all weight combinations to determine the optimal weight combination of the strongly supervised sample set $S_D$ and the weakly supervised sample set $W_D$, and training a classification model with the XGBoost algorithm on the training set under that weight combination to obtain the final out-of-service worker identification model.
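Step 2.1 of claim 5 can be illustrated with a small sketch that merges the two sample sets and attaches a per-sample weight. The function name `merge_weighted` and the representation of a sample as a `(features, label)` pair are assumptions made for the example.

```python
def merge_weighted(strong, weak, w_s, w_w):
    """Merge strongly and weakly supervised samples into one weighted training set.

    strong, weak: lists of (features, label) pairs (hypothetical layout).
    w_s, w_w: weights for strong and weak samples; the claim requires w_s > w_w.
    Returns three parallel lists: features, labels, weights.
    """
    if not w_s > w_w:
        raise ValueError("the strong-sample weight must exceed the weak-sample weight")
    features, labels, weights = [], [], []
    for samples, weight in ((strong, w_s), (weak, w_w)):
        for x, y in samples:
            features.append(x)
            labels.append(y)
            weights.append(weight)
    return features, labels, weights
```

With the XGBoost scikit-learn wrapper, a weight list of this kind would typically be passed as the `sample_weight` argument of `XGBClassifier.fit`, so that manually verified samples influence the loss more than samples carrying only the label registered at account opening.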
6. The method according to claim 5, characterized in that the specific steps of step 2.3 are:
step 2.3.1: setting multiple weight combinations $W = \{(w_{s1}, w_{w1}), (w_{s2}, w_{w2}), \ldots\}$;
step 2.3.2: selecting one weight combination $(w_{si}, w_{wi})$ as the initial weights $(w_{si}^{(0)}, w_{wi}^{(0)})$ of the respective samples;
step 2.3.3: training the classification model $xgb_0$ on the sample-weighted training set to obtain a classification model [Figure FDA0002767040420000021];
step 2.3.4: performing confident learning with the prediction results [Figure FDA0002767040420000022] of the classification model and the original labels $y$, and calculating the noise sample set [Figure FDA0002767040420000023], namely [Figure FDA0002767040420000024], where [Figure FDA0002767040420000025]; [Figure FDA0002767040420000026] is the noise sample subset of the weakly supervised sample set obtained at the $t$-th weight iteration under the $i$-th weight combination, and [Figure FDA0002767040420000027] is the noise sample subset of the strongly supervised sample set obtained at the $t$-th weight iteration under the $i$-th weight combination;
step 2.3.5: updating the weights of the noise sample subset [Figure FDA0002767040420000028]: for the $j$-th noise sample [Figure FDA00027670404200000210] in the noise sample subset [Figure FDA0002767040420000029], its weight $w_{ij}^{(t)}$ is updated at the $t$-th iteration as $w_{ij}^{(t)} \leftarrow w_{ij}^{(0)} \times \alpha^{t} = w_{ij}^{(t-1)} \times \alpha$, where $t$ is the current iteration number and $\alpha$ is the weight decay coefficient, with $0 < \alpha < 1$;
step 2.3.6: repeating steps 2.3.3-2.3.5 for the current weight combination until the evaluation index of the classification model converges, that is, until the evaluation index no longer changes as the weights are updated, to obtain the optimal classification model [Figure FDA00027670404200000211] under the current weight combination and the corresponding evaluation index value, where $T$ is the number of iterations at which the evaluation index converges;
step 2.3.7: selecting different weight combinations and carrying out steps 2.3.3-2.3.6 for each; and selecting, from all weight combinations, the classification model with the highest evaluation index as the final out-of-service worker identification model.
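The weight update of step 2.3.5 and the convergence loop of step 2.3.6 can be sketched as below. The callbacks `fit_and_score` and `find_noise`, the tolerance `tol`, and the cap `max_iter` are hypothetical placeholders introduced for the sketch; the claim itself only fixes the multiplicative decay by a coefficient between 0 and 1 and the stopping rule that the evaluation index no longer changes.

```python
def decay_noise_weights(weights, noise_indices, alpha):
    """Return a copy of weights with each flagged sample's weight multiplied by alpha.

    A sample flagged in every one of t iterations ends with w0 * alpha**t,
    matching the update rule of step 2.3.5.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must satisfy 0 < alpha < 1")
    updated = list(weights)
    for j in noise_indices:
        updated[j] *= alpha
    return updated


def iterate_weights(initial_weights, fit_and_score, find_noise, alpha,
                    tol=1e-6, max_iter=50):
    """Repeat train, detect noise, decay weights until the score stops changing.

    fit_and_score(weights) -> (model, score) and find_noise(model) -> indices
    are hypothetical callbacks supplied by the caller.
    Returns (model, score, iterations_used).
    """
    weights = list(initial_weights)
    previous = None
    for t in range(1, max_iter + 1):
        model, score = fit_and_score(weights)
        if previous is not None and abs(score - previous) <= tol:
            return model, score, t
        previous = score
        weights = decay_noise_weights(weights, find_noise(model), alpha)
    return model, score, max_iter
```

Running this loop once per weight combination and keeping the combination with the highest converged score corresponds to step 2.3.7.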
7. The method of claim 6, wherein the evaluation index is classification accuracy.
8. The method according to claim 6, characterized in that the confident learning algorithm used in step 2.3.4 is based on the class-conditional noise process assumption, under which the label noise depends only on the latent correct class and not on the data, and identifies noisy sample labels by estimating the conditional probability distribution between the predicted labels and the latent correct sample labels.
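The noise identification described in claim 8 is commonly known as confident learning. The sketch below is a simplified version of its per-class thresholding idea, not the patent's exact procedure: the threshold for a class is taken as the mean predicted probability of that class over the samples labelled with it, and a sample is flagged when its most probable class among those reaching their threshold differs from its given label. The function name and the input layout are assumptions.

```python
def find_label_noise(probs, labels):
    """Flag samples whose given label disagrees with a confidently predicted class.

    probs: list of per-class probability lists, one per sample.
    labels: given (possibly noisy) integer class labels.
    Assumes every class has at least one sample carrying that label.
    Returns the indices of samples flagged as label noise.
    """
    n_classes = len(probs[0])
    # Per-class threshold: mean predicted probability of class k among
    # the samples whose given label is k.
    thresholds = []
    for k in range(n_classes):
        class_probs = [p[k] for p, y in zip(probs, labels) if y == k]
        thresholds.append(sum(class_probs) / len(class_probs))
    noisy = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        # Classes the model predicts with at least class-average confidence.
        confident = [k for k in range(n_classes) if p[k] >= thresholds[k]]
        if confident:
            likely = max(confident, key=lambda k: p[k])
            if likely != y:
                noisy.append(i)
    return noisy
```

Samples for which no class reaches its threshold are left unflagged, which keeps the procedure conservative when the model is uncertain.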
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 8 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 8.
CN202011237043.3A 2020-11-09 2020-11-09 Method, equipment and medium for identifying out-of-service worker based on bank data Pending CN112308705A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011237043.3A CN112308705A (en) 2020-11-09 2020-11-09 Method, equipment and medium for identifying out-of-service worker based on bank data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011237043.3A CN112308705A (en) 2020-11-09 2020-11-09 Method, equipment and medium for identifying out-of-service worker based on bank data

Publications (1)

Publication Number Publication Date
CN112308705A true CN112308705A (en) 2021-02-02

Family

ID=74326531

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011237043.3A Pending CN112308705A (en) 2020-11-09 2020-11-09 Method, equipment and medium for identifying out-of-service worker based on bank data

Country Status (1)

Country Link
CN (1) CN112308705A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115496105A (en) * 2022-09-28 2022-12-20 广东省新黄埔中医药联合创新研究院 Sleep prediction model training method, sleep condition prediction method and related device
CN115496105B (en) * 2022-09-28 2023-10-24 广东省新黄埔中医药联合创新研究院 Sleep prediction model training method, sleep condition prediction method and related devices

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210202