CN111612519A

CN111612519A - Method, device and storage medium for identifying potential customers of financial product

Info

Publication number: CN111612519A
Application number: CN202010287989.4A
Authority: CN
Inventors: 张琦; 薛毅; 陶多秀; 郑金涛; 方伟; 陈强; 曾杰鹏
Original assignee: Gf Securities Co ltd
Current assignee: Gf Securities Co ltd
Priority date: 2020-04-13
Filing date: 2020-04-13
Publication date: 2020-09-01
Anticipated expiration: 2040-04-13
Also published as: CN111612519B

Abstract

The embodiment of the invention provides a method for identifying potential customers of financial products, comprising the following steps: screening out, from the original customers, all customers with subscription and purchase records within a set time period as first target customers; extracting user features of the first target customers; performing feature engineering on the user features of the first target customers to establish an index system; performing feature crossing on the user features of the first target customers to establish a feature system; grading all original customers through a PU-learning algorithm according to their similarity to the index system and the feature system; and reordering all graded customers by their preference for a new product according to the similarity between the new product and the products in each customer's historical subscription and purchase records, and the transaction amount. The method addresses the cold-start and data-sparsity problems of collaborative filtering algorithms and can improve marketing conversion efficiency.

Description

Method, device and storage medium for identifying potential customers of financial product

Technical Field

The present invention relates to the field of machine learning, and in particular, to a method, an apparatus, and a storage medium for identifying potential customers of financial products.

Background

In the retail business of a securities company, various products and services are provided for different types of customers; for mass customers these are mainly a wide range of wealth management products such as public funds, asset management plans, and cash management products. Following the Guiding Opinions on Regulating the Asset Management Business of Financial Institutions, published on April 27, 2019, discovering a potential matching customer group from the attributes of a product and converting that group into target customers presents both great challenges and great opportunities. On the one hand, although securities companies have accumulated massive customer data (customer attributes, transaction behaviors, and the like), inactive customers make up a large proportion and usually have relatively little associated information. On the other hand, matching between products and customers must be carried out on the premise of compliance, so the base of directly targetable customers for a product is small. Effectively identifying the potential customer groups of financial products by technical means enables precision marketing for those products and supports the transformation of the retail business toward wealth management.

At present, precision marketing of products in the financial industry is mainly realized through collaborative filtering, which comprises two main algorithms: customer-based collaborative filtering and product-based collaborative filtering. The customer-based algorithm recommends products to a target customer according to the products that similar customers have purchased or favored in the past; the product-based algorithm recommends products to a group of customers with similar product preferences. In addition, there is a context-based recommendation algorithm, which recommends products with similar attributes according to the customer's own preferences. For example, if a customer browses stock-related information more often than bond-related information, and both stock funds and bond funds are available, stock funds are recommended to that customer first.

However, there is a cold start problem in the collaborative filtering algorithm.

First, collaborative filtering algorithms cannot recommend financial products to customers who have never purchased a product. Both customer-based and product-based collaborative filtering recommend only to customers who already have purchase records, so potential customers who have not purchased a product but may wish to do so cannot be reached.

Secondly, collaborative filtering does not handle sparse data well. In this business, the customers who purchase financial products are a small share of all customers, and low transaction frequency makes preference data scarce. The customer-product purchase matrix is therefore extremely sparse, the candidate set for collaborative filtering is small, and the recommendation effect is poor. By its nature, collaborative filtering is a solution for existing customers and existing products.

The context-based recommendation algorithm requires additional data, such as customer behavior in a specific scenario, which is difficult to collect. It also has difficulty accurately identifying potential customers for newly issued financial products, so such schemes lack generality and are hard to roll out widely.

Disclosure of Invention

The invention aims to provide a method, a device and a storage medium for identifying potential customers of financial products, which solve the problems of cold start and data sparsity in a collaborative filtering algorithm and can also improve the conversion rate and the product marketing conversion efficiency.

In order to solve the technical problem, an embodiment of the present invention provides a method for identifying potential customers of a financial product, where the method includes:

screening out, from the original customers, all customers with subscription and purchase records within a set time period as first target customers;

extracting user characteristics of the first target customer; the user characteristics of the first target customer include: a social attribute, an asset attribute, and a transaction attribute;

performing feature engineering processing on the user features of the first target customer, including: carrying out feature discretization on the user features of the first target client, and establishing an index system;

performing characteristic crossing on the user characteristics of the first target client to establish a characteristic system;

grading all original customers through a two-step method of a PU-learning algorithm according to the similarity of the index system and the feature system; wherein the PU-learning algorithm comprises determining negative samples and training a classifier; the classifier is a logistic regression classifier;

reordering all graded customers by their preference for a new product according to the similarity between the new product and the products in each customer's historical subscription and purchase records, and the transaction amount.

Further, grading all original customers through a two-step method of a PU-learning algorithm according to the similarity of the index system and the feature system;

specifically, according to the similarity between the original customer and the index system and the similarity between the original customer and the feature system, the probability that each original customer is a positive sample is calculated by using the logistic regression classification model; wherein the positive sample is a first target customer;

all original customers are ranked according to the probability that each original customer is a positive sample.

Further, all graded customers are reordered by their preference for the new product according to the similarity between the new product and the products in each customer's historical subscription and purchase records, and the transaction amount; specifically, the similarity between the new product and each product in every original customer's historical subscription and purchase records is calculated according to risk level, asset type, investment term type, and investment variety type, and the transaction amount is preprocessed to serve as weight information, yielding an additional score;

adding the probability that each original customer is a positive sample and the extra score of each original customer;

and sorting the addition results in a descending order from large to small so as to realize the reordering of all the graded clients.

Further, an objective function for training the classifier is determined by the following formula:

wherein r = 0.2, α is a hyperparameter, the regularization terms use the L1 and L2 norms, θ is the model weight parameter vector, MSE(θ) is the mean square error between the model's predicted values and the actual values, and the purpose of training is to solve for the θ that minimizes J(θ).

Further, the social attributes include age, gender, and education level;

the asset attributes include: average net asset yield, average position of all products, average position of stocks;

the transaction attributes include: period profit and loss rate, period turnover rate, risk level, investment term, investment variety, and whether each of 4 services has been activated, wherein the 4 services are margin trading and securities lending, the STAR Market, Hong Kong Stock Connect, and individual stock options.

An embodiment of the present invention further provides an apparatus for identifying potential customers of a financial product, including:

the screening module is used for screening out, from the original customers, all customers with subscription and purchase records within a set time period as first target customers;

the user characteristic extraction module is used for extracting the user characteristics of the first target client; the user characteristics of the first target customer include: a social attribute, an asset attribute, and a transaction attribute;

a feature engineering processing module, configured to perform feature engineering processing on the user feature of the first target client, including: carrying out feature discretization on the user features of the first target client, and establishing an index system; performing characteristic crossing on the user characteristics of the first target client to establish a characteristic system;

the classification module is used for classifying all customers through a two-step method of a PU-learning algorithm according to the similarity between the index system and the feature system; wherein the PU-learning algorithm comprises determining negative samples and training a classifier; the classifier is a logistic regression classifier;

and the sorting module is used for reordering all graded customers by their preference for the new product according to the similarity between the new product and the products in each customer's historical subscription and purchase records, and the transaction amount.

Further, all graded customers are reordered according to the similarity between the new product and the products in each customer's historical subscription and purchase records, and the transaction amount; specifically, the similarity between the new product and each product in every original customer's historical subscription and purchase records is calculated according to risk level, asset type, investment term type, and investment variety type, and the transaction amount is preprocessed to serve as weight information, yielding an additional score;

The present invention also provides a computer-readable storage medium, wherein the computer-readable storage medium comprises a stored computer program, and wherein the computer program, when executed, controls an apparatus in which the computer-readable storage medium is located to perform the method for identifying potential customers of a financial product according to any one of the above.

According to the embodiment of the invention, the user features (social attributes, asset attributes, and transaction attributes) of all customers with subscription and purchase records within a set time period (the most recent period) are extracted, and the PU-learning algorithm solves the problem of labeling negative samples by assigning positive or negative labels to customers who have not purchased products. The features of customers with subscription and purchase records can thus be extended to all original customers (both those with and those without such records), and customers' investment preferences are graded according to feature similarity, so that customers with a higher likely conversion rate can be identified from their grades.

Since customers with a high conversion rate do not necessarily have a strong investment preference for every type of financial product, potential customers need to be further distinguished for a specific product in order to identify that product's potential customers accurately. According to the embodiment of the invention, after all original customers have been graded, all graded customers are reordered according to the similarity between the new product and the products in each customer's historical subscription and purchase records, and the transaction amount, so that the potential customers of a specific product can be accurately identified and marketing conversion efficiency is improved.

In addition, the embodiment of the invention alleviates the sparsity of the original feature data by discretizing the user features, and compensates for the limitations of a linear model by increasing the feature dimensionality through feature crossing.

The classifier used by the traditional PU-learning two-step method is a naive Bayes classifier, but naive Bayes makes strong assumptions about the features; by choosing a logistic regression classifier instead, the embodiment of the invention avoids the overfitting problem that the naive Bayes classifier may cause. The embodiment of the invention is also generally applicable.

Drawings

FIG. 1 is a flow chart illustrating a method for identifying potential customers of a financial product in accordance with an embodiment of the present invention;

FIG. 2 is a flowchart of the first stage of the PU-learning algorithm.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In the financial industry, at the business level, only customers who have purchased products, namely positive samples, exist; negative samples do not.

Learning from positive and unlabeled examples (PU learning or LPU learning) is a semi-supervised binary classification approach, mainly used when positive samples can be clearly determined but negative samples cannot.

Unlike the general classification problem, the size of P (positive sample) in the PU problem is usually quite small, and it is difficult to expand the positive sample set; while the size of U (unlabeled sample) is typically large, such as in web page classification, unidentified web page resources can be obtained from the network very cheaply and conveniently. The purpose of introducing U is to reduce the preparation workload of manual classification, improve the precision and achieve the effect of automatic classification as far as possible.

The PU-learning two-step method mainly comprises the following two steps:

(1) finding out a Reliable Negative sample set (RN for short) in the unlabelled sample set U according to the labeled positive sample P, and converting the PU problem into a two-classification problem;

(2) obtaining a binary classifier by iterative training on the positive samples and the negative samples.

Referring to fig. 1, based on the characteristics of the PU-learning algorithm, an embodiment of the present invention provides a method for identifying potential customers of a financial product, including:

S1, screening out, from the original customers, all customers with subscription and purchase records within a set time period as first target customers.

The original customers include both users with subscription and purchase records and users without such records.

Preferably, the period of time closest to the issue of the new product is selected as the set time period.
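The screening in S1 can be illustrated with a minimal sketch. The record layout, the customer ids, the window dates, and the function name `first_target_customers` are all hypothetical and chosen only for illustration; real data would come from the firm's own systems.

```python
from datetime import date

# Hypothetical subscription/purchase records: (customer_id, trade_date).
records = [
    ("c1", date(2020, 3, 20)),
    ("c2", date(2019, 6, 1)),
    ("c3", date(2020, 4, 2)),
    ("c1", date(2020, 4, 5)),
]
all_customers = {"c1", "c2", "c3", "c4"}  # c4 has no records at all


def first_target_customers(records, start, end):
    """Return ids of customers with at least one record in [start, end]."""
    return {cid for cid, day in records if start <= day <= end}


# Assumed window: roughly the three months before a hypothetical issue date.
positives = first_target_customers(records, date(2020, 1, 13), date(2020, 4, 13))
unlabeled = all_customers - positives  # everyone else stays unlabeled
print(sorted(positives), sorted(unlabeled))
```

Customers outside the window are not treated as negatives at this point; they simply remain unlabeled for the later PU-learning step.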

S2, extracting the user characteristics of the first target client; the user characteristics of the first target customer include: social attributes, asset attributes, and transaction attributes.

Preferably, the social attributes include age, gender, and education level. The asset attributes include: period average net asset yield, period average position of all products, and period average position of stocks. The transaction attributes include: period profit and loss rate, period turnover rate, risk level, investment term, investment variety, and whether each of 4 services has been activated, wherein the 4 services are margin trading and securities lending, the STAR Market, Hong Kong Stock Connect, and individual stock options.

S3, performing feature engineering processing on the user features of the first target client, wherein the feature engineering processing comprises the following steps: carrying out feature discretization on the user features of the first target client, and establishing an index system; and performing characteristic intersection on the user characteristics of the first target client to establish a characteristic system.

It should be noted that the basic assumption behind discretizing continuous features is that values of a continuous feature falling in different intervals contribute differently to the result. Some of the customer indexes are continuous, such as age, period average net asset yield, period profit and loss rate, period turnover rate, period average position of all products, and period average position of stocks. In the embodiment of the present invention, although these features are numerical, adding or subtracting their values has no practical meaning, so they are also treated as discrete features: they are discretized based on rules or on the shape of the data distribution (segment boundaries are determined from upper and lower quantiles) and converted into discrete indexes, while the remaining features, which are already discrete, are incorporated directly into the index system.
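As an illustration of distribution-based discretization, the following sketch bins a continuous feature at its quantiles. The choice of four equal-frequency bins, the toy ages, and the helper names `quantile_cuts` and `discretize` are assumptions made for this example; the text itself only states that segment boundaries are taken from quantiles or from rules.

```python
from bisect import bisect_right
from statistics import quantiles


def quantile_cuts(values, n_bins=4):
    """Return the n_bins - 1 cut points that split values into equal-frequency bins."""
    return quantiles(values, n=n_bins)


def discretize(value, cuts):
    """Map a continuous value to a bin index in 0..len(cuts)."""
    return bisect_right(cuts, value)


ages = [23, 31, 35, 42, 47, 52, 58, 64]  # toy data
cuts = quantile_cuts(ages, n_bins=4)
age_bins = [discretize(a, cuts) for a in ages]
print(cuts, age_bins)
```

Each bin index can then be one-hot encoded, so that the model learns a separate weight per interval rather than a single weight for the raw value.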

After feature discretization, features that become meaningful when combined are crossed. In theory, the purpose of crossing is to introduce interaction between features, that is, to introduce non-linearity. A cross can be introduced when combining the different values of the individual features yields something with practical meaning. Each new feature obtained carries the joint information of the interacting feature values, and more effective features can be developed in this way.

Preferably, based on a subjective judgment of the business scenario, the following features are mainly crossed:

i. three-term cross: age (5), gender (2), period average net asset yield (4);

ii. two-term cross: period turnover rate (5), period profit and loss rate (7), period average position of all products (4), and period average position of stocks (4);

according to the degree of the polynomial, and assuming X, Y, and Z each represent a feature, a second-degree (two-term) combination has the form X, Y, XY, and a third-degree (three-term) combination has the form X, Y, Z, XY, XZ, YZ, XYZ.

The remaining uncrossed features are:

education level (4), risk level (6), investment term (4), investment variety (4), and activation status of the 4 services (margin trading and securities lending, the STAR Market, Hong Kong Stock Connect, and individual stock options) (4), for a total of 22 features.

The final feature system then comprises: 99+189+ 22-289 features.

The number in parentheses after each feature, such as age (5) and gender (2), is the number of categories of that feature; for example, gender has the two categories male and female. The numbers in parentheses after the other features are to be read in the same way.
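The polynomial-style crossing described above (X, Y, XY for two terms; X, Y, Z, XY, XZ, YZ, XYZ for three) can be sketched as follows. The function name `cross_features`, the `max_order` argument, the feature names, and the `name=value` string encoding are hypothetical; the sketch lists which crossed features are active for one customer and makes no claim about the feature counts of the embodiment.

```python
from itertools import combinations


def cross_features(sample, names, max_order=None):
    """Return the active crossed features for one customer.

    sample maps feature name -> discrete category; names lists the features
    crossed together. Every combination of 1..max_order features yields one
    active crossed feature.
    """
    max_order = max_order or len(names)
    active = set()
    for order in range(1, max_order + 1):
        for combo in combinations(names, order):
            active.add("&".join(f"{n}={sample[n]}" for n in combo))
    return active


customer = {"age": 2, "gender": "F", "net_yield": 1}  # toy discretized values
three_term = cross_features(customer, ["age", "gender", "net_yield"])
two_term = cross_features(customer, ["age", "gender"])
print(sorted(three_term))
```

For a three-term cross the customer activates seven features (three single, three pairwise, one triple), matching the X, Y, Z, XY, XZ, YZ, XYZ pattern.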

S4, grading all original customers through a two-step method of a PU-learning algorithm according to the similarity between the index system and the characteristic system; wherein the PU-learning algorithm comprises determining negative samples and training a classifier; the classifier is a logistic regression classifier.

Before the PU-learning algorithm is applied, all customers with subscription and purchase records in the most recent period, i.e. the first target customers, are first labeled as positive samples, and the rest are unlabeled samples.

Specifically, according to the similarity between the original customer and the index system and the similarity between the original customer and the feature system, the probability that each original customer is a positive sample is calculated by using the logistic regression classification model;

all original customers are ranked according to the probability that each original customer is a positive sample.

In the embodiment of the present invention, as shown in FIG. 2, the first step of the PU-learning two-step method proceeds as follows: a portion S of the positive samples is first extracted from the positive sample set P and merged with the unlabeled set U to form a set MS; the remaining positive samples in P are labeled class c1 and the samples in MS are labeled class c2; using the Spy-EM algorithm, a threshold t on the negative-sample probability in c2 is determined from the distribution of S; and a reliable negative sample set N is then selected from MS, which determines the negative samples. Here M denotes the unlabeled sample set, S denotes the spy sample set, and t is a threshold determined from how the samples in S are scored by the classifier.

The key to finding a reliable negative sample set with the Spy-EM algorithm is that, in the initial state, a portion (e.g., 15%) of the positive sample set is extracted as "spies" and mixed with the unlabeled samples, and the combined set (U + S) is provisionally treated as the negative sample set when training the classifier.

In the second step, the Spy-EM algorithm continues: the spy samples S are put back into the positive sample set P, which together with the negative sample set N determined in the previous step forms the training set for a new classifier. The new classifier re-classifies U, and the samples classified as negative become the new N, so the contents of N change with each iteration. Training ends when the EM algorithm converges. Each of the n iterations thus produces a new classifier, and the final classifier is the one among these binary classifiers with the best accuracy under cross-validation.
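The two steps can be sketched in plain Python under several simplifying assumptions: the threshold t is taken to be the lowest score among the spies, the logistic regression is unregularized and fitted by batch gradient descent, the loop returns the last classifier instead of selecting one by cross-validation, and the data are toy 2-D points. All function names are hypothetical.

```python
import math
import random


def sigmoid(z):
    # Clamp z so math.exp cannot overflow on extreme scores.
    return 1.0 / (1.0 + math.exp(-max(-30.0, min(30.0, z))))


def predict(w, x):
    """Probability of the positive class; w[-1] is the bias term."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])


def train_logreg(X, y, epochs=300, lr=0.3):
    """Unregularized logistic regression fitted by batch gradient descent."""
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * gj / len(X) for wj, gj in zip(w, grad)]
    return w


def reliable_negatives(P, U, spy_frac=0.15, seed=0):
    """Step 1: hide spies S inside U, train (P - S) vs (U + S), threshold on spies."""
    idx = list(range(len(P)))
    random.Random(seed).shuffle(idx)
    n_spy = max(1, int(spy_frac * len(P)))
    spies = [P[i] for i in idx[:n_spy]]
    rest = [P[i] for i in idx[n_spy:]]
    w = train_logreg(rest + U + spies, [1] * len(rest) + [0] * (len(U) + n_spy))
    t = min(predict(w, s) for s in spies)  # assumed rule for the threshold t
    return [u for u in U if predict(w, u) < t]


def pu_two_step(P, U, max_rounds=5):
    """Step 2: retrain on P vs N and refresh N until it stops changing."""
    N = reliable_negatives(P, U)
    if not N:
        raise ValueError("no reliable negatives found")
    for _ in range(max_rounds):
        w = train_logreg(P + N, [1] * len(P) + [0] * len(N))
        new_N = [u for u in U if predict(w, u) < 0.5]
        if not new_N or new_N == N:
            break
        N = new_N
    return w


rng = random.Random(1)


def cloud(center, n):
    return [[rng.gauss(center, 0.5), rng.gauss(center, 0.5)] for _ in range(n)]


P = cloud(2.0, 40)                    # labeled positives
U = cloud(2.0, 20) + cloud(-2.0, 60)  # hidden positives mixed with true negatives
w = pu_two_step(P, U)
ranked = sorted(U, key=lambda u: predict(w, u), reverse=True)
```

Sorting U by the predicted probability is what produces the grading used later: unlabeled customers who resemble the positives rise to the top of `ranked`.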

The classifier used by the traditional PU-learning two-step method is a naive Bayes classifier. Naive Bayes classifies by computing the posterior probability that a sample belongs to each class, which requires the assumption that the features are independent of one another. In the embodiment of the invention, the original features are strongly correlated, so a naive Bayes classifier is not suitable.

The embodiment of the invention improves the PU-learning algorithm by using a logistic regression classifier in place of the naive Bayes classifier; as the classifier is trained, the parameter value corresponding to each variable directly reflects the importance of that feature to the overall classification.

Wherein, X is a training set sample, and omega is a parameter item.
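For reference, the textbook form of the logistic regression model in these symbols is given below. This is the standard form and is an assumption here, not a formula taken from the text; any bias term is assumed to be absorbed into ω.

```latex
P(y = 1 \mid X) = \frac{1}{1 + e^{-\omega^{\top} X}}
```

Under this form, a larger magnitude of a component of ω means the corresponding feature moves the predicted probability more strongly, which is the sense in which the parameter values reflect feature importance.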

In one preferred embodiment, the objective function for training the classifier is:

In the embodiment of the invention, regarding the regularization term: for a machine learning model it is very important to prevent overfitting, where a model fits the training set closely but generalizes poorly on the test set. To address overfitting effectively and have the algorithm converge at an appropriate precision, the parameters to be optimized are constrained by adding an elastic net regularization term to the objective (cost) function; that is, a combination of the L1 norm term (the regularization term of Lasso regression) and the L2 norm term (the regularization term of ridge regression) of the parameters serves as the cost of model complexity, thereby controlling the fitting precision.

The elastic net is a compromise between ridge regression and Lasso regression, controlled by the mixing ratio r:

when r is 0, the elastic net becomes ridge regression;

when r is 1, the elastic net becomes Lasso regression.

Lasso regression tends to eliminate unimportant weights, but it is unstable in models with strongly correlated features; ridge regression tends to shrink the feature parameters toward 0 but cannot eliminate unimportant features. In the embodiment of the invention, the value of r was finally determined to be 0.2.
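One common way to write an elastic-net objective with these symbols is J(θ) = MSE(θ) + r·α·Σ|θ_i| + ((1 − r)/2)·α·Σθ_i². The sketch below evaluates that expression; the factor 1/2 on the L2 term, the function name `elastic_net_objective`, and the choice to pass predictions in directly are assumptions made for illustration rather than details confirmed by the text.

```python
def elastic_net_objective(theta, preds, y, alpha, r=0.2):
    """Evaluate MSE + r*alpha*L1 + (1 - r)/2 * alpha * L2 (assumed form)."""
    mse = sum((p - yi) ** 2 for p, yi in zip(preds, y)) / len(y)
    l1 = sum(abs(t) for t in theta)   # Lasso-style penalty
    l2 = sum(t * t for t in theta)    # ridge-style penalty
    return mse + r * alpha * l1 + (1 - r) / 2 * alpha * l2


theta = [1.0, -2.0]
preds, y = [0.5, 0.5], [1, 0]
j_mixed = elastic_net_objective(theta, preds, y, alpha=0.1)        # r = 0.2
j_ridge = elastic_net_objective(theta, preds, y, alpha=0.1, r=0)   # pure L2
j_lasso = elastic_net_objective(theta, preds, y, alpha=0.1, r=1)   # pure L1
```

Setting r to 0 or 1 removes one of the two penalties, which reproduces the ridge and Lasso limits described above.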

Decision trees are also a common classification method. They refine the classification by adding feature nodes layer by layer, which has the advantage of mining detailed features of the data; however, once a level is formed, its relationship with other levels or nodes is cut off, and further mining can only continue locally. At the same time, because each split reduces the number of samples, testing multiple variables simultaneously cannot be supported. Moreover, samples that fall under the same node all receive the same classification probability, whereas the scenario of the present invention requires ranking all samples by classification probability.

S5, reordering all graded customers by their preference for the new product according to the similarity between the new product and the products in each customer's historical subscription and purchase records, and the transaction amount.

Specifically, the similarity between the newly issued product and each product in every original customer's historical subscription and purchase records is calculated according to risk level, asset type, investment term type, and investment variety type, and the transaction amount is preprocessed to serve as weight information f(amt_i), yielding an additional score (Score); wherein the additional score is calculated by the formula:

dist = ∑_j (a_j − b_j)², j = 1, 2, 3, 4

wherein product a is the newly issued product and product b is the product purchased in each transaction.

After obtaining the additional scores, the probability that each original customer is a positive sample and the additional scores of each original customer are added, and the addition results are sorted in descending order from large to small, so that all the graded customers are reordered.
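A minimal sketch of this reordering follows. Only the squared distance over the four product attributes comes from the text; the mapping from distance to similarity (1 / (1 + dist)), the weight f(amt) = log(1 + amt), the weighted-average form of the score, the numeric encoding of the attributes, and all names are assumptions made for illustration.

```python
import math


def dist(a, b):
    """Squared distance over the four encoded attributes:
    risk level, asset type, investment term type, investment variety type."""
    return sum((aj - bj) ** 2 for aj, bj in zip(a, b))


def extra_score(new_product, history):
    """history is a list of (product_vector, amount) pairs for one customer."""
    weights = [math.log1p(amt) for _, amt in history]  # assumed f(amt_i)
    total = sum(weights)
    if total == 0:
        return 0.0  # no purchase history: no extra score
    sims = [1.0 / (1.0 + dist(new_product, b)) for b, _ in history]  # assumed
    return sum(w * s for w, s in zip(weights, sims)) / total


def rerank(prob, histories, new_product):
    """Add the extra score to each probability and sort in descending order."""
    final = {
        cid: p + extra_score(new_product, histories.get(cid, []))
        for cid, p in prob.items()
    }
    return sorted(final, key=final.get, reverse=True)


new_product = (3, 1, 2, 1)                    # toy attribute encoding
prob = {"c1": 0.6, "c2": 0.7, "c3": 0.5}      # positive-sample probabilities
histories = {
    "c1": [((3, 1, 2, 1), 10000.0)],          # bought an identical product
    "c2": [((1, 2, 4, 3), 5000.0)],           # bought a dissimilar product
}
order = rerank(prob, histories, new_product)
print(order)
```

In this toy case c1 overtakes c2 despite a lower probability, because c1's purchase history matches the new product exactly.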

According to the embodiment of the invention, the user features (social attributes, asset attributes, and transaction attributes) of all customers with subscription and purchase records within a set time period (the most recent period) are extracted, and the PU-learning algorithm solves the problem of labeling negative samples by assigning positive or negative labels to customers who have not purchased products. The features of customers with subscription and purchase records can thus be extended to all original customers (both those with and those without such records), and the users are graded according to feature similarity, so that customers with a higher likely conversion rate can be identified from their grades.

Since customers with a high conversion rate do not necessarily have a strong investment preference for every type of financial product, potential customers need to be further distinguished for a specific product in order to identify that product's potential customers accurately. According to the embodiment of the invention, after all original customers have been graded, all graded customers are reordered according to the similarity between the newly issued product and the products in each subscription and purchase record, and the transaction amount, so that the potential customers of a specific product can be accurately identified and marketing conversion efficiency is improved.

the classification module is used for classifying all original customers through a two-step method of a PU-learning algorithm according to the similarity between the index system and the feature system; wherein the PU-learning algorithm comprises determining negative samples and training a classifier; the classifier is a logistic regression classifier;

and the sorting module is used for reordering all graded customers by their preference for the new product according to the similarity between the new product and the products in each customer's historical subscription and purchase records, and the transaction amount.

In a preferred embodiment, the method comprises the steps of grading all original customers according to the similarity with the index system and the feature system through a two-step method of a PU-learning algorithm;

specifically, according to the similarity between the original customer and the index system and the similarity between the original customer and the feature system, the probability that each original customer is a positive sample is calculated by using the logistic regression classification model; wherein the positive sample is a first target customer.

In one preferred embodiment, all graded customers are reordered according to the similarity between the new product and the products in each customer's historical subscription and purchase records, and the transaction amount; specifically, the similarity between the new product and each product in every original customer's historical subscription and purchase records is calculated according to risk level, asset type, investment term type, and investment variety type, and the transaction amount is preprocessed to serve as weight information, yielding an additional score;

and sorting the addition results from large to small in a descending order to realize the reordering of all the graded clients.

It should be understood that the apparatus of the present invention corresponds one-to-one to the above-mentioned method for identifying potential customers of financial products, i.e. the above-mentioned identification method is applicable to the apparatus, and therefore its details are not repeated here.

An embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium includes a stored computer program, where the computer program, when running, controls an apparatus on which the computer-readable storage medium is located to perform the method for identifying potential customers of a financial product as described above.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention.

Claims

1. A method of identifying potential customers of a financial product, comprising:

reordering all of said ranked customers by their preference for the new product, based on the similarity between the new product and the products in the customers' historical confirmation and subscription records and on the transaction amount.

2. The method of identifying potential customers of a financial product of claim 1, wherein all original customers are ranked according to their similarity to the index system and the feature system by a two-step method of the PU-learning algorithm;

3. The method of identifying potential customers of financial products according to claim 2, wherein all ranked customers are reordered by their preference for the new product based on the similarity between the new product and the products in the customers' historical confirmation and subscription records and on the transaction amount, in particular:

calculating the similarity between the newly issued product and the products in each original customer's historical confirmation and subscription records according to the risk level, the asset type, the investment term type and the investment variety type, and preprocessing the transaction amount as weight information to obtain a summed score;

4. The method of identifying potential customers of a financial product of any one of claims 1-3, wherein the objective function for training the classifier is determined by the formula:

5. The method of identifying potential customers of a financial product of claim 1, wherein the social attributes include age, gender, and education;

the transaction attributes include: period profit and loss rate, period turnover rate, risk level, investment term, investment variety, and the opening status of 4 services, wherein the 4 services are margin trading and securities lending, the STAR Market, Hong Kong Stock Connect, and individual stock options.

6. An apparatus for identifying potential customers of a financial product, comprising:

7. The apparatus for identifying potential customers of financial products of claim 6, wherein all original customers are ranked according to their similarity to the index system and the feature system by a two-step method of PU-learning algorithm;

8. The apparatus for identifying potential customers of financial products according to claim 7, wherein all the ranked customers are reordered according to the similarity between the new product and the products in the customers' historical confirmation and subscription records and the transaction amount; specifically, the similarity between the new product and the products in each original customer's historical confirmation and subscription records is calculated according to the risk level, the asset type, the investment term type and the investment variety type, and the transaction amount is preprocessed as weight information to obtain a summed score;

9. A computer-readable storage medium comprising a stored computer program, wherein the computer program, when executed, controls an apparatus on which the computer-readable storage medium resides to perform the method of identifying potential customers of a financial product of any one of claims 1-5.