WO2020114302A1 - Behavior prediction method - Google Patents

Behavior prediction method

Info

Publication number
WO2020114302A1
WO2020114302A1 · PCT/CN2019/121492 · CN2019121492W
Authority
WO
WIPO (PCT)
Prior art keywords
data
behavior
model
prediction method
behavior prediction
Prior art date
Application number
PCT/CN2019/121492
Other languages
French (fr)
Chinese (zh)
Inventor
梁栋
王珊珊
程慧涛
刘新
郑海荣
Original Assignee
深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院
Publication of WO2020114302A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"

Definitions

  • This application belongs to the field of information technology, and particularly relates to a behavior prediction method.
  • Feature encoding has a long history and is common in machine learning. It falls roughly into two categories: One-Hot Encoding and Label Encoding. The first suits unrelated data analyzed independently, since such encoding preserves the data's independent and identically distributed character; the second, Label Encoding, suits very large data sets, where it simplifies the data and guards against the curse of dimensionality.
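The two encoding families described above can be contrasted with a minimal sketch; the helper names and the category list here are illustrative, not from the patent.

```python
def one_hot(value, categories):
    """One-Hot Encoding: one independent binary indicator per category."""
    return [1 if value == c else 0 for c in categories]

def label_encode(value, categories):
    """Label Encoding: map each category to a small integer."""
    return categories.index(value)

colors = ["red", "green", "blue"]
print(one_hot("green", colors))      # [0, 1, 0]
print(label_encode("blue", colors))  # 2
```

One-hot output carries no ordering between categories, while the label code is a single number that does imply order and magnitude, which is why the two suit different kinds of data.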
  • Generative adversarial networks (GANs) are widely used in unsupervised machine learning algorithms.
  • the present application provides a behavior prediction method.
  • the method includes the following steps:
  • Step 1. Fuse One-Hot Encoding and Label Encoding into a multi-dimensional feature code;
  • Step 2. Represent the collected sample data as the multi-dimensional feature code of Step 1;
  • Step 3. Enrich the existing label data with a generative adversarial network;
  • Step 4. Integrate multiple models and train repeatedly to generate a weighting factor for each model; once an integrated model with weights is obtained, classify the data from Step 3;
  • Step 5. Output the predicted behavior.
  • the data in the One-Hot Encoding part of Step 1 is a series of same-attribute values expressed as binary digits; the data only records an objective fact and has no numerical meaning.
  • the data in the Label Encoding part of Step 1 represents a weight or value and has mathematical meaning; the data are correlated with one another and connected within a class; each value is expressed as a decimal number of at most two digits.
  • Step 3 comprises repeatedly generating fake data with a generator while a discriminator judges whether the generated data is fake, the two playing an ongoing game until fake data can no longer be distinguished from real data; the manufactured data is then used to balance the sample data set.
  • the discrimination formula for the manufactured data is: min_G max_D V(D, G) = E_{x~P_data}[log D(x)] + E_{z~P_z(z)}[log(1 - D(G(z)))], where D(x) is the probability, under the discriminator's judgment, that the data is taken from the original data; D(G(z)) is the probability, under the discriminator's judgment, that the data is taken from the generator; x ~ P_data means the data comes from the original data; z ~ P_z(z) means the data comes from the generator; E[·] denotes the expectation; and min_G max_D means that, for the current generator and discriminator, the discriminator Max(D) is maximized while the generator error Min(G) is minimized.
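The GAN value function described above can be evaluated numerically on hand-picked discriminator outputs; all probabilities below are invented for illustration, not taken from the patent.

```python
import math

def value_fn(d_real, d_fake):
    """V(D,G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))], as sample averages."""
    e_real = sum(math.log(p) for p in d_real) / len(d_real)
    e_fake = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return e_real + e_fake

# A confident discriminator (high D on real, low D on fake) pushes V up;
# at the game's equilibrium D outputs 0.5 everywhere and V = -2*ln(2).
print(value_fn([0.9, 0.8], [0.1, 0.2]))  # higher V: discriminator winning
print(value_fn([0.5, 0.5], [0.5, 0.5]))  # equilibrium, about -1.386
```

This matches the minimax reading: the discriminator is trained to raise V while the generator is trained to lower the second term.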
  • Step 4 comprises training different models on different data, finding the best-performing models across all trainings according to their classification accuracy, and assigning weight ratios from largest to smallest in order of accuracy.
  • the sum of the weight ratios is 1.
  • the best performing models include a classification decision tree model, a random forest model, an AdaBoost model, and an XGBoost model.
  • the model discriminant is: f(x) = (ω1θ1 + ω2θ2 + ω3θ3 + ω4θ4)/4, where ω1, ω2, ω3, ω4 are the weights assigned to the four models and θ1, θ2, θ3, θ4 are the prediction results obtained from the four models, respectively;
  • the behavior includes financial investment behavior.
  • the behavior prediction method provided in this application fuses the sample data's One-Hot Encoding and Label Encoding into a multi-dimensional feature code, then uses a generative adversarial network to enrich the existing label data, and finally classifies the data with multi-model fusion weights before output. This avoids one-size-fits-all data processing, makes full use of the data's effective features, and lets the generative adversarial network remedy sample imbalance, so that data classification is more accurate and user behavior is effectively predicted.
  • FIG. 1 is a flowchart of a behavior prediction method of the present application.
  • This application uses financial investment behavior as an example to illustrate:
  • This application first gives a data-based mixed feature encoding method. Considering the different scenarios in which the two encoding methods apply, the associations of the other data categories, both between classes and within classes, are carefully analyzed: data with no intra-class association and no inter-class influence is encoded with the One-Hot Encoding scheme, while the remaining data, which is interrelated and affected by its values, uses Label Encoding. The two codes are fused so that, for each individual, one long feature code sequence containing both One-Hot Encoding and Label Encoding parts is formed. The data is uniformly encoded and converted; with this encoding, the existing data can be analyzed uniformly without further transformation, and the converted data feeds directly into a classification algorithm for output.
  • sometimes the data itself is text, and binary digits are used to represent this series of same-attribute values; that is, the data itself has no mathematical property and is only represented by a code. For example, gender: male and female are represented by 10 and 00, respectively; the seven days of the week, Monday through Sunday, can be expressed as 000, 001, 010, 011, 100, 101, 110.
  • sometimes the data is a number but only records an objective fact with no numerical meaning: for example, ages 23, 25, and 62 can be represented by different combinations of 0s and 1s, namely the binary code corresponding to the decimal value; if the encoded widths differ, 0s are added to the high-order digits until all data of the same attribute have the same width. This method uses this encoding.
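The zero-padded binary coding described above can be sketched as follows; the ages are the ones from the text, while the helper name is ours.

```python
def binary_codes(values):
    """Binary-encode same-attribute values, left-padding with 0s so that
    every code has the width of the widest value."""
    width = max(v.bit_length() for v in values)
    return {v: format(v, "0%db" % width) for v in values}

print(binary_codes([23, 25, 62]))
# {23: '010111', 25: '011001', 62: '111110'}
```

All three codes come out six digits wide because 62 needs six binary digits, which is exactly the padding rule the text describes.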
  • when the data itself represents a weight or value, it is mathematically meaningful. For example, the number of bank cards a user holds: with seven possibilities 1, 2, 3, ..., 7, the Label Encoding is directly 1, 2, 3, ..., 7.
  • data encoded with Label Encoding must be interrelated and related within the class; for example, one user's behavior affects another user's behavior, so this encoding method is generally adopted in such cases.
  • data encoded with Label Encoding is expressed as a decimal number of at most two digits, i.e. up to 99, so the range is 0 to 99 (this is a requirement of this method).
  • this method specifies that the One-Hot Encoding part comes first and the Label Encoding part comes last.
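A minimal sketch of the fused multi-dimensional feature code, one-hot part first and label part last; the field meanings are placeholders chosen to echo the text's examples (gender code 10, weekday code 011, a card count, and a two-digit value), not data from the patent.

```python
# One-hot part: binary digits with no numerical meaning.
one_hot_part = [1, 0] + [0, 1, 1]   # e.g. gender "10" + weekday "011"
# Label part: decimal values in 0..99 with mathematical meaning.
label_part = [3, 42]                # e.g. number of bank cards, another value

feature_code = one_hot_part + label_part
print(feature_code)  # [1, 0, 0, 1, 1, 3, 42]
```

Concatenation keeps the two regimes separate inside one sequence, so independent fields stay independent while the correlated label fields keep their numeric relationships.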
  • this encoding method combines the two main encoding methods because of the two major characteristics of the data: One-Hot Encoding is used when there is no correlation between the data and the features are textual.
  • the number of digits (i.e. the length) of the code depends on the situation; there is no hard requirement, as long as the various features within a class can be distinguished, but the code length of features within a class must be the same.
  • this application considers that the amount of bank label data is small.
  • a generative adversarial network (abbreviated "GAN") can be used to generate abundant label data in the form of highly confusing fake samples; these fake samples make up for the shortage of labeled sample data.
  • a considerable number of fake labeled samples are generated with the GAN to balance the data. Experiments showed that data balance has a significant impact on the final result; balanced data clearly helps improve discrimination accuracy.
  • GAN usually consists of two parts, the first part is the generator and the second part is the discriminator.
  • the generator is used to generate fake data repeatedly, and the discriminator is used to identify whether the data given to it by the generator is fake data.
  • the two parts keep playing this game until the discriminator can no longer tell fake data from real data, at which point the "counterfeiting" process is complete.
  • after the coding shown in Table 1 is completed, n lines of codes as in Table 2 are generated; each line represents one user's feature code.
  • an m*n table is compiled, representing labeled data with m samples and n sub-categories; it is passed to the GAN network.
  • x is used to represent the data in this table.
  • the generator learns a data distribution P_g. Because the data distribution contains noise, a noise distribution function P_z(z) is defined to ensure the robustness of the final algorithm; with the network's original parameters θ_g, G(z; θ_g) is defined as a mapping onto the original data. This is the principle and method by which the generator produces fake data.
  • the discriminator D(x) indicates the probability that the data comes from x, and D is trained to maximize this ability, i.e. to identify with maximum probability whether data comes from its own training data set or from G(z); at the same time, G is trained to minimize log(1 - D(G(z))).
  • the innermost part of this formula concerns the generator: for the discriminator to maximize the formula, the inner D(G(z)) must be driven down, meaning the discriminator maximizes the probability of accurately identifying content from the generator. Combining the above two parts gives: min_G max_D V(D, G) = E_{x~P_data}[log D(x)] + E_{z~P_z(z)}[log(1 - D(G(z)))].
  • Multi-model fusion: there are many classification algorithms in machine learning; the models include the decision tree model, the random forest model, the AdaBoost model, and so on.
  • a variety of models are fused using voting rules and finally connected in parallel into one large classifier; the weighting method merges them into a strong model for classification.
  • data-sensitive models include support vector machines ("SVM") and linear regression models ("LR"); data-insensitive models include the decision tree model, the random forest model, and the like; the models with excellent performance in model integration include the AdaBoost algorithm and the XGBoost algorithm.
  • the classification decision tree model is a tree structure that describes the classification of instances.
  • the decision tree consists of nodes and directed edges. There are two types of nodes, internal nodes and leaf nodes: an internal node represents a feature or attribute, and a leaf node represents a class.
  • the main advantage of the decision tree model is that the model is readable and the classification speed is fast.
  • the decision tree learning algorithm usually selects optimal features recursively and segments the training data according to the optimal features, so that each sub-data set gets a best classification. The decision tree algorithm yields a prediction result θ1.
  • Random forest is a versatile machine learning algorithm: a classifier that uses multiple trees to train on and predict samples, and it can perform both regression and classification tasks. It is also one of the important methods in ensemble learning: it shines when several weak models are integrated into one efficient model, so that the final classification effect can exceed that of any single-model algorithm.
  • each splitting of a subtree in the random forest randomly selects certain features from all candidate features and then chooses the optimal feature among them, so that the decision trees in the random forest differ from one another, enhancing the diversity of the system and thereby improving classification performance.
  • the prediction result of this model is θ2.
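The per-split feature subsampling described above can be sketched as follows; the square-root subset size is a common convention, used here as an assumption since the patent does not specify one.

```python
import random

def split_candidates(feature_indices, rng):
    """Return the random subset of features a single split may consider."""
    k = max(1, int(len(feature_indices) ** 0.5))  # sqrt-size (assumption)
    return sorted(rng.sample(feature_indices, k))

rng = random.Random(0)
features = list(range(9))  # 9 candidate features, indices 0..8
print(split_candidates(features, rng))  # 3 distinct feature indices
```

Because each split draws its own subset, two trees grown on the same data rarely make the same sequence of splits, which is the diversity the text refers to.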
  • AdaBoost is an iterative algorithm whose core idea is to train different classifiers (weak classifiers) against the same training set, and then combine these weak classifiers to form a stronger final classifier (strong classifier).
  • AdaBoost learns a basic classifier G_t(x) from the training data weighted by the current distribution D_t(x), and computes the coefficient α_t of G_t(x), where α_t represents the importance of G_t(x) in the final classifier. A linear combination of the basic classifiers is then constructed: f(x) = Σ_t α_t G_t(x).
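The linear combination above can be sketched with hand-picked weak classifiers; the decision stumps and coefficients here are invented for illustration, not learned as real AdaBoost would learn them.

```python
def ada_combine(x, stumps, alphas):
    """Sign of the weighted vote sum_t alpha_t * G_t(x), with G_t in {-1, +1}."""
    score = sum(a * g(x) for g, a in zip(stumps, alphas))
    return 1 if score >= 0 else -1

stumps = [lambda x: 1 if x > 2 else -1,   # illustrative weak classifier 1
          lambda x: 1 if x > 5 else -1]   # illustrative weak classifier 2
alphas = [0.7, 0.3]                       # illustrative importance weights

print(ada_combine(6, stumps, alphas))  # both stumps vote +1 -> 1
print(ada_combine(1, stumps, alphas))  # both vote -1 -> -1
print(ada_combine(4, stumps, alphas))  # 0.7 - 0.3 = 0.4 -> 1
```

The third case shows why the coefficients matter: the more important stump outvotes the weaker one when they disagree.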
  • the XGBoost algorithm is a tree-based boosting algorithm. Its biggest feature is that it can automatically use the CPU's multithreading for parallelization while also improving the algorithm for better accuracy. XGBoost is used to obtain the prediction result θ4.
  • the four models are assigned weights ω1, ω2, ω3, ω4 and produce prediction results θ1, θ2, θ3, θ4; the final judgment is then: f(x) = (ω1θ1 + ω2θ2 + ω3θ3 + ω4θ4)/4.
  • if the value of f(x) exceeds the set threshold, the sample is judged a positive example; otherwise it is a negative example, completing the judgment. (The threshold is set manually; generally a value above 0.7 is considered credible.)
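The fusion discriminant and threshold step can be sketched directly; the weights, per-model predictions, and threshold below are invented numbers. Note one assumption we make explicit: with weights summing to 1 and predictions in [0, 1], the division by 4 in the formula as written caps f(x) at 0.25, so this illustration uses a threshold below that rather than the 0.7 mentioned in the text.

```python
def fuse(weights, preds, threshold):
    """f(x) = (w1*t1 + w2*t2 + w3*t3 + w4*t4) / 4, then threshold it."""
    f = sum(w * t for w, t in zip(weights, preds)) / 4.0
    return f, ("positive" if f > threshold else "negative")

weights = [0.4, 0.3, 0.2, 0.1]  # summing to 1, ordered by model accuracy
preds = [1.0, 1.0, 1.0, 0.0]    # theta_i from the four models (invented)
print(fuse(weights, preds, 0.2))  # (0.225, 'positive')
```

Three of the four models, carrying 0.9 of the weight, vote positive, so the fused score clears the example threshold.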
  • the data structure mixing the original text and numbers is unified into one line of multi-dimensional feature code; under the mixed text-and-number data structure, the data is unified into same-attribute data that the classifier can process together.
  • in this feature code the One-Hot Encoding part comes first and the Label Encoding part comes after, and the association of the original data is not destroyed: independent data remains independent, and related data keeps its association.
  • the generated multi-dimensional feature codes are used in the GAN to continuously create "labeled" data, which counteracts the low accuracy caused by the imbalance of positive and negative samples in the sample set.
  • the behavior prediction method provided in this application fuses the sample data's One-Hot Encoding and Label Encoding into a multi-dimensional feature code, then uses a generative adversarial network to enrich the existing label data, and finally classifies the data with multi-model fusion weights before output. This avoids one-size-fits-all data processing, makes full use of the data's effective features, and lets the generative adversarial network remedy sample imbalance, so that data classification is more accurate and user behavior is effectively predicted.

Abstract

Disclosed is a behavior prediction method, belonging to the field of information technology. User behavior is predicted from data; however, the attributes of the existing data differ completely, and the data may in fact have no association, so a one-size-fits-all data processing method is currently unsuited to precise prediction under big data. The method comprises: step 1, fusing One-Hot Encoding and Label Encoding into a multi-dimensional feature code; step 2, representing collected sample data as the multi-dimensional feature code of step 1; step 3, enriching existing label data with a generative adversarial network; step 4, integrating a plurality of models, training repeatedly to generate a weight factor for each model, and, after an integrated model with weights is obtained, classifying the data obtained in step 3; and step 5, outputting a predicted behavior. By means of this method, data classification is more accurate and user behavior is effectively predicted.

Description

A behavior prediction method

Technical Field

This application belongs to the field of information technology, and particularly relates to a behavior prediction method.

Background Art

Feature encoding has a long history and is common in machine learning. It falls roughly into two categories: One-Hot Encoding and Label Encoding. The first suits unrelated data analyzed independently, since such encoding preserves the data's independent and identically distributed character; the second, Label Encoding, suits very large data sets, where it simplifies the data and guards against the curse of dimensionality. Generative adversarial networks (GANs) are widely used in unsupervised machine learning algorithms.

User behavior is predicted from data, but much of the data now available consists of objective user-attribute data and other behavior data. The attributes of these data differ completely and cannot be well unified; converting them into decimal numbers forcibly imposes a numerical relationship on data that may in fact be unrelated. Such a one-size-fits-all treatment is unsuited to accurate prediction under today's big data.
Summary of the Invention

1. Technical problem to be solved

Given that user behavior is predicted from data, yet the available data consists largely of objective user-attribute data and other behavior data whose attributes differ completely and cannot be well unified, and that converting them into decimal numbers forcibly imposes numerical relationships on possibly unrelated data, a one-size-fits-all data processing method is unsuited to accurate prediction under big data; this application therefore provides a behavior prediction method.

2. Technical solution

To achieve the above objective, the present application provides a behavior prediction method comprising the following steps:

Step 1. Fuse One-Hot Encoding and Label Encoding into a multi-dimensional feature code;

Step 2. Represent the collected sample data as the multi-dimensional feature code of Step 1;

Step 3. Enrich the existing label data with a generative adversarial network;

Step 4. Integrate multiple models and train repeatedly to generate a weighting factor for each model; once an integrated model with weights is obtained, classify the data from Step 3;

Step 5. Output the predicted behavior.
Optionally, the data in the One-Hot Encoding part of Step 1 is a series of same-attribute values expressed as binary digits; the data only records an objective fact and has no numerical meaning.

Optionally, the data in the Label Encoding part of Step 1 represents a weight or value and has mathematical meaning; the data are correlated with one another and connected within a class; each value is expressed as a decimal number of at most two digits.

Optionally, Step 3 comprises repeatedly generating fake data with a generator while a discriminator judges whether the generated data is fake, the two playing an ongoing game until fake data can no longer be distinguished from real data; the manufactured data is then used to balance the sample data set.

Optionally, the discrimination formula for the manufactured data is:

min_G max_D V(D, G) = E_{x~P_data}[log D(x)] + E_{z~P_z(z)}[log(1 - D(G(z)))]

where D(x) is the probability, under the discriminator's judgment, that the data is taken from the original data; D(G(z)) is the probability, under the discriminator's judgment, that the data is taken from the generator; x ~ P_data means the data comes from the original data; z ~ P_z(z) means the data comes from the generator; E[·] denotes the expectation; and min_G max_D means that, for the current generator and discriminator, the discriminator Max(D) is maximized while the generator error Min(G) is minimized.
Optionally, Step 4 comprises training different models on different data, finding the best-performing models across all trainings according to their classification accuracy, and assigning weight ratios from largest to smallest in order of accuracy.

Optionally, the sum of the weight ratios is 1.

Optionally, the best-performing models include a classification decision tree model, a random forest model, an AdaBoost model, and an XGBoost model.

Optionally, the model discriminant is:

f(x) = (ω1θ1 + ω2θ2 + ω3θ3 + ω4θ4)/4

where ω1, ω2, ω3, ω4 are the weights assigned to the four models and θ1, θ2, θ3, θ4 are the prediction results obtained from the four models, respectively; if the value of f(x) exceeds the set threshold, the sample is judged a positive example, otherwise a negative example.

Optionally, the behavior includes financial investment behavior.
3. Beneficial effects

Compared with the prior art, the behavior prediction method provided by this application has the following beneficial effects: the method fuses the sample data's One-Hot Encoding and Label Encoding into a multi-dimensional feature code, then uses a generative adversarial network to enrich the existing label data, and finally classifies the data with multi-model fusion weights before output. This avoids one-size-fits-all data processing, makes full use of the data's effective features, and lets the generative adversarial network remedy sample imbalance, so that data classification is more accurate and user behavior is effectively predicted.

Brief Description of the Drawings

FIG. 1 is a flowchart of the behavior prediction method of the present application.

Detailed Description

Hereinafter, specific embodiments of the present application are described in detail with reference to the accompanying drawings; from these detailed descriptions, those skilled in the art can clearly understand and implement the present application. Without departing from the principles of the present application, features of different embodiments may be combined to obtain new implementations, or certain features of certain embodiments may be replaced to obtain other preferred implementations.
Referring to FIG. 1, this application provides a behavior prediction method comprising the following steps:

Step 1. Fuse One-Hot Encoding and Label Encoding into a multi-dimensional feature code;

Step 2. Represent the collected sample data as the multi-dimensional feature code of Step 1;

Step 3. Enrich the existing label data with a generative adversarial network;

Step 4. Integrate multiple models and train repeatedly to generate a weighting factor for each model; once an integrated model with weights is obtained, classify the data from Step 3;

Step 5. Output the predicted behavior.

Optionally, the data in the One-Hot Encoding part of Step 1 is a series of same-attribute values expressed as binary digits; the data only records an objective fact and has no numerical meaning.

Optionally, the data in the Label Encoding part of Step 1 represents a weight or value and has mathematical meaning; the data are correlated with one another and connected within a class; each value is expressed as a decimal number of at most two digits.

Optionally, Step 3 comprises repeatedly generating fake data with a generator while a discriminator judges whether the generated data is fake, the two playing an ongoing game until fake data can no longer be distinguished from real data; the manufactured data is then used to balance the sample data set.

Optionally, the discrimination formula for the manufactured data is:

min_G max_D V(D, G) = E_{x~P_data}[log D(x)] + E_{z~P_z(z)}[log(1 - D(G(z)))]

where D(x) is the probability, under the discriminator's judgment, that the data is taken from the original data; D(G(z)) is the probability, under the discriminator's judgment, that the data is taken from the generator; x ~ P_data means the data comes from the original data; z ~ P_z(z) means the data comes from the generator; E[·] denotes the expectation; and min_G max_D means that, for the current generator and discriminator, the discriminator Max(D) is maximized while the generator error Min(G) is minimized.
Optionally, step 4 comprises training different models on different data, finding the several best-performing models across all training runs, and assigning weight ratios from largest to smallest according to the classification accuracy each model achieves, from highest to lowest.
Optionally, the weight ratios sum to 1.
Optionally, the several best-performing models include a classification decision tree model, a random forest model, an AdaBoost model, and an XGBoost model.
Optionally, the model discriminant is:

f(x) = (ω1θ1 + ω2θ2 + ω3θ3 + ω4θ4) / 4

where ω1, ω2, ω3, ω4 are the weights assigned to the four models, and θ1, θ2, θ3, θ4 are the prediction results obtained by the four models, respectively;

If the value of f(x) exceeds the set threshold, the sample is judged a positive example; otherwise it is a negative example.
Optionally, the behavior includes financial investment behavior.
Embodiment
This application is described by taking financial investment behavior as an example:
Financial institutions have long struggled with customer data analysis. They want to use the data in their hands to make binary predictions, i.e., whether a user will or will not perform some behavior: for example, whether the user will save money, apply for a credit card, or take out a loan.
Once a financial institution has user data, the first problem it faces is finding a data representation for the specific problem, i.e., whether to analyze raw numeric values directly or to convert them into another data format. This application first presents a mixed feature encoding method based on the data. Considering the different scenarios to which the two encoding methods apply, and after carefully analyzing the inter-class and intra-class relations of the categorical data, data with no intra-class relation and no inter-class influence is encoded with the One-Hot Encoding scheme, while the remaining data, which is interrelated and affected by numeric values, is encoded with Label Encoding. The two encodings are fused so that each individual is represented by one long sequence: a feature code containing both the One-Hot Encoding part and the Label Encoding part. The data is converted into this encoding uniformly; with it, the existing data can be analyzed in a unified way with no further conversion, and the converted data can be fed directly into a classification algorithm to produce an output.
First, the data is divided into a part that can be encoded with One-Hot Encoding and a part that must be encoded with Label Encoding.
Data encoded with One-Hot Encoding must satisfy the following conditions:
The data itself is text, and binary digits are merely used to represent this series of same-attribute values; that is, the data itself has no mathematical properties and is only represented by a code. For example, for gender, male and female are represented by 10 and 00 respectively; the seven days of the week, Monday through Sunday, can be represented as 000, 001, 010, 011, 100, 101, 110, 111.
The data is numeric but only records an objective fact, with no numerical meaning. For age, values such as 23, 25, and 62 can be represented by different combinations of 0s and 1s, namely the binary code corresponding to the decimal value; if the resulting codes differ in length, 0s are prepended to the high-order bits until all data of the same attribute have the same number of bits. This is the encoding used by the present method.
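As a non-limiting illustration, the fixed-width binary encoding just described can be sketched as follows (the function name is illustrative, not part of the claimed method):

```python
def fixed_width_binary(values):
    """Encode a list of non-negative integers as binary strings,
    zero-padded on the left so all codes of the attribute share
    the same width (the widest value determines the width)."""
    width = max(v.bit_length() for v in values)
    return [format(v, "b").zfill(width) for v in values]

# Ages 23, 25, 62 become equal-width codes carrying no numeric meaning.
print(fixed_width_binary([23, 25, 62]))  # ['010111', '011001', '111110']
```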
Label Encoding must comply with the following requirements:
The data itself represents a weight or a numerical value and is mathematically meaningful. For example, a user may hold several bank cards; if seven possibilities 1, 2, 3, ..., 7 occur in total, the Label Encoding is directly 1, 2, 3, ..., 7.
Data encoded with Label Encoding must be interrelated, with relations within the class. For example, when one user's behavior influences another user's behavior, this encoding is generally adopted.
Data encoded with Label Encoding is expressed as a decimal number of at most two digits, i.e., up to 99, giving a range of 0 to 99 (a requirement of this method).
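A minimal sketch of this Label Encoding rule, under the reading that the numeric value is used directly as the code (the function name and error message are illustrative):

```python
def label_encode(value):
    """Return the Label Encoding of a numeric field: the value itself,
    written in decimal, enforcing the method's two-digit limit (0-99)."""
    if not 0 <= value <= 99:
        raise ValueError("Label Encoding in this scheme covers 0-99 only")
    return str(value)

# A user holding 7 bank cards is encoded directly as "7".
print(label_encode(7))  # 7
```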
Combining the two encodings above, this method stipulates that the One-Hot Encoding part comes first and the Label Encoding part comes last, i.e.:
One-Hot Encoding | Label Encoding

Table 1. Encoding layout
For example:

110011 | 00110 | 111001 | 00010 | 11101 | 1 | 23 | 78 | 61 | 24

Table 2. Example feature code
This is called a multi-dimensional feature code because it fuses the two main encoding methods, reflecting two major characteristics of the data. The first is One-Hot Encoding: when the data items are unrelated and textual features appear, this encoding is used; the number of bits (i.e., the length) of the code depends on the situation, with no hard requirement, as long as the various features within a class can be distinguished, but the code length must be the same for all features within a class. Label Encoding is used when the numeric value of the data affects the feature result, so this value-bearing part of the data must be preserved; if it exceeds two digits, it is more convenient to represent the corresponding value with the first encoding, so two digits are appropriate for the second encoding.
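The concatenation rule of Table 1 — One-Hot fields first, Label fields last — can be sketched as below (a hypothetical helper, with the two-digit Label check from the requirement above):

```python
def feature_code(onehot_fields, label_fields):
    """Concatenate the per-attribute codes of one individual into a single
    row: One-Hot coded fields (binary strings) first, then Label coded
    fields (decimal strings of at most two digits), as Table 1 prescribes."""
    for f in label_fields:
        if not (f.isdigit() and len(f) <= 2):
            raise ValueError("label fields must be decimal with at most two digits")
    return list(onehot_fields) + list(label_fields)

# The example row of Table 2: five One-Hot fields followed by five Label fields.
row = feature_code(["110011", "00110", "111001", "00010", "11101"],
                   ["1", "23", "78", "61", "24"])
print(row)
```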
Second, considering that banks hold only a small amount of labeled data, after the feature-code conversion a generative adversarial network (GAN) can be used to enrich the few labeled records, i.e., to produce highly deceptive fake samples. These fake samples remedy the drawback of having few labeled samples: in machine learning, a large gap between labeled and unlabeled samples causes severe underfitting during training and seriously degrades the final classification accuracy. Labeled data can thus be augmented in batches, and the GAN generates a considerable number of fake labeled samples to balance the data. Experiments show that balancing the data has a significant effect on the final result; the balanced data clearly helps improve discrimination accuracy.
A GAN usually consists of two parts: the first is a generator and the second a discriminator. The generator repeatedly produces fake data, and the discriminator judges whether the data the generator gives it is fake; the two parts keep playing this game until the discriminator can no longer tell fake data from real data, which completes the "counterfeiting" process. After the encoding shown in Table 1 is completed, n rows of codes like Table 2 are generated, each row representing one user's feature code. The data that has already been labeled (i.e., judged to exhibit a certain behavior) is organized as in Table 1, this feature table is fed to the GAN, and the GAN, through the process above, manufactures many labeled but artificial records with which to balance the sample set.
Suppose an m×n table has been compiled, representing labeled data with m samples and n sub-categories, and is fed to the GAN. In this network, x denotes the data on this grid, and the generator learns a data distribution P_g. Because the data distribution contains noise, a noise distribution P_z(z) is defined to ensure the final robustness of the algorithm; the network has parameters θ_g, so G(z, θ_g) is defined as a mapping onto the original data — this is the principle by which the generator produces fake data. The discriminator D(x) gives the probability that the data comes from x, and D is trained so that it identifies with maximum probability whether data comes from its own training set or from G(z). Meanwhile G is trained to minimize log(1 - D(G(z))); since the generator sits in the innermost term, minimizing this expression pushes D(G(z)) toward its maximum, i.e., the generator strives to have its output judged as real, while the discriminator maximizes the probability of correctly identifying content from the generator. Combining the two objectives, we obtain:
min_G max_D V(D, G) = E_{x~P_data(x)}[log D(x)] + E_{z~P_z(z)}[log(1 - D(G(z)))]
The algorithm iterates until it converges, or until Min(G)Max(D) falls below a particular value, at which point the construction of a generator and a discriminator is complete; the data the generator then produces is usable labeled fake data. This remedies the drawback of an excessive gap between positive and negative examples in the sample set.
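The value function V(D, G) above can be estimated numerically with sample means; the sketch below only illustrates the objective being iterated on, not a trained GAN (the "blind" discriminator is a hypothetical example):

```python
import math

def gan_value(D, real_samples, fake_samples):
    """Sample estimate of V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))],
    the value function in the minimax formula above.  fake_samples are
    taken to be outputs G(z) of the generator."""
    real_term = sum(math.log(D(x)) for x in real_samples) / len(real_samples)
    fake_term = sum(math.log(1.0 - D(x)) for x in fake_samples) / len(fake_samples)
    return real_term + fake_term

# A discriminator that cannot tell real from fake outputs 0.5 everywhere;
# V then equals log(1/2) + log(1/2) = -2 log 2, the equilibrium at which
# the adversarial game stops.
blind = lambda x: 0.5
v = gan_value(blind, [0.1, 0.9, 0.4], [0.3, 0.7])
print(round(v, 4))  # -1.3863
```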
Finally, multiple models are fused. Machine learning offers many classification algorithms, including the decision tree model, the random forest model, the AdaBoost model, and so on. This application fuses several models using voting rules with weights, connecting them in parallel into one large, strong classifier.
Some models are sensitive to the data while others are not. Data-sensitive models include the support vector machine (SVM) and the linear regression model (LR); data-insensitive models include the decision tree model and the random forest model; models that perform excellently in ensembles include the AdaBoost algorithm and the XGBoost algorithm. Since our data likewise falls into two classes, unrelated and related, we compute with a voting-weight scheme. Multi-model fusion trains different models on different data, finds the four best-performing models across all training runs, and, according to the classification accuracy of each, assigns weight ratios from largest to smallest; these weight ratios sum to 1.
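The text only requires that a higher accuracy receive a larger weight and that the weights sum to 1; one plausible (assumed, not specified) realization is proportional normalization:

```python
def accuracy_weights(accuracies):
    """Give each model a weight proportional to its validation accuracy,
    so a more accurate model gets a larger share and the shares sum to 1.
    (Proportional normalization is an assumption; the method only fixes
    the ordering and the unit sum.)"""
    total = sum(accuracies)
    return [a / total for a in accuracies]

# Hypothetical accuracies for the four fused models, best first.
w1, w2, w3, w4 = accuracy_weights([0.90, 0.85, 0.80, 0.75])
print(w1 > w2 > w3 > w4, round(w1 + w2 + w3 + w4, 10))  # True 1.0
```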
Repeated experiments confirmed that, for our data and the financial customer behavior to be analyzed, the following four models fuse best. (Since all four methods have mature theory and formulations, they are not elaborated here; θ is the predicted probability of "yes".)
Decision Tree:
The classification decision tree model is a tree structure that describes the classification of instances. A decision tree consists of nodes and directed edges; there are two node types, internal nodes and leaf nodes, where an internal node represents a feature or attribute and a leaf node represents a class. The main advantages of the decision tree model are readability and fast classification. A decision tree learning algorithm typically selects the optimal feature recursively and splits the training data according to it, so that each sub-dataset obtains the best possible classification. The decision tree algorithm yields a prediction result θ1.
Random Forest:
Random forest is a versatile machine learning algorithm: a classifier that trains multiple trees on the samples and aggregates their predictions, capable of both regression and classification. It is also one of the important ensemble learning methods, excelling at integrating several weak models into one strong model so that the final classification can surpass any single model. Each split of a sub-tree in the random forest randomly selects a subset of the candidate features and then picks the optimal feature from that subset, so the decision trees in the forest differ from one another, increasing the diversity of the system and thus the classification performance. The random forest algorithm yields the prediction result θ2.
AdaBoost is an iterative algorithm whose core idea is to train different classifiers (weak classifiers) on the same training set and then combine these weak classifiers into a stronger final classifier (strong classifier). AdaBoost learns a base classifier G_t(x) from the training set weighted by the current distribution D_t(x) and computes the coefficient α_t of G_t(x), where α_t expresses the importance of G_t(x) in the final classifier. A linear combination of the base classifiers is then constructed:
f(x) = Σ_{t=1}^{T} α_t G_t(x)
and the final classifier is:
G(x) = sign(f(x)) = sign(Σ_{t=1}^{T} α_t G_t(x))
From this, the model's prediction at this point is θ3.
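The AdaBoost combination rule above — a sign over the α-weighted votes of weak classifiers — can be sketched with hypothetical threshold stumps (the stumps and weights are illustrative, not learned):

```python
def adaboost_predict(x, stumps, alphas):
    """sign(f(x)) with f(x) = sum_t alpha_t * G_t(x), where each weak
    classifier G_t returns +1 or -1."""
    f = sum(a * g(x) for g, a in zip(stumps, alphas))
    return 1 if f >= 0 else -1

# Three hypothetical threshold stumps on a scalar feature.
stumps = [lambda x: 1 if x > 0 else -1,
          lambda x: 1 if x > 2 else -1,
          lambda x: 1 if x > -1 else -1]
alphas = [0.8, 0.3, 0.5]  # illustrative coefficients; larger = more trusted

print(adaboost_predict(1.0, stumps, alphas))   # votes +1, -1, +1 -> f = 1.0 -> +1
print(adaboost_predict(-2.0, stumps, alphas))  # votes -1, -1, -1 -> f = -1.6 -> -1
```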
The XGBoost algorithm is a tree-based boosting algorithm whose most notable feature is that it automatically exploits CPU multithreading for parallelism while improving the algorithm itself to raise accuracy. Using XGBoost we obtain the prediction result θ4.
The four models are assigned weights ω1, ω2, ω3, ω4 in order of accuracy. If the four models yield results θ1, θ2, θ3, θ4, the final discriminant is:
f(x) = (ω1θ1 + ω2θ2 + ω3θ3 + ω4θ4) / 4
If the value of f(x) exceeds the set threshold, the sample is judged a positive example; otherwise it is a negative example, completing the judgment. (The threshold is set manually; a value above 0.7 is generally regarded as credible.)
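The fused decision, taken literally from the formula above, can be sketched as follows. Note one caveat: with weights summing to 1 and probabilities in [0, 1], the division by 4 caps f(x) at 0.25, so the threshold in this sketch is treated as a free parameter on that scale rather than fixed at 0.7 (the weights and probabilities below are hypothetical):

```python
def fused_decision(weights, thetas, threshold):
    """f(x) = (w1*t1 + w2*t2 + w3*t3 + w4*t4) / 4, as in the discriminant
    above; the sample is a positive example iff f(x) exceeds the threshold."""
    f = sum(w * t for w, t in zip(weights, thetas)) / 4
    return f, f > threshold

# Hypothetical weights (summing to 1) and per-model "yes" probabilities.
f, positive = fused_decision([0.4, 0.3, 0.2, 0.1], [0.9, 0.8, 0.7, 0.6], 0.15)
print(round(f, 3), positive)  # 0.2 True
```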
The original data structure mixing text and numbers is uniformly turned into a single row of multi-dimensional feature code. Under the mixed text-and-number structure, the items become data with the same properties that a classifier can process together. In this feature code the One-Hot Encoding part comes first and the Label Encoding part last, and the relations in the original data are not destroyed: independent items remain independent and related items keep their relations.
Given that labeled, i.e., manually annotated, data is scarce, the generated multi-dimensional feature codes are used to keep manufacturing "labeled" data in the GAN, which balances away the low accuracy caused by the imbalance between positive and negative samples in the sample set.
According to the different weights and data sensitivities, a large model fusing multiple models is designed. This large model accommodates both sensitive and insensitive data, making its classification results robust. Together these techniques effectively predict several kinds of binary-classification financial behavior.
In the behavior prediction method provided by this application, the sample data is fused into multi-dimensional feature codes using One-Hot Encoding and Label Encoding, a generative adversarial network is then used to enrich the existing labeled data, and finally multi-model fusion weights are used to classify the data and produce the output. This avoids a one-size-fits-all treatment of the data so that its effective features are fully exploited, and the generative adversarial network remedies the sample-imbalance defect, making classification more accurate and user behavior prediction effective.
Although this application has been described above with reference to specific embodiments, those skilled in the art should understand that many modifications can be made to the disclosed configurations and details within the principle and scope of the present disclosure. The scope of protection of this application is determined by the appended claims, which are intended to cover all modifications within the literal meaning or the scope of equivalents of the technical features in the claims.

Claims (10)

  1. A behavior prediction method, characterized in that the method comprises the following steps:
    Step 1: fusing One-Hot Encoding and Label Encoding into a multi-dimensional feature code;
    Step 2: representing the collected sample data as the multi-dimensional feature code of step 1;
    Step 3: enriching the existing labeled data with a generative adversarial network;
    Step 4: integrating multiple models and training them repeatedly to produce a weight factor for each model, and, after a weighted ensemble model is obtained, classifying the data obtained in step 3;
    Step 5: outputting the predicted behavior.
  2. The behavior prediction method of claim 1, characterized in that the data encoded with One-Hot Encoding in step 1 is a series of same-attribute values represented by binary digits; the data merely records an objective fact and carries no numerical meaning.
  3. The behavior prediction method of claim 1, characterized in that the data encoded with Label Encoding in step 1 represents a weight or a numerical value and is mathematically meaningful; the data items are correlated with one another and related within a class; the data is expressed as a decimal number of at most two digits.
  4. The behavior prediction method of claim 1, characterized in that step 3 comprises repeatedly generating fake data with a generator and judging with a discriminator whether the generated data is fake, the two playing an adversarial game until the discriminator can no longer tell fake data from real data; the manufactured data is used to balance the sample data set.
  5. The behavior prediction method of claim 4, characterized in that the discrimination formula for the manufactured data is:

    min_G max_D V(D, G) = E_{x~P_data(x)}[log D(x)] + E_{z~P_z(z)}[log(1 - D(G(z)))]

    where D(x) denotes the probability, as judged by the discriminator, that the data comes from the original data; D(G(z)) denotes the probability, as judged by the discriminator, that the data comes from the generator; x~P_data(x) indicates the data is drawn from the original data; z~P_z(z) indicates the data is drawn from the generator; E[·] denotes the expectation (mean);

    Min(G)Max(D)P(D, G) means that, for the current generator and discriminator P(D, G), the discriminator Max(D) is maximized while the generator error Min(G) is minimized.
  6. The behavior prediction method of claim 1, characterized in that step 4 comprises training different models on different data, finding the several best-performing models across all training runs, and assigning weight ratios from largest to smallest according to the classification accuracy each model achieves, from highest to lowest.
  7. The behavior prediction method of claim 6, characterized in that the weight ratios sum to 1.
  8. The behavior prediction method of claim 7, characterized in that the several best-performing models include a classification decision tree model, a random forest model, an AdaBoost model, and an XGBoost model.
  9. The behavior prediction method of claim 8, characterized in that the model discriminant is:

    f(x) = (ω1θ1 + ω2θ2 + ω3θ3 + ω4θ4) / 4

    where ω1, ω2, ω3, ω4 are the weights assigned to the four models, and θ1, θ2, θ3, θ4 are the prediction results obtained by the four models, respectively;

    If the value of f(x) exceeds the set threshold, the sample is judged a positive example; otherwise it is a negative example.
  10. The behavior prediction method of any one of claims 1 to 9, characterized in that the behavior includes financial investment behavior.
PCT/CN2019/121492 2018-12-04 2019-11-28 Behavior prediction method WO2020114302A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811473054.4A CN109766911A (en) 2018-12-04 2018-12-04 A kind of behavior prediction method
CN201811473054.4 2018-12-04

Publications (1)

Publication Number Publication Date
WO2020114302A1 true WO2020114302A1 (en) 2020-06-11

Family

ID=66450482

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/121492 WO2020114302A1 (en) 2018-12-04 2019-11-28 Behavior prediction method

Country Status (2)

Country Link
CN (1) CN109766911A (en)
WO (1) WO2020114302A1 (en)


Also Published As

Publication number Publication date
CN109766911A (en) 2019-05-17


Legal Events

121 — EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19892686; Country of ref document: EP; Kind code of ref document: A1)

NENP — Non-entry into the national phase (Ref country code: DE)

32PN — EP: public notification in the EP bulletin as address of the addressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 03/11/2021))

122 — EP: PCT application non-entry in European phase (Ref document number: 19892686; Country of ref document: EP; Kind code of ref document: A1)