WO2020114302A1 - Behavior prediction method - Google Patents

Behavior prediction method

Info

Publication number
WO2020114302A1
WO2020114302A1 · PCT/CN2019/121492 · CN2019121492W
Authority
WO
WIPO (PCT)
Prior art keywords
data
behavior
model
prediction method
behavior prediction
Prior art date
Application number
PCT/CN2019/121492
Other languages
French (fr)
Chinese (zh)
Inventor
梁栋
王珊珊
程慧涛
刘新
郑海荣
Original Assignee
深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院
Publication of WO2020114302A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"

Definitions

  • This application belongs to the field of information technology, and particularly relates to a behavior prediction method.
  • Feature encoding has a long history and is common in machine learning. It falls roughly into two categories: One-Hot Encoding and Label Encoding. The first suits unrelated data analyzed independently, since such encoding preserves the data's independent and identically distributed character; the second, Label Encoding, suits very large data sets, where it simplifies the data and guards against the curse of dimensionality.
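The two encoding families described above can be contrasted with a minimal sketch; the helper names and the category list here are illustrative, not from the patent.

```python
def one_hot(value, categories):
    """One-Hot Encoding: one independent binary indicator per category."""
    return [1 if value == c else 0 for c in categories]

def label_encode(value, categories):
    """Label Encoding: map each category to a small integer."""
    return categories.index(value)

colors = ["red", "green", "blue"]
print(one_hot("green", colors))      # [0, 1, 0]
print(label_encode("blue", colors))  # 2
```

One-hot output carries no ordering between categories, while the label code is a single number that does imply order and magnitude, which is why the two suit different kinds of data.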
  • Generative adversarial networks (GANs) are widely used in unsupervised machine learning algorithms.
  • the present application provides a behavior prediction method.
  • the method includes the following steps:
  • Step 1. Fuse One-Hot Encoding and Label Encoding into a multi-dimensional feature code;
  • Step 2. Represent the collected sample data as the multi-dimensional feature code of Step 1;
  • Step 3. Enrich the existing label data with a generative adversarial network;
  • Step 4. Integrate multiple models and train repeatedly to generate a weighting factor for each model; once an integrated model with weights is obtained, classify the data from Step 3;
  • Step 5. Output the predicted behavior.
  • the data in the One-Hot Encoding part of Step 1 is a series of same-attribute values expressed as binary digits; the data only records an objective fact and has no numerical meaning.
  • the data in the Label Encoding part of Step 1 represents a weight or value and has mathematical meaning; the data are correlated with one another and connected within a class; each value is expressed as a decimal number of at most two digits.
  • Step 3 comprises repeatedly generating fake data with a generator while a discriminator judges whether the generated data is fake, the two playing an ongoing game until fake data can no longer be distinguished from real data; the manufactured data is then used to balance the sample data set.
  • the discrimination formula for the manufactured data is: min_G max_D V(D, G) = E_{x~P_data}[log D(x)] + E_{z~P_z(z)}[log(1 - D(G(z)))], where D(x) is the probability, under the discriminator's judgment, that the data is taken from the original data; D(G(z)) is the probability, under the discriminator's judgment, that the data is taken from the generator; x ~ P_data means the data comes from the original data; z ~ P_z(z) means the data comes from the generator; E[·] denotes the expectation; and min_G max_D means that, for the current generator and discriminator, the discriminator Max(D) is maximized while the generator error Min(G) is minimized.
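The GAN value function described above can be evaluated numerically on hand-picked discriminator outputs; all probabilities below are invented for illustration, not taken from the patent.

```python
import math

def value_fn(d_real, d_fake):
    """V(D,G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))], as sample averages."""
    e_real = sum(math.log(p) for p in d_real) / len(d_real)
    e_fake = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return e_real + e_fake

# A confident discriminator (high D on real, low D on fake) pushes V up;
# at the game's equilibrium D outputs 0.5 everywhere and V = -2*ln(2).
print(value_fn([0.9, 0.8], [0.1, 0.2]))  # higher V: discriminator winning
print(value_fn([0.5, 0.5], [0.5, 0.5]))  # equilibrium, about -1.386
```

This matches the minimax reading: the discriminator is trained to raise V while the generator is trained to lower the second term.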
  • Step 4 comprises training different models on different data, finding the best-performing models across all trainings according to their classification accuracy, and assigning weight ratios from largest to smallest in order of accuracy.
  • the sum of the weight ratios is 1.
  • the best performing models include a classification decision tree model, a random forest model, an AdaBoost model, and an XGBoost model.
  • the model discriminant is: f(x) = (ω1θ1 + ω2θ2 + ω3θ3 + ω4θ4)/4, where ω1, ω2, ω3, ω4 are the weights assigned to the four models and θ1, θ2, θ3, θ4 are the prediction results obtained from the four models, respectively;
  • the behavior includes financial investment behavior.
  • the behavior prediction method provided in this application fuses the sample data's One-Hot Encoding and Label Encoding into a multi-dimensional feature code, then uses a generative adversarial network to enrich the existing label data, and finally classifies the data with multi-model fusion weights before output. This avoids one-size-fits-all data processing, makes full use of the data's effective features, and lets the generative adversarial network remedy sample imbalance, so that data classification is more accurate and user behavior is effectively predicted.
  • FIG. 1 is a flowchart of a behavior prediction method of the present application.
  • This application uses financial investment behavior as an example to illustrate:
  • This application first gives a data-based mixed feature encoding method. Considering the different scenarios in which the two encoding methods apply, the associations of the other data categories, both between classes and within classes, are carefully analyzed: data with no intra-class association and no inter-class influence is encoded with the One-Hot Encoding scheme, while the remaining data, which is interrelated and affected by its values, uses Label Encoding. The two codes are fused so that, for each individual, one long feature code sequence containing both One-Hot Encoding and Label Encoding parts is formed. The data is uniformly encoded and converted; with this encoding, the existing data can be analyzed uniformly without further transformation, and the converted data feeds directly into a classification algorithm for output.
  • sometimes the data itself is text, and binary digits are used to represent this series of same-attribute values; that is, the data itself has no mathematical property and is only represented by a code. For example, gender: male and female are represented by 10 and 00, respectively; the seven days of the week, Monday through Sunday, can be expressed as 000, 001, 010, 011, 100, 101, 110.
  • sometimes the data is a number but only records an objective fact with no numerical meaning: for example, ages 23, 25, and 62 can be represented by different combinations of 0s and 1s, namely the binary code corresponding to the decimal value; if the encoded widths differ, 0s are added to the high-order digits until all data of the same attribute have the same width. This method uses this encoding.
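The zero-padded binary coding described above can be sketched as follows; the ages are the ones from the text, while the helper name is ours.

```python
def binary_codes(values):
    """Binary-encode same-attribute values, left-padding with 0s so that
    every code has the width of the widest value."""
    width = max(v.bit_length() for v in values)
    return {v: format(v, "0%db" % width) for v in values}

print(binary_codes([23, 25, 62]))
# {23: '010111', 25: '011001', 62: '111110'}
```

All three codes come out six digits wide because 62 needs six binary digits, which is exactly the padding rule the text describes.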
  • when the data itself represents a weight or value, it is mathematically meaningful. For example, the number of bank cards a user holds: with seven possibilities 1, 2, 3, ..., 7, the Label Encoding is directly 1, 2, 3, ..., 7.
  • data encoded with Label Encoding must be interrelated and related within the class; for example, one user's behavior affects another user's behavior, so this encoding method is generally adopted in such cases.
  • data encoded with Label Encoding is expressed as a decimal number of at most two digits, i.e. up to 99, so the range is 0 to 99 (this is a requirement of this method).
  • this method specifies that the One-Hot Encoding part comes first and the Label Encoding part comes last.
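A minimal sketch of the fused multi-dimensional feature code, one-hot part first and label part last; the field meanings are placeholders chosen to echo the text's examples (gender code 10, weekday code 011, a card count, and a two-digit value), not data from the patent.

```python
# One-hot part: binary digits with no numerical meaning.
one_hot_part = [1, 0] + [0, 1, 1]   # e.g. gender "10" + weekday "011"
# Label part: decimal values in 0..99 with mathematical meaning.
label_part = [3, 42]                # e.g. number of bank cards, another value

feature_code = one_hot_part + label_part
print(feature_code)  # [1, 0, 0, 1, 1, 3, 42]
```

Concatenation keeps the two regimes separate inside one sequence, so independent fields stay independent while the correlated label fields keep their numeric relationships.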
  • this encoding method combines the two main encoding methods because of the two major characteristics of the data: One-Hot Encoding is used when there is no correlation between the data and the features are textual.
  • the number of digits (i.e. the length) of the code depends on the situation; there is no hard requirement, as long as the various features within a class can be distinguished, but the code length of features within a class must be the same.
  • this application considers that the amount of bank label data is small.
  • a generative adversarial network (abbreviated "GAN") can be used to generate abundant label data in the form of highly confusing fake samples; these fake samples make up for the shortage of labeled sample data.
  • a considerable number of fake labeled samples are generated with the GAN to balance the data. Experiments showed that data balance has a significant impact on the final result; balanced data clearly helps improve discrimination accuracy.
  • GAN usually consists of two parts, the first part is the generator and the second part is the discriminator.
  • the generator is used to generate fake data repeatedly, and the discriminator is used to identify whether the data given to it by the generator is fake data.
  • the two parts keep playing this game until the discriminator can no longer tell fake data from real data, at which point the "counterfeiting" process is complete.
  • after the coding shown in Table 1 is completed, n lines of codes as in Table 2 are generated; each line represents one user's feature code.
  • an m*n table is compiled, representing labeled data with m samples and n sub-categories; it is passed to the GAN network.
  • x is used to represent the data in this table.
  • the generator learns a data distribution P_g. Because the data distribution contains noise, a noise distribution function P_z(z) is defined to ensure the robustness of the final algorithm; with the network's original parameters θ_g, G(z; θ_g) is defined as a mapping onto the original data. This is the principle and method by which the generator produces fake data.
  • the discriminator D(x) indicates the probability that the data comes from x, and D is trained to maximize this ability, i.e. to identify with maximum probability whether data comes from its own training data set or from G(z); at the same time, G is trained to minimize log(1 - D(G(z))).
  • the innermost part of this formula concerns the generator: for the discriminator to maximize the formula, the inner D(G(z)) must be driven down, meaning the discriminator maximizes the probability of accurately identifying content from the generator. Combining the above two parts gives: min_G max_D V(D, G) = E_{x~P_data}[log D(x)] + E_{z~P_z(z)}[log(1 - D(G(z)))].
  • Multi-model fusion: there are many classification algorithms in machine learning; the models include the decision tree model, the random forest model, the AdaBoost model, and so on.
  • a variety of models are fused using voting rules and finally connected in parallel into one large classifier; the weighting method merges them into a strong model for classification.
  • data-sensitive models include support vector machines ("SVM") and linear regression models ("LR"); data-insensitive models include the decision tree model, the random forest model, and the like; the models with excellent performance in model integration include the AdaBoost algorithm and the XGBoost algorithm.
  • the classification decision tree model is a tree structure that describes the classification of instances.
  • the decision tree consists of nodes and directed edges. There are two types of nodes, internal nodes and leaf nodes: an internal node represents a feature or attribute, and a leaf node represents a class.
  • the main advantage of the decision tree model is that the model is readable and the classification speed is fast.
  • the decision tree learning algorithm usually selects optimal features recursively and segments the training data according to the optimal features, so that each sub-data set gets a best classification. The decision tree algorithm yields a prediction result θ1.
  • Random forest is a versatile machine learning algorithm: a classifier that uses multiple trees to train on and predict samples, and it can perform both regression and classification tasks. It is also one of the important methods in ensemble learning: it shines when several weak models are integrated into one efficient model, so that the final classification effect can exceed that of any single-model algorithm.
  • each splitting of a subtree in the random forest randomly selects certain features from all candidate features and then chooses the optimal feature among them, so that the decision trees in the random forest differ from one another, enhancing the diversity of the system and thereby improving classification performance.
  • the prediction result of this model is θ2.
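The per-split feature subsampling described above can be sketched as follows; the square-root subset size is a common convention, used here as an assumption since the patent does not specify one.

```python
import random

def split_candidates(feature_indices, rng):
    """Return the random subset of features a single split may consider."""
    k = max(1, int(len(feature_indices) ** 0.5))  # sqrt-size (assumption)
    return sorted(rng.sample(feature_indices, k))

rng = random.Random(0)
features = list(range(9))  # 9 candidate features, indices 0..8
print(split_candidates(features, rng))  # 3 distinct feature indices
```

Because each split draws its own subset, two trees grown on the same data rarely make the same sequence of splits, which is the diversity the text refers to.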
  • AdaBoost is an iterative algorithm whose core idea is to train different classifiers (weak classifiers) against the same training set, and then combine these weak classifiers to form a stronger final classifier (strong classifier).
  • AdaBoost learns a basic classifier G_t(x) from the training data weighted by the current distribution D_t(x), and computes the coefficient α_t of G_t(x), where α_t represents the importance of G_t(x) in the final classifier. A linear combination of the basic classifiers is then constructed: f(x) = Σ_t α_t G_t(x).
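The linear combination above can be sketched with hand-picked weak classifiers; the decision stumps and coefficients here are invented for illustration, not learned as real AdaBoost would learn them.

```python
def ada_combine(x, stumps, alphas):
    """Sign of the weighted vote sum_t alpha_t * G_t(x), with G_t in {-1, +1}."""
    score = sum(a * g(x) for g, a in zip(stumps, alphas))
    return 1 if score >= 0 else -1

stumps = [lambda x: 1 if x > 2 else -1,   # illustrative weak classifier 1
          lambda x: 1 if x > 5 else -1]   # illustrative weak classifier 2
alphas = [0.7, 0.3]                       # illustrative importance weights

print(ada_combine(6, stumps, alphas))  # both stumps vote +1 -> 1
print(ada_combine(1, stumps, alphas))  # both vote -1 -> -1
print(ada_combine(4, stumps, alphas))  # 0.7 - 0.3 = 0.4 -> 1
```

The third case shows why the coefficients matter: the more important stump outvotes the weaker one when they disagree.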
  • the XGBoost algorithm is a tree-based boosting algorithm. Its biggest feature is that it can automatically use the CPU's multithreading for parallelization while also improving the algorithm for better accuracy. XGBoost is used to obtain the prediction result θ4.
  • the four models are assigned weights ω1, ω2, ω3, ω4 and produce prediction results θ1, θ2, θ3, θ4; the final judgment is then: f(x) = (ω1θ1 + ω2θ2 + ω3θ3 + ω4θ4)/4.
  • if the value of f(x) exceeds the set threshold, the sample is judged a positive example; otherwise it is a negative example, completing the judgment. (The threshold is set manually; generally a value above 0.7 is considered credible.)
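The fusion discriminant and threshold step can be sketched directly; the weights, per-model predictions, and threshold below are invented numbers. Note one assumption we make explicit: with weights summing to 1 and predictions in [0, 1], the division by 4 in the formula as written caps f(x) at 0.25, so this illustration uses a threshold below that rather than the 0.7 mentioned in the text.

```python
def fuse(weights, preds, threshold):
    """f(x) = (w1*t1 + w2*t2 + w3*t3 + w4*t4) / 4, then threshold it."""
    f = sum(w * t for w, t in zip(weights, preds)) / 4.0
    return f, ("positive" if f > threshold else "negative")

weights = [0.4, 0.3, 0.2, 0.1]  # summing to 1, ordered by model accuracy
preds = [1.0, 1.0, 1.0, 0.0]    # theta_i from the four models (invented)
print(fuse(weights, preds, 0.2))  # (0.225, 'positive')
```

Three of the four models, carrying 0.9 of the weight, vote positive, so the fused score clears the example threshold.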
  • the data structure mixing the original text and numbers is unified into one line of multi-dimensional feature code; under the mixed text-and-number data structure, the data is unified into same-attribute data that the classifier can process together.
  • in this feature code the One-Hot Encoding part comes first and the Label Encoding part comes after, and the association of the original data is not destroyed: independent data remains independent, and related data keeps its association.
  • the generated multi-dimensional feature codes are used in the GAN to continuously create "labeled" data, which counteracts the low accuracy caused by the imbalance of positive and negative samples in the sample set.
  • the behavior prediction method provided in this application fuses the sample data's One-Hot Encoding and Label Encoding into a multi-dimensional feature code, then uses a generative adversarial network to enrich the existing label data, and finally classifies the data with multi-model fusion weights before output. This avoids one-size-fits-all data processing, makes full use of the data's effective features, and lets the generative adversarial network remedy sample imbalance, so that data classification is more accurate and user behavior is effectively predicted.

Abstract

Disclosed is a behavior prediction method, belonging to the field of information technology. User behavior is predicted from data; however, the attributes of the existing data differ completely, and the data may in fact have no association, so a one-size-fits-all data processing method is currently unsuited to precise prediction under big data. The method comprises: step 1, fusing One-Hot Encoding and Label Encoding into a multi-dimensional feature code; step 2, representing collected sample data as the multi-dimensional feature code of step 1; step 3, enriching existing label data with a generative adversarial network; step 4, integrating a plurality of models, training repeatedly to generate a weight factor for each model, and, after an integrated model with weights is obtained, classifying the data obtained in step 3; and step 5, outputting a predicted behavior. By means of this method, data classification is more accurate and user behavior is effectively predicted.

Description

A behavior prediction method

Technical Field

This application belongs to the field of information technology, and particularly relates to a behavior prediction method.

Background Art

Feature encoding has a long history and is common in machine learning. It falls roughly into two categories: One-Hot Encoding and Label Encoding. The first suits unrelated data analyzed independently, since such encoding preserves the data's independent and identically distributed character; the second, Label Encoding, suits very large data sets, where it simplifies the data and guards against the curse of dimensionality. Generative adversarial networks (GANs) are widely used in unsupervised machine learning algorithms.

User behavior is predicted from data, but much of the data now available consists of objective user-attribute data and other behavior data. The attributes of these data differ completely and cannot be well unified; converting them into decimal numbers forcibly imposes a numerical relationship on data that may in fact be unrelated. Such a one-size-fits-all treatment is unsuited to accurate prediction under today's big data.
Summary of the Invention

1. Technical problem to be solved

Given that user behavior is predicted from data, yet the available data consists largely of objective user-attribute data and other behavior data whose attributes differ completely and cannot be well unified, and that converting them into decimal numbers forcibly imposes numerical relationships on possibly unrelated data, a one-size-fits-all data processing method is unsuited to accurate prediction under big data; this application therefore provides a behavior prediction method.

2. Technical solution

To achieve the above objective, the present application provides a behavior prediction method comprising the following steps:

Step 1. Fuse One-Hot Encoding and Label Encoding into a multi-dimensional feature code;

Step 2. Represent the collected sample data as the multi-dimensional feature code of Step 1;

Step 3. Enrich the existing label data with a generative adversarial network;

Step 4. Integrate multiple models and train repeatedly to generate a weighting factor for each model; once an integrated model with weights is obtained, classify the data from Step 3;

Step 5. Output the predicted behavior.
Optionally, the data in the One-Hot Encoding part of Step 1 is a series of same-attribute values expressed as binary digits; the data only records an objective fact and has no numerical meaning.

Optionally, the data in the Label Encoding part of Step 1 represents a weight or value and has mathematical meaning; the data are correlated with one another and connected within a class; each value is expressed as a decimal number of at most two digits.

Optionally, Step 3 comprises repeatedly generating fake data with a generator while a discriminator judges whether the generated data is fake, the two playing an ongoing game until fake data can no longer be distinguished from real data; the manufactured data is then used to balance the sample data set.

Optionally, the discrimination formula for the manufactured data is:

min_G max_D V(D, G) = E_{x~P_data}[log D(x)] + E_{z~P_z(z)}[log(1 - D(G(z)))]

where D(x) is the probability, under the discriminator's judgment, that the data is taken from the original data; D(G(z)) is the probability, under the discriminator's judgment, that the data is taken from the generator; x ~ P_data means the data comes from the original data; z ~ P_z(z) means the data comes from the generator; E[·] denotes the expectation; and min_G max_D means that, for the current generator and discriminator, the discriminator Max(D) is maximized while the generator error Min(G) is minimized.
Optionally, Step 4 comprises training different models on different data, finding the best-performing models across all trainings according to their classification accuracy, and assigning weight ratios from largest to smallest in order of accuracy.

Optionally, the sum of the weight ratios is 1.

Optionally, the best-performing models include a classification decision tree model, a random forest model, an AdaBoost model, and an XGBoost model.

Optionally, the model discriminant is:

f(x) = (ω1θ1 + ω2θ2 + ω3θ3 + ω4θ4)/4

where ω1, ω2, ω3, ω4 are the weights assigned to the four models and θ1, θ2, θ3, θ4 are the prediction results obtained from the four models, respectively; if the value of f(x) exceeds the set threshold, the sample is judged a positive example, otherwise a negative example.

Optionally, the behavior includes financial investment behavior.
3. Beneficial effects

Compared with the prior art, the behavior prediction method provided by this application has the following beneficial effects: the method fuses the sample data's One-Hot Encoding and Label Encoding into a multi-dimensional feature code, then uses a generative adversarial network to enrich the existing label data, and finally classifies the data with multi-model fusion weights before output. This avoids one-size-fits-all data processing, makes full use of the data's effective features, and lets the generative adversarial network remedy sample imbalance, so that data classification is more accurate and user behavior is effectively predicted.

Brief Description of the Drawings

FIG. 1 is a flowchart of the behavior prediction method of the present application.

Detailed Description

Hereinafter, specific embodiments of the present application are described in detail with reference to the accompanying drawings; from these detailed descriptions, those skilled in the art can clearly understand and implement the present application. Without departing from the principles of the present application, features of different embodiments may be combined to obtain new implementations, or certain features of certain embodiments may be replaced to obtain other preferred implementations.
Referring to FIG. 1, this application provides a behavior prediction method comprising the following steps:

Step 1. Fuse One-Hot Encoding and Label Encoding into a multi-dimensional feature code;

Step 2. Represent the collected sample data as the multi-dimensional feature code of Step 1;

Step 3. Enrich the existing label data with a generative adversarial network;

Step 4. Integrate multiple models and train repeatedly to generate a weighting factor for each model; once an integrated model with weights is obtained, classify the data from Step 3;

Step 5. Output the predicted behavior.

Optionally, the data in the One-Hot Encoding part of Step 1 is a series of same-attribute values expressed as binary digits; the data only records an objective fact and has no numerical meaning.

Optionally, the data in the Label Encoding part of Step 1 represents a weight or value and has mathematical meaning; the data are correlated with one another and connected within a class; each value is expressed as a decimal number of at most two digits.

Optionally, Step 3 comprises repeatedly generating fake data with a generator while a discriminator judges whether the generated data is fake, the two playing an ongoing game until fake data can no longer be distinguished from real data; the manufactured data is then used to balance the sample data set.

Optionally, the discrimination formula for the manufactured data is:

min_G max_D V(D, G) = E_{x~P_data}[log D(x)] + E_{z~P_z(z)}[log(1 - D(G(z)))]

where D(x) is the probability, under the discriminator's judgment, that the data is taken from the original data; D(G(z)) is the probability, under the discriminator's judgment, that the data is taken from the generator; x ~ P_data means the data comes from the original data; z ~ P_z(z) means the data comes from the generator; E[·] denotes the expectation; and min_G max_D means that, for the current generator and discriminator, the discriminator Max(D) is maximized while the generator error Min(G) is minimized.
Optionally, step 4 comprises training different models on different data, finding the several best-performing models across all training runs, and assigning weight ratios from largest to smallest according to the classification accuracy each model achieves, from highest to lowest.
Optionally, the weight ratios sum to 1.
Optionally, the several best-performing models include a classification decision tree model, a random forest model, an AdaBoost model, and an XGBoost model.
Optionally, the model discriminant is:

f(x) = (ω1θ1 + ω2θ2 + ω3θ3 + ω4θ4) / 4

where ω1, ω2, ω3, ω4 are the weights assigned to the four models, and θ1, θ2, θ3, θ4 are the prediction results obtained by the four models, respectively;

If the value of f(x) exceeds the set threshold, the sample is judged a positive example; otherwise it is a negative example.
Optionally, the behavior includes financial investment behavior.
Embodiment
This application is described by taking financial investment behavior as an example:
Financial institutions have long struggled with customer data analysis. They want to use the data in their hands to make binary predictions, i.e., whether a user will or will not perform some behavior: for example, whether the user will save money, apply for a credit card, or take out a loan.
Once a financial institution has user data, the first problem it faces is finding a data representation for the specific problem, i.e., whether to analyze raw numeric values directly or to convert them into another data format. This application first presents a mixed feature encoding method based on the data. Considering the different scenarios to which the two encoding methods apply, and after carefully analyzing the inter-class and intra-class relations of the categorical data, data with no intra-class relation and no inter-class influence is encoded with the One-Hot Encoding scheme, while the remaining data, which is interrelated and affected by numeric values, is encoded with Label Encoding. The two encodings are fused so that each individual is represented by one long sequence: a feature code containing both the One-Hot Encoding part and the Label Encoding part. The data is converted into this encoding uniformly; with it, the existing data can be analyzed in a unified way with no further conversion, and the converted data can be fed directly into a classification algorithm to produce an output.
First, the data is divided into a part that can be encoded with One-Hot Encoding and a part that must be encoded with Label Encoding.
Data encoded with One-Hot Encoding must satisfy the following conditions:
The data itself is text, and binary digits are merely used to represent this series of same-attribute values; that is, the data itself has no mathematical properties and is only represented by a code. For example, for gender, male and female are represented by 10 and 00 respectively; the seven days of the week, Monday through Sunday, can be represented as 000, 001, 010, 011, 100, 101, 110, 111.
The data is numeric but only records an objective fact, with no numerical meaning. For age, values such as 23, 25, and 62 can be represented by different combinations of 0s and 1s, namely the binary code corresponding to the decimal value; if the resulting codes differ in length, 0s are prepended to the high-order bits until all data of the same attribute have the same number of bits. This is the encoding used by the present method.
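As a non-limiting illustration, the fixed-width binary encoding just described can be sketched as follows (the function name is illustrative, not part of the claimed method):

```python
def fixed_width_binary(values):
    """Encode a list of non-negative integers as binary strings,
    zero-padded on the left so all codes of the attribute share
    the same width (the widest value determines the width)."""
    width = max(v.bit_length() for v in values)
    return [format(v, "b").zfill(width) for v in values]

# Ages 23, 25, 62 become equal-width codes carrying no numeric meaning.
print(fixed_width_binary([23, 25, 62]))  # ['010111', '011001', '111110']
```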
Label Encoding must comply with the following requirements:
The data itself represents a weight or a numerical value and is mathematically meaningful. For example, a user may hold several bank cards; if seven possibilities 1, 2, 3, ..., 7 occur in total, the Label Encoding is directly 1, 2, 3, ..., 7.
Data encoded with Label Encoding must be interrelated, with relations within the class. For example, when one user's behavior influences another user's behavior, this encoding is generally adopted.
Data encoded with Label Encoding is expressed as a decimal number of at most two digits, i.e., up to 99, giving a range of 0 to 99 (a requirement of this method).
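A minimal sketch of this Label Encoding rule, under the reading that the numeric value is used directly as the code (the function name and error message are illustrative):

```python
def label_encode(value):
    """Return the Label Encoding of a numeric field: the value itself,
    written in decimal, enforcing the method's two-digit limit (0-99)."""
    if not 0 <= value <= 99:
        raise ValueError("Label Encoding in this scheme covers 0-99 only")
    return str(value)

# A user holding 7 bank cards is encoded directly as "7".
print(label_encode(7))  # 7
```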
Combining the two encodings above, this method stipulates that the One-Hot Encoding part comes first and the Label Encoding part comes last, i.e.:
One-Hot Encoding | Label Encoding

Table 1. Encoding layout
For example:

110011 | 00110 | 111001 | 00010 | 11101 | 1 | 23 | 78 | 61 | 24

Table 2. Example feature code
This is called a multi-dimensional feature code because it fuses the two main encoding methods, reflecting two major characteristics of the data. The first is One-Hot Encoding: when the data items are unrelated and textual features appear, this encoding is used; the number of bits (i.e., the length) of the code depends on the situation, with no hard requirement, as long as the various features within a class can be distinguished, but the code length must be the same for all features within a class. Label Encoding is used when the numeric value of the data affects the feature result, so this value-bearing part of the data must be preserved; if it exceeds two digits, it is more convenient to represent the corresponding value with the first encoding, so two digits are appropriate for the second encoding.
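The concatenation rule of Table 1 — One-Hot fields first, Label fields last — can be sketched as below (a hypothetical helper, with the two-digit Label check from the requirement above):

```python
def feature_code(onehot_fields, label_fields):
    """Concatenate the per-attribute codes of one individual into a single
    row: One-Hot coded fields (binary strings) first, then Label coded
    fields (decimal strings of at most two digits), as Table 1 prescribes."""
    for f in label_fields:
        if not (f.isdigit() and len(f) <= 2):
            raise ValueError("label fields must be decimal with at most two digits")
    return list(onehot_fields) + list(label_fields)

# The example row of Table 2: five One-Hot fields followed by five Label fields.
row = feature_code(["110011", "00110", "111001", "00010", "11101"],
                   ["1", "23", "78", "61", "24"])
print(row)
```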
Second, considering that banks hold only a small amount of labeled data, after the feature-code conversion a generative adversarial network (GAN) can be used to enrich the few labeled records, i.e., to produce highly deceptive fake samples. These fake samples remedy the drawback of having few labeled samples: in machine learning, a large gap between labeled and unlabeled samples causes severe underfitting during training and seriously degrades the final classification accuracy. Labeled data can thus be augmented in batches, and the GAN generates a considerable number of fake labeled samples to balance the data. Experiments show that balancing the data has a significant effect on the final result; the balanced data clearly helps improve discrimination accuracy.
A GAN usually consists of two parts: the first is a generator and the second a discriminator. The generator repeatedly produces fake data, and the discriminator judges whether the data the generator gives it is fake; the two parts keep playing this game until the discriminator can no longer tell fake data from real data, which completes the "counterfeiting" process. After the encoding shown in Table 1 is completed, n rows of codes like Table 2 are generated, each row representing one user's feature code. The data that has already been labeled (i.e., judged to exhibit a certain behavior) is organized as in Table 1, this feature table is fed to the GAN, and the GAN, through the process above, manufactures many labeled but artificial records with which to balance the sample set.
Suppose an m×n table has been compiled, representing labeled data with m samples and n sub-categories, and is fed to the GAN. In this network, x denotes the data on this grid, and the generator learns a data distribution P_g. Because the data distribution contains noise, a noise distribution P_z(z) is defined to ensure the final robustness of the algorithm; the network has parameters θ_g, so G(z, θ_g) is defined as a mapping onto the original data — this is the principle by which the generator produces fake data. The discriminator D(x) gives the probability that the data comes from x, and D is trained so that it identifies with maximum probability whether data comes from its own training set or from G(z). Meanwhile G is trained to minimize log(1 - D(G(z))); since the generator sits in the innermost term, minimizing this expression pushes D(G(z)) toward its maximum, i.e., the generator strives to have its output judged as real, while the discriminator maximizes the probability of correctly identifying content from the generator. Combining the two objectives, we obtain:
min_G max_D V(D, G) = E_{x~P_data(x)}[log D(x)] + E_{z~P_z(z)}[log(1 - D(G(z)))]
The algorithm iterates until it converges, or until Min(G)Max(D) falls below a particular value, at which point the construction of a generator and a discriminator is complete; the data the generator then produces is usable labeled fake data. This remedies the drawback of an excessive gap between positive and negative examples in the sample set.
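The value function V(D, G) above can be estimated numerically with sample means; the sketch below only illustrates the objective being iterated on, not a trained GAN (the "blind" discriminator is a hypothetical example):

```python
import math

def gan_value(D, real_samples, fake_samples):
    """Sample estimate of V(D, G) = E_x[log D(x)] + E_z[log(1 - D(G(z)))],
    the value function in the minimax formula above.  fake_samples are
    taken to be outputs G(z) of the generator."""
    real_term = sum(math.log(D(x)) for x in real_samples) / len(real_samples)
    fake_term = sum(math.log(1.0 - D(x)) for x in fake_samples) / len(fake_samples)
    return real_term + fake_term

# A discriminator that cannot tell real from fake outputs 0.5 everywhere;
# V then equals log(1/2) + log(1/2) = -2 log 2, the equilibrium at which
# the adversarial game stops.
blind = lambda x: 0.5
v = gan_value(blind, [0.1, 0.9, 0.4], [0.3, 0.7])
print(round(v, 4))  # -1.3863
```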
Finally, multiple models are fused. Machine learning offers many classification algorithms, including the decision tree model, the random forest model, the AdaBoost model, and so on. This application fuses several models using voting rules with weights, connecting them in parallel into one large, strong classifier.
Some models are sensitive to the data while others are not. Data-sensitive models include the support vector machine (SVM) and the linear regression model (LR); data-insensitive models include the decision tree model and the random forest model; models that perform excellently in ensembles include the AdaBoost algorithm and the XGBoost algorithm. Since our data likewise falls into two classes, unrelated and related, we compute with a voting-weight scheme. Multi-model fusion trains different models on different data, finds the four best-performing models across all training runs, and, according to the classification accuracy of each, assigns weight ratios from largest to smallest; these weight ratios sum to 1.
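The text only requires that a higher accuracy receive a larger weight and that the weights sum to 1; one plausible (assumed, not specified) realization is proportional normalization:

```python
def accuracy_weights(accuracies):
    """Give each model a weight proportional to its validation accuracy,
    so a more accurate model gets a larger share and the shares sum to 1.
    (Proportional normalization is an assumption; the method only fixes
    the ordering and the unit sum.)"""
    total = sum(accuracies)
    return [a / total for a in accuracies]

# Hypothetical accuracies for the four fused models, best first.
w1, w2, w3, w4 = accuracy_weights([0.90, 0.85, 0.80, 0.75])
print(w1 > w2 > w3 > w4, round(w1 + w2 + w3 + w4, 10))  # True 1.0
```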
Repeated experiments confirmed that, for our data and the financial customer behavior to be analyzed, the following four models fuse best. (Since all four methods have mature theory and formulations, they are not elaborated here; θ is the predicted probability of "yes".)
Decision Tree:
The classification decision tree model is a tree structure that describes the classification of instances. A decision tree consists of nodes and directed edges; there are two node types, internal nodes and leaf nodes, where an internal node represents a feature or attribute and a leaf node represents a class. The main advantages of the decision tree model are readability and fast classification. A decision tree learning algorithm typically selects the optimal feature recursively and splits the training data according to it, so that each sub-dataset obtains the best possible classification. The decision tree algorithm yields a prediction result θ1.
Random Forest:
Random forest is a versatile machine learning algorithm: a classifier that trains multiple trees on the samples and aggregates their predictions, capable of both regression and classification. It is also one of the important ensemble learning methods, excelling at integrating several weak models into one strong model so that the final classification can surpass any single model. Each split of a sub-tree in the random forest randomly selects a subset of the candidate features and then picks the optimal feature from that subset, so the decision trees in the forest differ from one another, increasing the diversity of the system and thus the classification performance. The random forest algorithm yields the prediction result θ2.
AdaBoost is an iterative algorithm whose core idea is to train different classifiers (weak classifiers) on the same training set and then combine these weak classifiers into a stronger final classifier (strong classifier). AdaBoost learns a base classifier G_t(x) from the training set weighted by the current distribution D_t(x) and computes the coefficient α_t of G_t(x), where α_t expresses the importance of G_t(x) in the final classifier. A linear combination of the base classifiers is then constructed:
f(x) = Σ_{t=1}^{T} α_t G_t(x)
and the final classifier is:
G(x) = sign(f(x)) = sign(Σ_{t=1}^{T} α_t G_t(x))
From this, the model's prediction at this point is θ3.
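The AdaBoost combination rule above — a sign over the α-weighted votes of weak classifiers — can be sketched with hypothetical threshold stumps (the stumps and weights are illustrative, not learned):

```python
def adaboost_predict(x, stumps, alphas):
    """sign(f(x)) with f(x) = sum_t alpha_t * G_t(x), where each weak
    classifier G_t returns +1 or -1."""
    f = sum(a * g(x) for g, a in zip(stumps, alphas))
    return 1 if f >= 0 else -1

# Three hypothetical threshold stumps on a scalar feature.
stumps = [lambda x: 1 if x > 0 else -1,
          lambda x: 1 if x > 2 else -1,
          lambda x: 1 if x > -1 else -1]
alphas = [0.8, 0.3, 0.5]  # illustrative coefficients; larger = more trusted

print(adaboost_predict(1.0, stumps, alphas))   # votes +1, -1, +1 -> f = 1.0 -> +1
print(adaboost_predict(-2.0, stumps, alphas))  # votes -1, -1, -1 -> f = -1.6 -> -1
```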
The XGBoost algorithm is a tree-based boosting algorithm whose most notable feature is that it automatically exploits CPU multithreading for parallelism while improving the algorithm itself to raise accuracy. Using XGBoost we obtain the prediction result θ4.
The four models are assigned weights ω1, ω2, ω3, ω4 in order of accuracy. If the four models yield results θ1, θ2, θ3, θ4, the final discriminant is:
f(x) = (ω1θ1 + ω2θ2 + ω3θ3 + ω4θ4) / 4
If the value of f(x) exceeds the set threshold, the sample is judged a positive example; otherwise it is a negative example, completing the judgment. (The threshold is set manually; a value above 0.7 is generally regarded as credible.)
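The fused decision, taken literally from the formula above, can be sketched as follows. Note one caveat: with weights summing to 1 and probabilities in [0, 1], the division by 4 caps f(x) at 0.25, so the threshold in this sketch is treated as a free parameter on that scale rather than fixed at 0.7 (the weights and probabilities below are hypothetical):

```python
def fused_decision(weights, thetas, threshold):
    """f(x) = (w1*t1 + w2*t2 + w3*t3 + w4*t4) / 4, as in the discriminant
    above; the sample is a positive example iff f(x) exceeds the threshold."""
    f = sum(w * t for w, t in zip(weights, thetas)) / 4
    return f, f > threshold

# Hypothetical weights (summing to 1) and per-model "yes" probabilities.
f, positive = fused_decision([0.4, 0.3, 0.2, 0.1], [0.9, 0.8, 0.7, 0.6], 0.15)
print(round(f, 3), positive)  # 0.2 True
```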
The original data structure mixing text and numbers is uniformly turned into a single row of multi-dimensional feature code. Under the mixed text-and-number structure, the items become data with the same properties that a classifier can process together. In this feature code the One-Hot Encoding part comes first and the Label Encoding part last, and the relations in the original data are not destroyed: independent items remain independent and related items keep their relations.
Given that labeled, i.e., manually annotated, data is scarce, the generated multi-dimensional feature codes are used to keep manufacturing "labeled" data in the GAN, which balances away the low accuracy caused by the imbalance between positive and negative samples in the sample set.
According to the different weights and data sensitivities, a large model fusing multiple models is designed. This large model accommodates both sensitive and insensitive data, making its classification results robust. Together these techniques effectively predict several kinds of binary-classification financial behavior.
In the behavior prediction method provided by this application, the sample data is fused into multi-dimensional feature codes using One-Hot Encoding and Label Encoding, a generative adversarial network is then used to enrich the existing labeled data, and finally multi-model fusion weights are used to classify the data and produce the output. This avoids a one-size-fits-all treatment of the data so that its effective features are fully exploited, and the generative adversarial network remedies the sample-imbalance defect, making classification more accurate and user behavior prediction effective.
Although this application has been described above with reference to specific embodiments, those skilled in the art should understand that many modifications can be made to the disclosed configurations and details within the principle and scope of the present disclosure. The scope of protection of this application is determined by the appended claims, which are intended to cover all modifications within the literal meaning or the scope of equivalents of the technical features in the claims.

Claims (10)

  1. A behavior prediction method, characterized in that the method comprises the following steps:
    Step 1: fusing One-Hot Encoding and Label Encoding into a multi-dimensional feature code;
    Step 2: representing the collected sample data as the multi-dimensional feature code of step 1;
    Step 3: enriching the existing labeled data with a generative adversarial network;
    Step 4: integrating multiple models and training them repeatedly to produce a weight factor for each model, and, after a weighted ensemble model is obtained, classifying the data obtained in step 3;
    Step 5: outputting the predicted behavior.
  2. The behavior prediction method of claim 1, characterized in that the data encoded with One-Hot Encoding in step 1 is a series of same-attribute values represented by binary digits; the data merely records an objective fact and carries no numerical meaning.
  3. The behavior prediction method of claim 1, characterized in that the data encoded with Label Encoding in step 1 represents a weight or a numerical value and is mathematically meaningful; the data items are correlated with one another and related within a class; the data is expressed as a decimal number of at most two digits.
  4. The behavior prediction method of claim 1, characterized in that step 3 comprises repeatedly generating fake data with a generator and judging with a discriminator whether the generated data is fake, the two playing an adversarial game until the discriminator can no longer tell fake data from real data; the manufactured data is used to balance the sample data set.
  5. The behavior prediction method of claim 4, characterized in that the discrimination formula for the manufactured data is:

    min_G max_D V(D, G) = E_{x~P_data(x)}[log D(x)] + E_{z~P_z(z)}[log(1 - D(G(z)))]

    where D(x) denotes the probability, as judged by the discriminator, that the data comes from the original data; D(G(z)) denotes the probability, as judged by the discriminator, that the data comes from the generator; x~P_data(x) indicates the data is drawn from the original data; z~P_z(z) indicates the data is drawn from the generator; E[·] denotes the expectation (mean);

    Min(G)Max(D)P(D, G) means that, for the current generator and discriminator P(D, G), the discriminator Max(D) is maximized while the generator error Min(G) is minimized.
  6. The behavior prediction method of claim 1, characterized in that step 4 comprises training different models on different data, finding the several best-performing models across all training runs, and assigning weight ratios from largest to smallest according to the classification accuracy each model achieves, from highest to lowest.
  7. The behavior prediction method of claim 6, characterized in that the weight ratios sum to 1.
  8. The behavior prediction method of claim 7, characterized in that the several best-performing models include a classification decision tree model, a random forest model, an AdaBoost model, and an XGBoost model.
  9. The behavior prediction method of claim 8, characterized in that the model discriminant is:

    f(x) = (ω1θ1 + ω2θ2 + ω3θ3 + ω4θ4) / 4

    where ω1, ω2, ω3, ω4 are the weights assigned to the four models, and θ1, θ2, θ3, θ4 are the prediction results obtained by the four models, respectively;

    If the value of f(x) exceeds the set threshold, the sample is judged a positive example; otherwise it is a negative example.
  10. The behavior prediction method of any one of claims 1 to 9, characterized in that the behavior includes financial investment behavior.
PCT/CN2019/121492 2018-12-04 2019-11-28 Behavior prediction method WO2020114302A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811473054.4A CN109766911A (en) 2018-12-04 2018-12-04 A kind of behavior prediction method
CN201811473054.4 2018-12-04

Publications (1)

Publication Number Publication Date
WO2020114302A1 true WO2020114302A1 (en) 2020-06-11

Family

ID=66450482

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/121492 WO2020114302A1 (en) 2018-12-04 2019-11-28 Behavior prediction method

Country Status (2)

Country Link
CN (1) CN109766911A (en)
WO (1) WO2020114302A1 (en)


Also Published As

Publication number Publication date
CN109766911A (en) 2019-05-17


Legal Events

121 — EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19892686; Country of ref document: EP; Kind code of ref document: A1)

NENP — Non-entry into the national phase (Ref country code: DE)

32PN — EP: public notification in the EP bulletin as address of the addressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 03/11/2021))

122 — EP: PCT application non-entry in European phase (Ref document number: 19892686; Country of ref document: EP; Kind code of ref document: A1)