CN111563775A

CN111563775A - Crowd division method and device

Info

Publication number: CN111563775A
Application number: CN202010383874.5A
Authority: CN
Inventors: 李见黎
Original assignee: Beijing Shenyan Intelligent Technology Co ltd
Current assignee: Beijing Shenyan Intelligent Technology Co ltd
Priority date: 2020-05-08
Filing date: 2020-05-08
Publication date: 2020-08-21

Abstract

The invention discloses a crowd division method and device. The method comprises: receiving behavior data to be evaluated, wherein the behavior data is data generated by operations on target information; and inputting the behavior data into an evaluation model, which outputs a score for the behavior data, wherein the evaluation model is a machine learning model obtained by training on multiple groups of training data, and each group of training data comprises historical behavior data and a corresponding score. The invention solves the technical problem in the related art that dividing crowds according to labels lacks a definite standard, making it difficult to reflect the actual condition of the crowd.

Description

Crowd division method and device

Technical Field

The present invention relates to the field of crowd division, and in particular to a crowd division method and device.

Background

In advertising delivery, the audience needs to be divided into groups. Such division often lacks a definite standard or relies on fixed labels, which makes it difficult to reflect the current state of the crowd. A score based on the real-time behavior of the crowd can better reflect its value.

No effective solution to the above problem has yet been proposed.

Summary of the Invention

Embodiments of the present invention provide a crowd division method and device, so as to at least solve the technical problem in the related art that dividing crowds according to labels lacks a definite standard, making it difficult to reflect the actual situation of the crowd.

According to one aspect of the embodiments of the present invention, a crowd division method is provided, including: receiving behavior data of a crowd, wherein the behavior data is data of operations performed by the crowd on delivered target information; inputting the behavior data into an evaluation model, which outputs a score for the behavior data, wherein the evaluation model is a machine learning model obtained by training on multiple groups of training data, and each group of training data includes historical behavior data and a corresponding score; and classifying the crowd according to the score.

Optionally, before the behavior data is input into the evaluation model and the evaluation model outputs the score of the behavior data, the method includes: selecting a model algorithm according to the data characteristics of the historical behavior data and constructing a base model; building multiple groups of training data from the historical behavior data and the corresponding scores, and training the base model; and optimizing the trained base model to determine the evaluation model.

Optionally, optimizing the trained base model to determine the evaluation model includes at least one of the following: optimizing and adjusting the parameters of the trained base model through a model parameter tuning algorithm to determine the evaluation model; splitting the training data and recombining it into different training data, and retraining the evaluation model to determine the evaluation model; and fusing multiple different models obtained from multiple training runs to determine the evaluation model.

Optionally, splitting the training data, recombining it into different training data, and retraining the evaluation model includes: splitting the training data by cross-validation, recombining it into different training data, and retraining the evaluation model; wherein the cross-validation includes at least one of the following: simple cross-validation, S-fold cross-validation, and leave-one-out cross-validation.

Optionally, fusing multiple different models obtained from multiple training runs includes at least one of the following: fusing the multiple different models by weighted averaging; fusing the multiple different models by weighted voting; and fusing the multiple different models, serving as multiple primary learners, through a secondary learner.

Optionally, after the behavior data is input into the evaluation model and the evaluation model outputs the score of the behavior data, the method further includes: determining a predicted score curve of the behavior data; and calibrating the predicted score curve according to the score curve of the historical behavior data to determine a calibrated score of the behavior data.

Optionally, classifying the crowd according to the score includes: determining, according to the score and preset score levels, that the crowd belongs to the crowd category corresponding to the score level into which the behavior data falls, wherein there are multiple preset score levels.

According to another aspect of the embodiments of the present invention, a crowd division device is further provided, including: a receiving module, configured to receive behavior data of a crowd, wherein the behavior data is data of operations performed by the crowd on delivered target information; a scoring module, configured to input the behavior data into an evaluation model, which outputs a score for the behavior data, wherein the evaluation model is a machine learning model obtained by training on multiple groups of training data, and each group of training data includes historical behavior data and a corresponding score; and a classification module, configured to classify the crowd according to the score.

According to another aspect of the embodiments of the present invention, a storage medium is further provided. The storage medium includes a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to execute any one of the crowd division methods described above.

According to another aspect of the embodiments of the present invention, a processor is further provided. The processor is configured to run a program, wherein the program, when run, executes any one of the crowd division methods described above.

In the embodiments of the present invention, behavior data of a crowd is received, wherein the behavior data is data of operations performed by the crowd on delivered target information; the behavior data is input into an evaluation model, which outputs a score for the behavior data, wherein the evaluation model is a machine learning model obtained by training on multiple groups of training data, and each group of training data includes historical behavior data and a corresponding score; and the crowd is classified according to the score. By scoring the crowd's behavior data with the evaluation model and classifying the crowd according to that score, the purpose of classifying the crowd accurately is achieved. This realizes the technical effect of improving the accuracy of crowd classification, and thereby solves the technical problem in the related art that dividing crowds according to labels lacks a definite standard, making it difficult to reflect the actual situation of the crowd.

Description of the Drawings

The accompanying drawings described here are provided for a further understanding of the present invention and constitute a part of the present application. The exemplary embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of it. In the drawings:

FIG. 1 is a flowchart of a crowd division method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a crowd division scheme according to an implementation of the present invention;

FIG. 3 is a flowchart of a crowd division scheme according to an implementation of the present invention;

FIG. 4 is a schematic diagram of a predicted distribution curve according to an implementation of the present invention;

FIG. 5 is a schematic diagram of a calibrated predicted distribution curve according to an implementation of the present invention;

FIG. 6 is a schematic diagram of a crowd division device according to an embodiment of the present invention.

Detailed Description

In order to enable those skilled in the art to better understand the solutions of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by persons of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.

It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of the present invention described here can be implemented in orders other than those illustrated or described here. In addition, the terms "including" and "having", and any variations of them, are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product or device.

According to an embodiment of the present invention, a method embodiment of a crowd division method is provided. It should be noted that the steps shown in the flowcharts of the accompanying drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.

FIG. 1 is a flowchart of a crowd division method according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:

Step S102: receive behavior data of a crowd, wherein the behavior data is data of operations performed by the crowd on delivered target information;

Step S104: input the behavior data into an evaluation model, which outputs a score for the behavior data, wherein the evaluation model is a machine learning model obtained by training on multiple groups of training data, and each group of training data includes historical behavior data and a corresponding score;

Step S106: classify the crowd according to the score.
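The three steps can be sketched as a small pipeline. This is a minimal illustration under stated assumptions, not the patented implementation: `toy_model`, `toy_classify`, the feature names `clicks` and `saves`, and the 0.5 cut-off are hypothetical stand-ins for a trained evaluation model and preset score levels.

```python
from typing import Callable, Dict

Features = Dict[str, float]

def divide_crowd(
    behavior: Dict[str, Features],
    model: Callable[[Features], float],
    classify: Callable[[float], str],
) -> Dict[str, str]:
    """Score each account's received behavior data (S104) and classify it (S106)."""
    result = {}
    for account, features in behavior.items():
        score = model(features)            # S104: evaluation model outputs a score
        result[account] = classify(score)  # S106: classification according to the score
    return result

# Hypothetical stand-in for a trained evaluation model.
def toy_model(features: Features) -> float:
    return min(1.0, 0.2 * features.get("clicks", 0.0) + 0.4 * features.get("saves", 0.0))

# Hypothetical two-level classification rule.
def toy_classify(score: float) -> str:
    return "high" if score >= 0.5 else "low"

# S102: behavior data as received, keyed by user account.
received = {"u1": {"clicks": 3.0, "saves": 1.0}, "u2": {"clicks": 1.0}}
labels = divide_crowd(received, toy_model, toy_classify)
```

Passing the model and the classification rule as callables keeps the pipeline independent of how the evaluation model was trained.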

Through the above steps, behavior data of a crowd is received, wherein the behavior data is data of operations performed by the crowd on delivered target information; the behavior data is input into an evaluation model, which outputs a score for the behavior data, wherein the evaluation model is a machine learning model obtained by training on multiple groups of training data, and each group of training data includes historical behavior data and a corresponding score; and the crowd is classified according to the score. By scoring the crowd's behavior data with the evaluation model and classifying the crowd according to that score, the purpose of classifying the crowd accurately is achieved. This realizes the technical effect of improving the accuracy of crowd classification, and thereby solves the technical problem in the related art that dividing crowds according to labels lacks a definite standard, making it difficult to reflect the actual situation of the crowd.

The crowd may include multiple user accounts, the delivered target information may be advertisement information, and the behavior data may be the operations performed by the crowd's user accounts after receiving the advertisement information, for example exposure, click, favorite, and save. The behavior data can be used to determine how valuable a user account is as a delivery target for the advertisement information. For example, the longer the advertisement is exposed to the user account and the more often it is exposed, the more interested the user account is in the advertisement and the more likely it is to make a purchase, which produces a return on the advertising delivery. As another example, a higher click-through rate, or the user account favoriting or saving the advertisement information, also indicates that the user account is interested in the advertisement, and therefore that the user account has a higher delivery value for the advertisement.
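One way to turn such operations into model inputs is to aggregate raw events per user account. The sketch below is illustrative only: the `Event` record, the feature names, and the definition of click-through rate as clicks per exposure are assumptions, since the text does not specify a data schema.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable

@dataclass(frozen=True)
class Event:
    account: str
    kind: str             # "exposure", "click", "favorite" or "save"
    seconds: float = 0.0  # exposure duration; left at 0 for other kinds

def aggregate(events: Iterable[Event]) -> Dict[str, Dict[str, float]]:
    """Turn raw per-account events into per-account feature dictionaries."""
    feats: Dict[str, Dict[str, float]] = defaultdict(lambda: defaultdict(float))
    for e in events:
        f = feats[e.account]
        f[e.kind + "_count"] += 1
        if e.kind == "exposure":
            f["exposure_seconds"] += e.seconds
    out = {}
    for account, f in feats.items():
        exposures = f.get("exposure_count", 0.0)
        row = dict(f)
        # Click-through rate: clicks per exposure (0 when never exposed).
        row["ctr"] = f.get("click_count", 0.0) / exposures if exposures else 0.0
        out[account] = row
    return out

events = [
    Event("u1", "exposure", 4.0),
    Event("u1", "exposure", 6.0),
    Event("u1", "click"),
    Event("u1", "save"),
    Event("u2", "exposure", 1.0),
]
features = aggregate(events)
```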

The evaluation model may be a machine learning model, and may include a machine learning network, a deep learning network, a convolutional neural network, or the like. It may include an input layer, intermediate layers, and an output layer, and there may be multiple intermediate layers. The machine learning model is trained on multiple groups of training data, each of which includes historical behavior data and a corresponding score. The score of the historical behavior data may be determined by the value that the historical behavior data generated for the advertisement information: the higher the value generated, the higher the score.

The behavior data is input into the evaluation model, and the evaluation model outputs the score to which the behavior data corresponds. The crowd's behavior data is thereby scored, which determines the value that the crowd's behavior data generates for the advertisement information. The crowd can be divided according to this value, so that different delivery strategies are applied to crowds of different value. For example, the lower the delivery value, the lower the delivery frequency and the shorter the delivery duration.
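The rule that lower value means lower frequency and shorter duration can be illustrated with a monotonic mapping from score to delivery plan. Everything here is hypothetical: the `DeliveryPlan` fields, the linear scaling, the assumption that scores lie in [0, 1], and the caps of 10 impressions per day and 30 days are chosen only for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeliveryPlan:
    impressions_per_day: int
    campaign_days: int

def plan_for(score: float, max_impressions: int = 10, max_days: int = 30) -> DeliveryPlan:
    """Scale delivery frequency and duration monotonically with a score in [0, 1]."""
    s = min(1.0, max(0.0, score))  # clamp out-of-range scores
    return DeliveryPlan(
        impressions_per_day=round(s * max_impressions),
        campaign_days=round(s * max_days),
    )

low, high = plan_for(0.2), plan_for(0.9)
```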

Compared with classifying the crowd according to labels as in the related art, classifying the crowd according to the score as described above achieves the technical effect of improving the accuracy of crowd classification, and thereby solves the technical problem in the related art that dividing crowds according to labels lacks a definite standard, making it difficult to reflect the actual situation of the crowd.

Optionally, before the behavior data is input into the evaluation model and the evaluation model outputs the score of the behavior data, the method includes: selecting a model algorithm according to the data characteristics of the historical behavior data and constructing a base model; building multiple groups of training data from the historical behavior data and the corresponding scores, and training the base model; and optimizing the trained base model to determine the evaluation model.

Specifically, in the above scenario of crowd classification for advertisement delivery, the model algorithm is selected according to the data characteristics of the historical behavior data. The base data features are not high-dimensional and sparse, but relatively dense. Given the inherent advantages of the algorithm, namely regularization, parallel training on large-scale data, flexibility, and missing-value handling, this implementation selects the XGBoost algorithm. The model AUC ranges from 0.75 to 0.82 across different data sets.
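AUC, the metric quoted above, is the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. The following library-free sketch computes it by comparing every positive-negative pair. The sample labels and scores are made up for illustration, and the pairwise loop is quadratic, so it suits small checks rather than large data sets.

```python
from typing import Sequence

def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of positive-negative pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative data: 1 = valuable account, 0 = not valuable.
y_true = [1, 0, 1, 0, 1, 0]
y_score = [0.9, 0.3, 0.6, 0.7, 0.8, 0.2]
value = auc(y_true, y_score)
```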

The historical behavior data and the corresponding scores may be stored in a database, where they are updated continuously as the system is used, to ensure the validity of the data. A preset number of historical behavior data items and corresponding scores are selected from the database to build multiple groups of training data, each of which includes historical behavior data and a corresponding score. The base model is trained on them, and the trained base model is optimized to determine the evaluation model.

Optionally, optimizing the trained base model to determine the evaluation model includes at least one of the following: optimizing and adjusting the parameters of the trained base model through a model parameter tuning algorithm to determine the evaluation model; splitting the training data and recombining it into different training data, and retraining the evaluation model to determine the evaluation model; and fusing multiple different models obtained from multiple training runs to determine the evaluation model.

In the above scenario of crowd classification for advertisement delivery, optimizing and adjusting the parameters of the trained base model through a model parameter tuning algorithm may use GridSearchCV (a parameter tuning algorithm) with repeated validation to optimize the model parameters. The optimal values found are: n_estimators 400, max_depth 10, min_child_weight 5, colsample_bytree 0.3, learning rate 0.1, and so on.
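The idea behind grid search is to evaluate every combination of candidate parameter values and keep the best one. scikit-learn's GridSearchCV does this with cross-validated scoring. The sketch below is a library-free stand-in, not that class: `toy_evaluate` is a hypothetical scoring function that would, in practice, train and validate a model for each combination, and the candidate values in `grid` are assumptions.

```python
from itertools import product
from typing import Callable, Dict, Sequence, Tuple

Params = Dict[str, float]

def grid_search(
    grid: Dict[str, Sequence[float]],
    evaluate: Callable[[Params], float],
) -> Tuple[Params, float]:
    """Try every combination in the grid and keep the one with the best score."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    if best_params is None:
        raise ValueError("empty parameter grid")
    return best_params, best_score

# Hypothetical validation score that peaks at max_depth=10, n_estimators=400.
def toy_evaluate(p: Params) -> float:
    return -abs(p["max_depth"] - 10) - abs(p["n_estimators"] - 400) / 100

grid = {"n_estimators": [200, 400, 800], "max_depth": [6, 10, 14]}
best, best_score = grid_search(grid, toy_evaluate)
```

The cost grows with the product of the candidate list lengths, which is why grids are usually kept coarse.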

Splitting the training data, recombining it into different training data, and retraining the evaluation model amounts to cross-validating the evaluation model. Specifically, the data is reused: the sample data obtained is split and recombined into different training sets and test sets, the training set is used to train the model, and the test set is used to evaluate how well the model predicts.

Optionally, splitting the training data, recombining it into different training data, and retraining the evaluation model includes: splitting the training data by cross-validation, recombining it into different training data, and retraining the evaluation model; wherein the cross-validation includes at least one of the following: simple cross-validation, S-fold cross-validation, and leave-one-out cross-validation.

In the above scenario of crowd classification for advertisement delivery in this embodiment, S-fold cross-validation is selected to improve the generalization ability of the evaluation model and to find the optimal model parameters.
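S-fold cross-validation partitions the samples into S disjoint folds and uses each fold once as the test set while training on the rest. A minimal sketch of the index splitting follows. The function name and the contiguous, unshuffled fold layout are assumptions made for brevity. In practice the samples are usually shuffled first.

```python
from typing import Iterator, List, Tuple

def s_fold_indices(n: int, s: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) for each of the S folds."""
    if not 2 <= s <= n:
        raise ValueError("need 2 <= s <= n")
    base, extra = divmod(n, s)
    start = 0
    for fold in range(s):
        size = base + (1 if fold < extra else 0)  # spread the remainder over the first folds
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

folds = list(s_fold_indices(n=10, s=3))
```

Setting `s` equal to `n` makes every fold a single sample, which is leave-one-out cross-validation.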

Optionally, fusing multiple different models obtained from multiple training runs includes at least one of the following: fusing the multiple different models by weighted averaging; fusing the multiple different models by weighted voting; and fusing the multiple different models, serving as multiple primary learners, through a secondary learner.

In the above scenario of crowd classification for advertisement delivery in this embodiment, the voting method is selected, which contributes about 2% to the AUC of the evaluation model, thereby optimizing the model parameters of the evaluation model.
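In weighted voting, each model casts a vote for a class label, and the label with the largest total weight wins. The sketch below assumes each model outputs a discrete label. The weights and labels are hypothetical, and the text does not state which voting variant or weights were used.

```python
from collections import defaultdict
from typing import Dict, Hashable, Sequence

def weighted_vote(predictions: Sequence[Hashable], weights: Sequence[float]) -> Hashable:
    """Return the label whose supporting models have the largest total weight."""
    if len(predictions) != len(weights) or not predictions:
        raise ValueError("need one weight per prediction, and at least one model")
    totals: Dict[Hashable, float] = defaultdict(float)
    for label, w in zip(predictions, weights):
        totals[label] += w
    # On a tie, the label that was seen first wins.
    return max(totals, key=totals.get)

# Three hypothetical models vote on one user account.
fused = weighted_vote(["high", "low", "low"], [0.6, 0.25, 0.25])
```

With equal weights this reduces to relative majority (plurality) voting.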

Optionally, after the behavior data is input into the evaluation model and the evaluation model outputs the score of the behavior data, the method further includes: determining a predicted score curve of the behavior data; and calibrating the predicted score curve according to the score curve of the historical behavior data to determine a calibrated score of the behavior data.

The distribution of predicted values depends strongly on the ratio of positive to negative samples. Calibrating against the real distribution improves the accuracy of the predicted values, and therefore the accuracy of crowd division.
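The text does not give a calibration formula. One common technique for this situation, shown here as an assumption rather than as the patent's method, is the prior-shift correction: if the model was trained with positive rate $\pi_{train}$ but the real positive rate is $\pi_{real}$, a predicted probability $p$ is rescaled by the ratio of the class priors and renormalized.

```python
def recalibrate(p: float, train_pos_rate: float, real_pos_rate: float) -> float:
    """Adjust a predicted probability for a change in the positive-class rate."""
    for name, v in (("train_pos_rate", train_pos_rate), ("real_pos_rate", real_pos_rate)):
        if not 0.0 < v < 1.0:
            raise ValueError(f"{name} must be strictly between 0 and 1")
    # Reweight the positive and negative evidence by the ratio of class priors.
    pos = p * real_pos_rate / train_pos_rate
    neg = (1.0 - p) * (1.0 - real_pos_rate) / (1.0 - train_pos_rate)
    return pos / (pos + neg)

# Hypothetical case: trained on a balanced sample, real traffic has 5% positives.
raw = 0.5
calibrated = recalibrate(raw, train_pos_rate=0.5, real_pos_rate=0.05)
```

The correction preserves the ranking of scores, so it changes the score distribution without changing AUC.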

Optionally, classifying the crowd according to the score includes: determining, according to the score and preset score levels, that the crowd belongs to the crowd category corresponding to the score level into which the behavior data falls, wherein there are multiple preset score levels.
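Mapping a score onto preset score levels is a lookup of the interval that contains the score. A minimal sketch using binary search follows. The thresholds and the convention that each level includes its lower bound are hypothetical, and the four level names follow the value tiers named later in this description.

```python
from bisect import bisect_right
from typing import Sequence

def level_for(score: float, thresholds: Sequence[float], names: Sequence[str]) -> str:
    """Map a score to the name of the preset level whose interval contains it."""
    if len(names) != len(thresholds) + 1:
        raise ValueError("need exactly one more name than thresholds")
    if list(thresholds) != sorted(thresholds):
        raise ValueError("thresholds must be ascending")
    # bisect_right counts the thresholds that are <= score.
    return names[bisect_right(thresholds, score)]

# Hypothetical cut points on a 0-1 score.
THRESHOLDS = [0.2, 0.5, 0.8]
NAMES = ["no value", "low value", "medium value", "high value"]
levels = [level_for(s, THRESHOLDS, NAMES) for s in (0.1, 0.2, 0.65, 0.93)]
```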

It should be noted that this embodiment also provides an optional implementation, which is described in detail below.

This implementation provides scoring based on the real-time behavior of the crowd, which better reflects this value. Real-time crowd scoring focuses mainly on the crowd's behavioral feature data together with some other related data, and a model is built on all of them to produce the score. The specific steps can be divided into data collection, data processing, scoring model construction, scoring model use, and so on.

FIG. 2 is a schematic diagram of a crowd division scheme according to an implementation of the present invention, and FIG. 3 is a flowchart of the scheme. As shown in FIG. 2 and FIG. 3, model selection in this implementation takes into account that the base data features are not high-dimensional and sparse, but relatively dense. Given the inherent advantages of the algorithm, namely (1) regularization, (2) parallel training on large-scale data, and (3) flexibility and missing-value handling, the XGBoost algorithm is selected. The model AUC ranges from 0.75 to 0.82 across different data sets.

Optimizing the model includes:

1) Model parameter optimization

GridSearchCV is used with repeated validation to optimize the model parameters. The optimal values found are: n_estimators 400, max_depth 10, min_child_weight 5, colsample_bytree 0.3, learning rate 0.1, and so on.

2) Cross-validation

Cross-validation reuses the data: the sample data obtained is split and recombined into different training sets and test sets, the training set is used to train the model, and the test set is used to evaluate how well the model predicts. In this way, multiple different training and test sets can be obtained. Common variants are simple cross-validation, S-fold cross-validation, and leave-one-out cross-validation. Here we use S-fold cross-validation to improve the generalization ability of the model and to find the optimal model parameters.

3) Model fusion

Fusing multiple different models can improve machine learning performance. Common model fusion methods are:

1. Averaging: averaging includes simple averaging and weighted averaging. Averaging is generally used in regression prediction models; Boosting-family fusion models generally use weighted averaging.

2. Voting: voting includes absolute majority voting (more than half of the votes), relative majority voting (the most votes), and weighted voting. Voting is generally used for classification models and in bagging models.

3. Learning: a more powerful combination strategy is the "learning method", in which the models are combined by another learner. The individual learners are called primary learners, and the learner used for the combination is called the secondary learner or meta-learner.

Here we choose the voting method, which contributes about 2% to the AUC of the model.
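For comparison with the voting method chosen here, the weighted averaging method from the list combines the numeric scores of several models instead of their class votes. The sketch below is illustrative. The model scores and weights are hypothetical, and normalizing by the sum of the weights is an assumption that lets the weights be given on any positive scale.

```python
from typing import Sequence

def weighted_average(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Fuse per-model scores into one score using normalized weights."""
    if len(scores) != len(weights) or not scores:
        raise ValueError("need one weight per score, and at least one model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total

# Three hypothetical models score the same user account.
fused_score = weighted_average([0.8, 0.6, 0.4], [2.0, 1.0, 1.0])
```

With equal weights this reduces to simple averaging.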

4) Calibration

Calibration is performed according to the real data distribution and the predicted data distribution.

FIG. 4 is a schematic diagram of predicted distribution curves according to an implementation of the present invention. FIG. 4 shows the distributions of predicted values for two data sets. The distribution of predicted values depends strongly on the ratio of positive to negative samples, so a corresponding calibration is made against the real distribution. FIG. 5 is a schematic diagram of the predicted distribution curves after calibration according to an implementation of the present invention.

In this implementation, applications of the above model include:

Periodically generating tags for large-scale data to build a user-profile service, and dividing users into different levels: high-value users, medium-value users, low-value users, and no-value users. Different measures can then be applied according to the value level.

The quantified measure of user value in this implementation benefits subsequent differentiated user operations.

FIG. 6 is a schematic diagram of a crowd division device according to an embodiment of the present invention. As shown in FIG. 6, according to another aspect of the embodiments of the present invention, a crowd division device is further provided, including a receiving module 62, a scoring module 64, and a classification module 66. The device is described in detail below.

The receiving module 62 is configured to receive behavior data of a crowd, wherein the behavior data is data of operations performed by the crowd on delivered target information. The scoring module 64 is connected to the receiving module 62 and is configured to input the behavior data into an evaluation model, which outputs a score for the behavior data, wherein the evaluation model is a machine learning model obtained by training on multiple groups of training data, and each group of training data includes historical behavior data and a corresponding score. The classification module 66 is connected to the scoring module 64 and is configured to classify the crowd according to the score.

With the above device, the receiving module 62 receives the behavior data of the crowd, wherein the behavior data is data of the crowd's operations on delivered target information; the scoring module 64 inputs the behavior data into the evaluation model, which outputs a score of the behavior data, wherein the evaluation model is a machine learning model obtained by training on multiple groups of training data, each group comprising historical behavior data and a corresponding score; and the classification module 66 classifies the crowd according to the score. Because the crowd's behavior data is scored by the evaluation model and the crowd is classified according to that score, the crowd can be classified accurately. This achieves the technical effect of improving the accuracy of crowd classification and solves the technical problem in the related art that crowd division based on labels lacks a definite standard and therefore fails to reflect the actual situation of the crowd.
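The chaining of the receiving, scoring, and classification modules can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the class name `CrowdDivider`, the behavior field names, and the toy scoring function standing in for the trained evaluation model are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Mapping, Sequence

# One user's behavior record; field names such as "clicks" are hypothetical.
Behavior = Mapping[str, float]


@dataclass
class CrowdDivider:
    model: Callable[[Behavior], float]  # stand-in for the trained evaluation model
    bounds: Sequence[float]             # ascending cut points between categories
    labels: Sequence[str]               # must hold len(bounds) + 1 category names

    def receive(self, raw: Mapping[str, Behavior]) -> Dict[str, Behavior]:
        # Role of receiving module 62: accept per-user behavior data.
        return dict(raw)

    def score(self, data: Mapping[str, Behavior]) -> Dict[str, float]:
        # Role of scoring module 64: pass each record through the evaluation model.
        return {uid: self.model(b) for uid, b in data.items()}

    def classify(self, scores: Mapping[str, float]) -> Dict[str, str]:
        # Role of classification module 66: count how many cut points the score
        # reaches, and use that count as the index of the category label.
        return {uid: self.labels[sum(s >= b for b in self.bounds)]
                for uid, s in scores.items()}

    def run(self, raw: Mapping[str, Behavior]) -> Dict[str, str]:
        return self.classify(self.score(self.receive(raw)))


def toy_model(b: Behavior) -> float:
    # Placeholder scoring rule, NOT the trained model of the embodiment.
    return 10 * b.get("clicks", 0) + b.get("views", 0)


divider = CrowdDivider(model=toy_model, bounds=[20, 50],
                       labels=["low", "medium", "high"])
result = divider.run({"a": {"clicks": 5, "views": 10},
                      "b": {"clicks": 0, "views": 5}})
```

Keeping the model behind a plain callable mirrors the module boundary in FIG. 6: the scoring step can be retrained or replaced without touching the receiving or classification steps.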

According to another aspect of the embodiments of the present invention, a storage medium is further provided. The storage medium includes a stored program, wherein, when the program runs, a device on which the storage medium is located is controlled to execute any one of the crowd division methods described above.

According to another aspect of the embodiments of the present invention, a processor is further provided. The processor is configured to run a program, wherein the program, when running, executes any one of the crowd division methods described above.

The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through interfaces, units, or modules, and may be electrical or take other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present invention, and such improvements and refinements shall also fall within the protection scope of the present invention.

Claims

1. A crowd division method, comprising:

receiving behavior data of a crowd, wherein the behavior data is data of the crowd's operations on delivered target information;

inputting the behavior data into an evaluation model, and outputting a score of the behavior data by the evaluation model, wherein the evaluation model is a machine learning model, the evaluation model is obtained by training on multiple groups of training data, and each group of training data comprises historical behavior data and a corresponding score; and

classifying the crowd according to the score.

2. The method of claim 1, wherein, before inputting the behavior data into the evaluation model and outputting the score of the behavior data by the evaluation model, the method further comprises:

selecting a model algorithm according to the data characteristics of the historical behavior data to construct a base model;

establishing a plurality of groups of training data from the historical behavior data and the corresponding scores, and training the base model; and

optimizing the trained base model to determine the evaluation model.

3. The method of claim 2, wherein optimizing the trained base model to determine the evaluation model comprises at least one of:

tuning the parameters of the trained base model through a model parameter tuning algorithm to determine the evaluation model;

segmenting the training data, recombining the segments into different training data, and training the evaluation model again to determine the evaluation model; and

fusing a plurality of different models obtained from a plurality of training runs to determine the evaluation model.

4. The method of claim 3, wherein segmenting the training data, recombining the segments into different training data, and training the evaluation model again comprises:

segmenting the training data by cross-validation, recombining the segments into different training data, and training the evaluation model again;

wherein the cross-validation comprises at least one of: simple cross-validation, S-fold cross-validation, and leave-one-out cross-validation.
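For illustration, the segmentation used by S-fold cross-validation can be sketched as an index generator, with leave-one-out cross-validation as the special case in which S equals the number of samples. The function names are hypothetical, and shuffling and the retraining step itself are left out of this sketch.

```python
from typing import Iterator, List, Tuple


def s_fold_splits(n_samples: int, s: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, validation_indices) for S-fold cross-validation.

    The samples are cut into `s` contiguous folds of near-equal size; each fold
    serves once as the validation set while the remaining folds form the
    training set.
    """
    if not 2 <= s <= n_samples:
        raise ValueError("s must satisfy 2 <= s <= n_samples")
    indices = list(range(n_samples))
    # The first (n_samples % s) folds receive one extra sample.
    fold_sizes = [n_samples // s + (1 if i < n_samples % s else 0) for i in range(s)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def leave_one_out_splits(n_samples: int) -> Iterator[Tuple[List[int], List[int]]]:
    """Leave-one-out cross-validation: every sample is its own validation fold."""
    return s_fold_splits(n_samples, n_samples)
```

Each yielded pair is one recombination of the training data; retraining the model on every `train` list and scoring it on the matching `validation` list gives S performance estimates that can be averaged.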

5. The method of claim 3, wherein fusing the plurality of different models obtained from the plurality of training runs comprises at least one of:

fusing the plurality of different models by weighted averaging;

fusing the plurality of different models by weighted voting; and

fusing the plurality of different models by a secondary learner, with the plurality of different models serving as primary learners.
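The first two fusion modes can be sketched in a few lines. This is an illustrative sketch only: the function names and the way weights are supplied are assumptions, and fusion by a secondary learner (stacking) is not shown.

```python
from collections import defaultdict
from typing import Hashable, Sequence


def weighted_average(predictions: Sequence[float], weights: Sequence[float]) -> float:
    """Fuse numeric scores from several models by weighted averaging."""
    total = sum(weights)
    if len(predictions) != len(weights) or total <= 0:
        raise ValueError("need one positive-sum weight per prediction")
    return sum(p * w for p, w in zip(predictions, weights)) / total


def weighted_vote(labels: Sequence[Hashable], weights: Sequence[float]) -> Hashable:
    """Fuse class labels from several models by weighted voting.

    Each model adds its weight to the label it predicts; the label with the
    largest accumulated weight is returned.
    """
    if len(labels) != len(weights) or not labels:
        raise ValueError("need one weight per label")
    tally = defaultdict(float)
    for label, w in zip(labels, weights):
        tally[label] += w
    return max(tally, key=tally.get)
```

Weighted averaging suits models that output a continuous score, while weighted voting suits models that output a category directly; in both cases the weights would typically reflect each model's validation performance.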

6. The method of claim 1, wherein, after inputting the behavior data into the evaluation model and outputting the score of the behavior data by the evaluation model, the method further comprises:

determining a predicted scoring curve of the behavior data; and

calibrating the predicted scoring curve according to the scoring curve of the historical behavior data, and determining a calibrated score of the behavior data.
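The calibration procedure is not specified further here. One possible reading, shown purely as an assumption, is quantile mapping: each predicted score is replaced by the historical score that sits at the same empirical quantile, so that the calibrated scores follow the historical score distribution. The function name `quantile_calibrate` is hypothetical.

```python
from bisect import bisect_right
from typing import List, Sequence


def quantile_calibrate(predicted: Sequence[float],
                       historical: Sequence[float]) -> List[float]:
    """Map each predicted score onto the historical score distribution.

    Assumed technique (quantile mapping), not necessarily the claimed one:
    a prediction ranked at quantile q among all predictions is replaced by
    the historical score at quantile q.
    """
    if not predicted or not historical:
        raise ValueError("both score lists must be non-empty")
    pred_sorted = sorted(predicted)
    hist_sorted = sorted(historical)
    n, m = len(pred_sorted), len(hist_sorted)
    calibrated = []
    for p in predicted:
        rank = bisect_right(pred_sorted, p)   # 1..n, ties share the highest rank
        idx = (rank * m + n - 1) // n - 1     # ceil(rank * m / n) - 1, within 0..m-1
        calibrated.append(hist_sorted[idx])
    return calibrated
```

Because the mapping is monotonic, the ordering of users by score is preserved; only the scale and shape of the score distribution change.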

7. The method of claim 1, wherein classifying the crowd according to the score comprises:

determining, according to the score and preset score levels, that the crowd belongs to the crowd category corresponding to the score level into which the score of the behavior data falls, wherein there are a plurality of preset score levels.

8. A crowd division device, comprising:

a receiving module, configured to receive behavior data of a crowd, wherein the behavior data is data of the crowd's operations on delivered target information;

a scoring module, configured to input the behavior data into an evaluation model and output a score of the behavior data by the evaluation model, wherein the evaluation model is a machine learning model, the evaluation model is obtained by training on a plurality of groups of training data, and each group of training data comprises historical behavior data and a corresponding score; and

a classification module, configured to classify the crowd according to the score.

9. A storage medium comprising a stored program, wherein the program, when executed, controls a device on which the storage medium is located to perform the crowd division method according to any one of claims 1 to 7.

10. A processor configured to run a program, wherein the program, when running, performs the crowd division method according to any one of claims 1 to 7.