CN111581877A - Sample model training method, sample generation method, device, equipment and medium

Sample model training method, sample generation method, device, equipment and medium

Info

Publication number: CN111581877A
Application number: CN202010218666.XA
Authority: CN (China)
Other languages: Chinese (zh)
Inventor: 张跃 (Zhang Yue)
Assignee (current and original): Ping An Life Insurance Company of China Ltd
Application filed by: Ping An Life Insurance Company of China Ltd
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods


Abstract

The invention discloses a sample model training method, a sample generation method, a device, equipment and a medium. The method comprises the following steps: acquiring original training data, wherein the original training data comprises a sample label and feature data corresponding to at least two sample features; inputting the original training data into an initial forest model constructed based on a tree model, and acquiring original high-order combined features; performing stability screening based on the sample label and the original high-order combined features, determining effective leaf nodes, and pruning the initial forest model based on the effective leaf nodes to obtain an effective forest model; inputting the original training data into the effective forest model to obtain effective high-order combined features; and performing LR regularized screening based on the sample label and the effective high-order combined features, determining target leaf nodes, and pruning the effective forest model based on the target leaf nodes to obtain the target forest model. The model training samples output by the target forest model have a low dimension, so the timeliness and accuracy of model training can be guaranteed.

Description

Sample model training method, sample generation method, device, equipment and medium
Technical Field
The invention relates to the technical field of data processing, and in particular to a sample model training method, a sample generation method, a device, equipment and a medium.
Background
Because the DeepFM algorithm effectively combines the advantages of a factorization machine and a neural network in feature learning, it can extract low-order combined features and high-order combined features simultaneously, and is therefore widely used in different fields. For example, user portrait data formed when users access a system, or in other scenarios, can be used as model training samples; the model training samples are input into a DeepFM model for model training, the model parameters of the DeepFM model are updated, and a DeepFM-based user portrait analysis model is constructed, so that the user portrait analysis model can extract low-order and high-order combined features simultaneously and produce more accurate analysis results.
In the DeepFM model training process, each model training sample comprises data fields corresponding to at least two sample features; the value in each data field adopts One-Hot coding, and the size of each data field is determined according to the feature data of the sample feature. As an example, for the age sample feature, binary conversion may be performed on the age value to obtain the corresponding One-Hot code; in this case, the size of the data field of the age sample feature is the length of the One-Hot code corresponding to the maximum age. Alternatively, the One-Hot code may be determined by dividing the age sample feature into preset age groups; in this case, the size of the data field of the age sample feature is the number of age groups. As another example, for the city sample feature, the preset feature data Beijing, Shanghai, Tianjin, Chongqing and Guangdong may be converted into 10000, 01000, 00100, 00010 and 00001 respectively; in this case, the size of the data field of the city sample feature is the number of preset feature data values.
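As an illustration of the data-field encoding described above, here is a minimal Python sketch; the age groups and city list are illustrative assumptions, not values from the patent:

```python
# Hypothetical field definitions: 5 preset age groups and 5 preset cities.
AGE_GROUPS = [(0, 18), (18, 30), (30, 45), (45, 60), (60, 200)]
CITIES = ["Beijing", "Shanghai", "Tianjin", "Chongqing", "Guangdong"]

def encode_age(age: int) -> list[int]:
    """One-Hot encode an age by its preset age group; field size = number of groups."""
    field = [0] * len(AGE_GROUPS)
    for i, (lo, hi) in enumerate(AGE_GROUPS):
        if lo <= age < hi:
            field[i] = 1
            break
    return field

def encode_city(city: str) -> list[int]:
    """One-Hot encode a city; field size = number of preset feature data values."""
    field = [0] * len(CITIES)
    field[CITIES.index(city)] = 1
    return field

# A model training sample is the concatenation of all data fields.
sample = encode_age(30) + encode_city("Shanghai")
print(sample)  # [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
```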
In the current DeepFM model training process, each model training sample comprises at least two data fields, and the size of each data field is determined according to the feature data of the sample feature. When the feature data corresponding to a sample feature has a large time span, a high degree of dispersion, or poor stability, the size of the data field of that sample feature is larger, so the dimension of the formed model training sample is higher; when such a model training sample is input into a DeepFM model for training, the training process requires more system resources and a longer training time. Moreover, because the dimension of the model training sample is high, overfitting is prone to occur, so that a stable DeepFM model cannot be learned, or the accuracy of the output of the trained DeepFM model is low.
Disclosure of Invention
The embodiment of the invention provides a sample model training method, a sample generation method, a device, equipment and a medium, and aims to solve the problems that the model training samples used in current DeepFM model training have a high dimension, so that model training requires more system resources, the training time is long, and the recognition accuracy of the trained model is low.
The embodiment of the invention provides a sample model training method, which comprises the following steps:
acquiring original training data, wherein the original training data comprises a sample label and characteristic data corresponding to at least two sample characteristics;
inputting the original training data into an initial forest model constructed based on a tree model, and acquiring original high-order combined features in One-Hot coded form corresponding to the original training data, wherein the initial forest model comprises at least two feature trees arranged in sequence, and each feature tree corresponds to one sample feature and comprises at least two initial leaf nodes;
performing stability screening based on the sample label and the original high-order combined features, determining effective leaf nodes, and pruning the initial leaf nodes of the initial forest model based on the effective leaf nodes to obtain an effective forest model;
inputting the original training data into the effective forest model, and acquiring effective high-order combined features in One-Hot coded form corresponding to the original training data;
and performing LR regularized screening based on the sample label and the effective high-order combined features, determining target leaf nodes, and pruning the effective leaf nodes in the effective forest model based on the target leaf nodes to obtain the target forest model.
The embodiment of the invention provides a sample model training device, which comprises:
an original training data acquisition module, configured to acquire original training data, wherein the original training data comprises a sample label and feature data corresponding to at least two sample features;
an original high-order combined feature acquisition module, configured to input the original training data into an initial forest model constructed based on a tree model and acquire original high-order combined features in One-Hot coded form corresponding to the original training data, wherein the initial forest model comprises at least two feature trees arranged in sequence, and each feature tree corresponds to one sample feature and comprises at least two initial leaf nodes;
an effective forest model acquisition module, configured to perform stability screening based on the sample label and the original high-order combined features, determine effective leaf nodes, and prune the initial leaf nodes of the initial forest model based on the effective leaf nodes to obtain an effective forest model;
an effective high-order combined feature acquisition module, configured to input the original training data into the effective forest model and acquire effective high-order combined features in One-Hot coded form corresponding to the original training data;
and a target forest model acquisition module, configured to perform LR regularized screening based on the sample label and the effective high-order combined features, determine target leaf nodes, and prune the effective leaf nodes in the effective forest model based on the target leaf nodes to obtain the target forest model.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the above sample model training method when executing the computer program.
A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the above-mentioned sample model training method.
The embodiment of the invention provides a sample generation method, which comprises the following steps:
acquiring data to be processed, wherein the data to be processed comprises characteristic data corresponding to at least two sample characteristics;
and inputting the feature data corresponding to the at least two sample features into the target forest model determined by the above sample model training method, and determining the target high-order combined features in One-Hot coded form output by the target forest model as model training samples of the DeepFM model.
An embodiment of the present invention provides a sample generation apparatus, including:
a to-be-processed data acquisition module, configured to acquire to-be-processed data, wherein the to-be-processed data comprises feature data corresponding to at least two sample features;
and a model training sample acquisition module, configured to input the feature data corresponding to the at least two sample features into the target forest model determined by the above sample model training method, and determine the target high-order combined features in One-Hot coded form output by the target forest model as model training samples of the DeepFM model.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the sample generation method when executing the computer program.
A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, implements the above-described sample generation method.
In the sample model training method, device, equipment and medium, the target forest model obtained by training can convert the feature data of at least two sample features in the original training data into high-order combined features in One-Hot coded form comprising at least two data fields, so that the high-order combined features can be input into a DeepFM model for model training. Moreover, because the target forest model is determined by performing stability screening and LR regularized screening on the initial leaf nodes of the initial forest model, dimension reduction is achieved through two rounds of screening, so the formed high-order combined features have a low dimension; when they are output to the DeepFM model for model training, system resources occupied during training are saved and the training time is shortened. In addition, the target leaf nodes in the target forest model match the model training purpose of the DeepFM model, and leaf nodes with low stability are filtered out, which reduces overfitting of the model training samples output by the target forest model during DeepFM model training, saves system resources occupied during model training, and improves the accuracy of the model training samples used to train the DeepFM model.
In the sample generation method, device, equipment and medium, the target forest model determined in the above embodiment is adopted to convert the feature data of at least two sample features of the data to be processed into high-order combined features in One-Hot coded form comprising at least two data fields, so that the high-order combined features form model training samples that can be input into a DeepFM model for model training. Moreover, because the target forest model is determined by performing stability screening and LR regularized screening on the initial leaf nodes of the initial forest model, dimension reduction is achieved through two rounds of screening, so the formed high-order combined features have a low dimension; when the model training samples output by the target forest model are input into the DeepFM model for model training, system resources occupied during training are saved and the training time is shortened. In addition, the target leaf nodes in the target forest model match the model training purpose of the DeepFM model, and leaf nodes with low stability are filtered out, which reduces overfitting of the model training samples output by the target forest model during DeepFM model training, saves system resources occupied during model training, and improves the accuracy of the model training samples used to train the DeepFM model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a schematic diagram of a computer device in one embodiment of the invention;
FIG. 2 is a flow chart of a sample model training method according to an embodiment of the present invention;
FIG. 3 is another flow chart of a sample model training method in an embodiment of the invention;
FIG. 4 is a diagram of a feature tree in the initial forest model according to an embodiment of the present invention;
FIG. 5 is another flow chart of a sample model training method in an embodiment of the invention;
FIG. 6 is another flow chart of a sample model training method in an embodiment of the invention;
FIG. 7 is a flow chart of a sample generation method in an embodiment of the invention;
FIG. 8 is a schematic diagram of a sample model training apparatus according to an embodiment of the present invention;
FIG. 9 is a schematic view of a sample generation apparatus in accordance with an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The sample model training method provided by the embodiment of the invention can be applied to the computer equipment shown in fig. 1, and the computer equipment can be a server. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data adopted or generated in the process of executing the sample model training method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a sample model training method.
In an embodiment, as shown in fig. 2, a sample model training method is provided, which is described by taking the example that the sample model training method is applied to the computer device in fig. 1, and the sample model training method specifically includes the following steps:
s201: and acquiring original training data, wherein the original training data comprises sample labels and feature data corresponding to at least two sample features.
The original training data refers to unprocessed data used for training the sample model. The sample label is a pre-labeled label that reflects the purpose of sample model training. For example, when training a DeepFM-based user portrait analysis model for evaluating user access to a system: if the purpose of model training is to analyze whether a user accesses the system, the sample label corresponding to the original training data is marked as accessed or not accessed according to the actual situation, and set to 1 or 0 respectively; if the purpose of model training is to analyze whether a user intends to buy a specific product, the sample label is marked as buying or not buying according to the actual situation, and set to 1 or 0 respectively; if the purpose of model training is to analyze the user's performance, the sample label is marked as high performance or low performance according to the actual situation, and set to 1 or 0 respectively.
Sample features refer to feature dimensions in the original training data. Feature data refers to the specific values or specific information corresponding to the sample features in the original training data. As an example, when training the user portrait analysis model, the corresponding original training data may include sample features such as age, gender, income and education; accordingly, the feature data corresponding to these sample features are the specific values or specific information of age, gender, income, education, and so on. For example, at least two sample features and their corresponding feature data, such as age-30, gender-male, income-10000 and education-bachelor, may be stored in key-value form.
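For example, one original training data record might be stored in key-value form as in the following sketch (the field names and values are illustrative assumptions):

```python
raw_record = {
    "label": 1,  # sample label, e.g. 1 = accessed / purchased, 0 = not
    "features": {
        "age": 30,
        "gender": "male",
        "income": 10000,
        "education": "bachelor",
    },
}
```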
S202: inputting original training data into an initial forest model constructed based on a tree model, and acquiring original high-order combination characteristics of an One-Hot coding form corresponding to the original training data, wherein the initial forest model comprises at least two characteristic trees which are sequentially arranged, each characteristic tree corresponds to a sample characteristic and comprises at least two initial leaf nodes.
The initial forest model constructed based on the tree model is a forest model built using XGBoost, LightGBM or another tree model, and comprises at least two feature trees arranged in sequence. Each feature tree corresponds to one sample feature and is used to classify and divide the feature data corresponding to that sample feature. In the initial forest model, the leaf nodes of each feature tree are initial leaf nodes. The original high-order combined feature is the output of the initial forest model for the original training data.
Specifically, if the initial forest model constructed based on the tree model includes N feature trees and the i-th feature tree Y_i has n_i initial leaf nodes, then the number of initial leaf nodes formed based on the initial forest model is n_1 + n_2 + … + n_N = Σ_{i=1}^{N} n_i. Correspondingly, the feature data corresponding to at least two sample features of the original training data are input into the initial forest model for classification; the value of the initial leaf node on which the feature data corresponding to a sample feature falls is determined as 1, and the values of the other initial leaf nodes are determined as 0, so as to form the original high-order combined feature in One-Hot coded form corresponding to the original training data. The original high-order combined feature comprises N data fields, each data field corresponds to one sample feature, and the size of each data field matches the number of initial leaf nodes of the feature tree of the corresponding sample feature, i.e., the size of the i-th data field is n_i. The dimension of the formed One-Hot coded original high-order combined feature is therefore Σ_{i=1}^{N} n_i.
S203: And performing stability screening based on the sample label and the original high-order combined features, determining effective leaf nodes, and pruning the initial leaf nodes of the initial forest model based on the effective leaf nodes to obtain the effective forest model.
Because the distribution of the feature data corresponding to at least two sample features in the original training data may have a large time span, a high degree of dispersion, or poor stability, directly using the original high-order combined features output by the initial forest model as model training samples of the DeepFM model may affect the accuracy and timeliness of subsequent DeepFM model training. Stability screening therefore needs to be performed based on the sample labels and the original high-order combined features. Specifically, for each data field corresponding to the same sample feature in all original high-order combined features, stability screening is performed on the values of that data field together with the sample labels to determine a relatively stable value range of the data field, and the corresponding initial leaf nodes in the initial forest model are determined as effective leaf nodes based on that range. Then, the initial leaf nodes in the initial forest model are pruned based on the effective leaf nodes, i.e., the effective leaf nodes in the initial forest model are retained, and the other initial leaf nodes are deleted or combined to form the effective forest model. The effective forest model comprises at least two effective leaf nodes, so that the high-order combined features formed by inputting the original training data into the effective forest model have a lower dimension and better stability, which avoids overfitting in the subsequent DeepFM model training process.
As an example, take access time as a sample feature. The feature tree corresponding to access time in the initial forest model may divide the access time into the time intervals corresponding to 24 hours according to the feature judgment conditions, so that there are 24 initial leaf nodes corresponding to access time; the resulting original high-order combined feature has a longer dimension, which is not conducive to improving the accuracy and timeliness of subsequent model training. Moreover, for a particular user the access time is relatively fixed: the number of accesses during the 24:00-6:00 period is small, even 0, while the number of accesses in other periods is large, so the access-time sample feature has a large time span, a high degree of dispersion, and poor stability. Stability screening is performed on the sample labels and the contents of the access-time data field in the original high-order combined features: the initial leaf nodes corresponding to the 6:00-24:00 period in the initial forest model are determined as effective leaf nodes, while the five initial leaf nodes corresponding to the 24:00-6:00 period are deleted or combined into one effective leaf node, and the effective forest model is formed based on the effective leaf nodes. This reduces the dimension of the high-order combined features output by the effective forest model, guarantees the timeliness of the subsequent DeepFM model training process, saves the required system resources, and avoids overfitting.
In this example, if the initial forest model constructed based on the tree model includes N feature trees and the i-th feature tree Y_i has n_i initial leaf nodes, then after stability screening based on the sample labels and the original high-order combined features, the i-th feature tree Y_i has m_i effective leaf nodes, where m_i ≤ n_i, so the number of effective leaf nodes in the effective forest model is Σ_{i=1}^{N} m_i, with Σ_{i=1}^{N} m_i ≤ Σ_{i=1}^{N} n_i. This realizes a first dimension reduction of the initial forest model, so that the finally formed model training samples have a lower dimension, which helps guarantee the accuracy and timeliness of subsequent DeepFM model training and saves the system resources required for model training.
S204: and inputting the original training data into an effective forest model, and acquiring effective high-order combination characteristics of the One-Hot coding form corresponding to the original training data.
The effective high-order combined features are the output of the effective forest model for the original training data. As an example, the feature data corresponding to at least two sample features of the original training data are input into the effective forest model for classification; the value of the effective leaf node on which the feature data corresponding to a sample feature falls is determined as 1, and the values of the other effective leaf nodes are determined as 0, so as to form the effective high-order combined feature in One-Hot coded form corresponding to the original training data. The effective high-order combined feature comprises N data fields, the size of the i-th data field is m_i, and the dimension of the formed One-Hot coded effective high-order combined feature is Σ_{i=1}^{N} m_i, where Σ_{i=1}^{N} m_i ≤ Σ_{i=1}^{N} n_i. The original high-order combined features are thus reduced in dimension for subsequent analysis and processing, so that the finally formed model training samples have a lower dimension, which helps guarantee the accuracy and timeliness of subsequent DeepFM model training.
S205: And performing LR regularized screening based on the sample labels and the effective high-order combined features, determining target leaf nodes, and pruning the effective leaf nodes in the effective forest model based on the target leaf nodes to obtain the target forest model.
Because some effective leaf nodes in the effective forest model may be irrelevant to the model training purpose of the subsequent DeepFM model, and the model training purpose of the DeepFM model is associated with the sample label, LR regularized screening can be performed based on the sample labels and the effective high-order combined features, so that the leaf nodes highly relevant to the model training purpose are screened out from all effective leaf nodes and determined as target leaf nodes. Then, the effective forest model is pruned based on the target leaf nodes, i.e., the target leaf nodes in the effective forest model are retained, and the effective leaf nodes other than the target leaf nodes are deleted or combined to form the target forest model. The target forest model is the forest model used, after model training is completed, to generate model training samples for the DeepFM model. The target forest model is formed by further reducing the dimension of the effective forest model; the model training sample it outputs is data content formed based on all target leaf nodes and comprising at least two corresponding data fields. This guarantees the accuracy and timeliness of DeepFM model training while effectively reducing the dimension of the model training sample, so that fewer system resources are occupied during model training and overfitting is avoided.
The target forest model trained by the sample model training method provided in this embodiment can convert the feature data of at least two sample features in the original training data into high-order combined features in One-Hot coded form comprising at least two data fields, so that the high-order combined features can be input into a DeepFM model for model training. Moreover, because the target forest model is determined by performing stability screening and LR regularized screening on the initial leaf nodes of the initial forest model, dimension reduction is achieved through two rounds of screening, so the formed high-order combined features have a low dimension; when they are output to the DeepFM model for model training, system resources occupied during training are saved and the training time is shortened. In addition, the target leaf nodes in the target forest model match the model training purpose of the DeepFM model, and leaf nodes with low stability are filtered out, which reduces overfitting of the model training samples output by the target forest model during DeepFM model training, saves system resources occupied during model training, and improves the accuracy of the model training samples used to train the DeepFM model.
In an embodiment, as shown in fig. 3, inputting the original training data into an initial forest model constructed based on a tree model and acquiring the original high-order combined features in One-Hot coded form corresponding to the original training data specifically includes the following steps:
s301: and respectively inputting the feature data corresponding to at least two sample features into the feature trees corresponding to the sample features for processing to obtain initial features corresponding to the sample features.
Specifically, the initial forest model based on the tree model includes at least two feature trees arranged in sequence; each feature tree corresponds to one sample feature and can be used to analyze the feature data corresponding to that sample feature, so as to convert the feature data into One-Hot coded form and thereby conveniently form model training samples that can be input into the DeepFM model for model training. As an example, the feature tree may employ XGBoost, LightGBM or another tree model.
In this example, the feature tree corresponding to each sample feature is a tree formed based on at least one feature judgment condition. The feature data corresponding to the sample feature is input into the feature tree corresponding to that sample feature, the initial leaf node corresponding to the feature data is determined using the at least one feature judgment condition, the value of that initial leaf node is set to 1, and the values of the other initial leaf nodes are set to 0, so as to obtain the initial feature in One-Hot coded form.
For example, regarding a sample feature of gender, the feature determination condition corresponding thereto may be "whether gender is male" such that 10 indicates gender as male and 01 indicates gender as female.
For another example, as shown in fig. 4, denote the income sample feature as S. The corresponding feature judgment conditions include six conditions: A1: S > 5000, A2: S > 10000, A3: S > 15000, A4: S > 20000, A5: S > 25000 and A6: S > 30000. Based on these six feature judgment conditions, seven initial leaf nodes L1, L2, L3, L4, L5, L6 and L7 are formed, and the income range corresponding to each initial leaf node is: L1: S ≤ 5000; L2: 5000 < S ≤ 10000; L3: 10000 < S ≤ 15000; L4: 15000 < S ≤ 20000; L5: 20000 < S ≤ 25000; L6: 25000 < S ≤ 30000; L7: S > 30000. If the feature data corresponding to income in the original training data is 12000, it falls into the L3 initial leaf node; the value of the L3 initial leaf node is set to 1 and the values of the other initial leaf nodes are set to 0, forming the initial feature 0000100.
S302: and splicing the initial features corresponding to the at least two sample features according to the arrangement sequence of the at least two feature trees to obtain the original high-order combination features of the One-Hot coding form corresponding to the original training data.
In this example, the initial forest model based on the tree model includes at least two feature trees arranged in sequence, each feature tree corresponds to one sample feature, and the arrangement order of the at least two feature trees reflects the order of the data fields formed by the at least two sample features. Specifically, if the initial forest model constructed based on the tree model includes N feature trees and the i-th feature tree Y_i has n_i initial leaf nodes, the arrangement order of the feature trees is determined by i, and the high-order combined feature formed based on the initial forest model takes the form |S_1|S_2|S_3|…|S_N|, where | delimits a data field and S_i is the initial feature output by the i-th feature tree Y_i, i.e., the value of the data field of the corresponding sample feature.
In one example, if gender is the 1st sample feature and income is the 2nd sample feature, and in the original training data the gender is male and the income is 12000, then the initial feature S_1 formed by the 1st feature tree Y_1 is 10 and the initial feature S_2 formed by the 2nd feature tree Y_2 is 0000100. The initial features corresponding to the at least two sample features are spliced to form |10|0000100|S_3|…|S_N|, which is determined as the original high-order combined feature in One-Hot coded form corresponding to the original training data.
In the sample model training method provided in this embodiment, the feature data corresponding to at least two sample features can be respectively input into the feature trees corresponding to those sample features for processing, so as to convert the feature data into initial features in One-Hot coded form; all initial features are then spliced according to the arrangement order of the at least two feature trees, thereby quickly forming an original high-order combined feature comprising at least two data fields, each data field adopting a One-Hot coded value, which facilitates subsequent DeepFM model training.
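A minimal sketch of steps S301 and S302 under simplifying assumptions: each feature tree is modeled as a single-feature scikit-learn DecisionTreeClassifier (the patent equally allows XGBoost or LightGBM), DecisionTreeClassifier.apply() returns the leaf node each sample falls on, and the per-tree One-Hot initial features are spliced in tree order:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))            # 2 sample features, e.g. age and income
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # sample label

trees, leaf_ids = [], []
for i in range(X.shape[1]):              # one feature tree per sample feature
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[:, [i]], y)
    trees.append(tree)
    leaf_ids.append(np.unique(tree.apply(X[:, [i]])))  # leaf node ids of tree i

def original_high_order_feature(x_row: np.ndarray) -> np.ndarray:
    """Splice the per-tree One-Hot initial features in tree order: |S_1|S_2|...|S_N|."""
    fields = []
    for i, (tree, ids) in enumerate(zip(trees, leaf_ids)):
        leaf = tree.apply(x_row[[i]].reshape(1, -1))[0]  # leaf the i-th feature falls on
        fields.append((ids == leaf).astype(int))         # data field size = number of leaves
    return np.concatenate(fields)

print(original_high_order_feature(X[0]))  # spliced One-Hot data fields
```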
In an embodiment, the original training data further comprises a time tag, i.e., information recording the time associated with the original training data. As shown in fig. 5, performing stability screening based on the sample labels and the original high-order combined features and determining effective leaf nodes specifically includes the following steps:
s501: and performing saturation analysis based on the time labels and the original high-order combined characteristics to obtain a saturation analysis result corresponding to each sample characteristic.
Wherein, the saturation analysis is a process of analyzing whether data is saturated on a time distribution from a time dimension to determine whether the data is stable. The saturation analysis result corresponding to each sample feature is used for reflecting whether the original high-order combination feature corresponding to a certain sample feature is saturated or not in time distribution.
As an example, all original high-order combined features may be grouped based on the time tag corresponding to each original high-order combined feature. Since each original high-order combined feature includes sample feature values corresponding to at least two sample features, and each sample feature value corresponds to one initial leaf node, the proportion of the number of sample feature values falling on each initial leaf node to the total number of sample feature values can be computed within each group, and the saturation analysis results of the different initial leaf nodes can then be determined based on the proportion values computed for the different groups. For example, suppose a sample feature corresponds to 4 initial leaf nodes, and the proportion values of the 1st to 4th initial leaf nodes are determined to be 0%, 40%, 40% and 20% in the 1st grouping, 5%, 40%, 35% and 20% in the 2nd grouping, and 20%, 35%, 30% and 15% in the 3rd grouping. The size of the fluctuation can be determined based on the maximum difference among the proportion values of each node across the groups: the proportion value of the 1st initial leaf node fluctuates strongly, while those of the 2nd to 4th initial leaf nodes fluctuate only slightly, thereby giving the saturation analysis result corresponding to the sample feature.
In a specific embodiment, the step S501, namely performing saturation analysis based on the time tag and the original high-order combined feature to obtain a saturation analysis result corresponding to each sample feature, specifically includes the following steps:
s5011: and grouping the original high-order combination characteristics corresponding to the time labels based on the time grouping period to obtain at least two time characteristic groups.
The time grouping period is a preset period for time division, and may be a day, a week, a month, a quarter or a year. A time feature group is a set storing all original high-order combined features whose time tags fall within the corresponding time grouping period. In this example, the original high-order combined features may be divided into T time feature groups based on the time grouping period, where T ≥ 2.
As an example, if the time grouping period is a month, all original high-order combined features may be divided into 12 time feature groups, and the time feature group of each original high-order combined feature is determined based on its time tag. For example, if a user accesses the system on January 10, the time tag recorded in the resulting user portrait data (i.e., the original training data) is January 10, and the formed original high-order combined feature is divided into the time feature group corresponding to January.
S5012: counting a first feature quantity of original high-order combination features in the time feature group, counting a second feature quantity of the original high-order combination features in initial leaf nodes corresponding to the same sample feature in the time feature group, and determining the current saturation of each initial leaf node based on the first feature quantity and the second feature quantity.
The first feature quantity is the number of all original high-order combined features in the t-th time feature group, denoted K_t; thus, in the t-th time feature group, the number of original high-order combined features corresponding to each sample feature (i.e., each data field) is K_t. Among the original high-order combined features of the t-th time feature group, each original high-order combined feature includes at least two data fields corresponding to sample features, and each data field corresponds to initial leaf nodes in the initial forest model. The number of original high-order combined features in the t-th time feature group that fall on the j-th initial leaf node is counted and determined as the second feature quantity K_{t,j}, where Σ_{j=1}^{g} K_{t,j} = K_t and g is the number of initial leaf nodes corresponding to the data field in question. Based on the first feature quantity K_t and the second feature quantity K_{t,j}, the current saturation of the j-th initial leaf node in the t-th time feature group is determined as P_{t,j} = K_{t,j} / K_t.
For example, if the time grouping period is a month, the number of time feature groups formed is T = 12, and the first feature quantity of the original high-order combined features in the time feature group corresponding to January is K_1. If the size of the 1st data field is 4, i.e., g = 4, corresponding to 4 initial leaf nodes with data field contents 1000, 0100, 0010 and 0001, then the numbers of original high-order combined features corresponding to 1000, 0100, 0010 and 0001 are counted as K_{1,1}, K_{1,2}, K_{1,3} and K_{1,4} respectively, and the current saturations of the 1st to 4th initial leaf nodes are P_{1,1} = K_{1,1}/K_1, P_{1,2} = K_{1,2}/K_1, P_{1,3} = K_{1,3}/K_1 and P_{1,4} = K_{1,4}/K_1.
S5013: and calculating the standard deviation of the current saturation of the same initial leaf node in at least two time feature groups to obtain a saturation analysis result corresponding to each sample feature.
In this example, a standard deviation calculation is performed on the current saturations corresponding to the same initial leaf node across the T time feature groups, i.e., the standard deviation formula is applied to the current saturations of that initial leaf node to determine the saturation standard deviation corresponding to the initial leaf node, which is determined as its saturation analysis result. Generally speaking, the smaller the saturation standard deviation, the more uniform and stable the initial leaf node's distribution over time across all original high-order combined features, and the smaller the fluctuation; the larger the saturation standard deviation, the less uniform and stable its distribution over time, and the larger the fluctuation.
For example, to compute the saturation analysis result of the 1st initial leaf node, the standard deviation of its current saturations in the 1st to T-th time feature groups is calculated, i.e., the standard deviation formula is applied to P_{1,1}, P_{2,1}, …, P_{T,1}, obtaining the saturation standard deviation of the 1st initial leaf node; and so on, the saturation standard deviations of all initial leaf nodes in the initial forest model are obtained, and thus the saturation analysis result corresponding to each sample feature.
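The saturation computation of steps S5011 to S5013 can be sketched as follows, under an assumed data layout: one row per original high-order combined feature, recording the month of its time tag (the time grouping period) and the initial leaf node it falls on within one data field:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "month": rng.integers(1, 13, size=1000),  # time feature group t (T = 12)
    "leaf":  rng.integers(0, 4, size=1000),   # initial leaf node j (g = 4)
})

# S5012: K_{t,j} counts per (group, leaf); current saturation P_{t,j} = K_{t,j} / K_t
counts = df.groupby(["month", "leaf"]).size().unstack(fill_value=0)  # K_{t,j}
saturation = counts.div(counts.sum(axis=1), axis=0)                  # P_{t,j}

# S5013: saturation standard deviation of each leaf across the T groups
sat_std = saturation.std(axis=0)
print(sat_std)  # saturation analysis result, one value per initial leaf node
```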
S502: and performing importance analysis based on the sample label and the original high-order combination characteristics to obtain an importance analysis result corresponding to each sample characteristic.
The importance analysis is a process of determining how strongly a given sample feature influences the model training purpose. The importance analysis result corresponding to each sample feature can be understood as the degree of influence of that sample feature on the model training purpose. For example, if the model training purpose is to analyze whether a user accesses the system at a characteristic time (e.g., early morning), the analysis may show that the user's occupation has a higher influence on system access than the user's gender; similarly, when analyzing whether a user purchases a certain product, the analysis may show that the user's gender has a higher influence on the purchase than the user's occupation. Generally, the higher a sample feature's influence on the model training purpose and the better its importance analysis result, the more the leaf nodes corresponding to that sample feature need to be retained. The importance analysis can therefore be performed on the original high-order combined features based on the model training purpose determined by the sample labels, to determine the importance analysis result corresponding to each sample feature.
In a specific embodiment, the step S502, namely performing importance analysis based on the sample label and the original high-order combined features and obtaining an importance analysis result corresponding to each sample feature, specifically includes the following steps:
S5021: From the original high-order combined features whose sample labels match the model training purpose, counting the third feature quantity of original high-order combined features falling on each initial leaf node corresponding to the same sample feature, and determining the sample feature value with the maximum third feature quantity as the standard feature value corresponding to that sample feature.
In this example, only the original high-order combined features whose sample labels match the model training purpose are analyzed, which ensures that the finally trained target forest model can more accurately extract, from the multiple sample features, the model training samples related to the model training purpose, and helps guarantee the accuracy and timeliness of subsequent model training. For example, if the model training purpose is to analyze whether a user intends to purchase a certain product, only the original high-order combined features whose sample label is purchased are extracted for subsequent analysis, and those whose sample label is not purchased are not extracted.
As an example, the original high-order combined feature includes N data fields, the size of the i-th data field is n_i, each data field corresponds to one sample feature, and n_i equals the number of initial leaf nodes corresponding to the i-th sample feature. Step S5021 may therefore proceed as follows. First, count the third feature quantity of original high-order combined features falling on each initial leaf node corresponding to the i-th sample feature, i.e., count the third feature quantity L_{i,j} of original high-order combined features classified into the j-th initial leaf node corresponding to the i-th sample feature. For example, if the size of the 1st data field is 4, corresponding to 4 initial leaf nodes with data field contents 1000, 0100, 0010 and 0001, count the third feature quantity L_{1,1} of features whose sample feature value in the first data field is 1000, the third feature quantity L_{1,2} for 0100, the third feature quantity L_{1,3} for 0010, and the third feature quantity L_{1,4} for 0001. Then, determine the sample feature value with the maximum third feature quantity as the standard feature value corresponding to the sample feature, i.e., the sample feature value corresponding to the maximum of L_{1,1}, L_{1,2}, L_{1,3} and L_{1,4} is determined as the standard feature value; for example, if L_{1,1} is the maximum, 1000 is determined as the standard feature value of the 1st sample feature. It can be understood that each sample feature corresponds to a standard feature value in 0/1 form.
S5022: and determining a current correlation coefficient of each sample characteristic by using a sample characteristic value and a standard characteristic value corresponding to each sample characteristic in the original high-order combined characteristics.
The current correlation coefficient is obtained by performing a correlation calculation between the sample feature value corresponding to the sample feature in each original high-order combined feature and the standard feature value. Since the sample feature value and the standard feature value corresponding to each sample feature in the original high-order combined features are binary 0/1 data, the correlation between them can be judged using the Jaccard coefficient J(A, B) = M11 / (M01 + M10 + M11), where J(A, B) is the current correlation coefficient; A and B are the standard feature value and the sample feature value to be correlated; M00 is the number of bit positions where both A and B are 0; M01 is the number of bit positions where A is 0 and B is 1; M10 is the number of bit positions where A is 1 and B is 0; and M11 is the number of bit positions where both A and B are 1 (M00 does not enter the Jaccard coefficient).
S5023: and calculating the standard deviation of the current correlation coefficients corresponding to all the original high-order combination characteristics to obtain the importance analysis result corresponding to each sample characteristic.
In this example, a standard deviation formula may be used to compute the standard deviation of the current correlation coefficients corresponding to the same sample feature across all original high-order combined features, determining the importance standard deviation corresponding to that sample feature, which is determined as its importance analysis result. Generally, the smaller the importance standard deviation of a sample feature, the more uniform and stable the distribution of its sample feature values across all original high-order combined features, and the smaller the fluctuation; conversely, the larger the importance standard deviation, the less uniform and stable the distribution, and the larger the fluctuation.
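A sketch of steps S5021 to S5023 under assumed inputs: field holds the One-Hot data field of one sample feature over the original high-order combined features whose sample label matches the model training purpose:

```python
import numpy as np

rng = np.random.default_rng(0)
field = np.eye(4, dtype=int)[rng.integers(0, 4, size=500)]  # (500, 4): g = 4 leaf nodes

# S5021: the standard feature value is the most frequent sample feature value.
standard = np.eye(4, dtype=int)[field.sum(axis=0).argmax()]

# S5022: Jaccard coefficient J(A, B) = M11 / (M01 + M10 + M11) for binary vectors.
def jaccard(a: np.ndarray, b: np.ndarray) -> float:
    m11 = np.sum((a == 1) & (b == 1))
    m01 = np.sum((a == 0) & (b == 1))
    m10 = np.sum((a == 1) & (b == 0))
    return m11 / (m01 + m10 + m11)

coeffs = np.array([jaccard(standard, row) for row in field])

# S5023: the importance standard deviation of the current correlation coefficients.
print(coeffs.std())  # importance analysis result for this sample feature
```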
S503: and if the saturation analysis result conforms to the saturation standard threshold and the importance analysis result conforms to the importance standard threshold, determining the initial leaf node in the initial forest model corresponding to the sample characteristics as an effective leaf node.
The saturation standard threshold is a preset threshold for evaluating whether the saturation meets the standard. The importance standard threshold is a preset threshold for evaluating whether the importance meets the standard. In this example, the saturation analysis result is compared with the saturation standard threshold, and the importance analysis result is compared with the importance standard threshold. If the importance analysis result corresponding to a sample feature meets the importance standard threshold, i.e., the importance standard deviation of the sample feature is smaller than the importance standard threshold, the sample feature values corresponding to that sample feature are stable and fluctuate little. If the saturation analysis result of an initial leaf node meets the saturation standard threshold, i.e., the saturation standard deviation of the initial leaf node is smaller than the saturation standard threshold, that initial leaf node is uniform and stable over time and fluctuates little. In this case, the initial leaf node in the initial forest model corresponding to the sample feature is determined as an effective leaf node, so that the effective leaf nodes can subsequently be used to prune the initial forest model into the effective forest model. An effective leaf node can thus be understood as an initial leaf node whose saturation analysis result meets the saturation standard threshold, within the feature tree of a sample feature whose importance analysis result meets the importance standard threshold.
In the sample model training method provided by this embodiment, saturation analysis is performed based on the time label and the original high-order combination features, so that the obtained saturation analysis result can determine, from the angle of time distribution, whether all the original high-order combination features are uniform and stable; importance analysis is performed based on the sample label and the original high-order combination features, so that the obtained importance analysis result can reflect how well the original high-order combination features match the model training purpose and determine whether their data distribution is uniform and stable; and the initial leaf nodes whose saturation analysis result meets the saturation standard threshold and whose importance analysis result meets the importance standard threshold are determined as effective leaf nodes. Considering the saturation and importance analysis results together removes the initial leaf nodes with large fluctuation, and avoids the problem that, in subsequent model training, the model training samples output based on the effective forest model cause overfitting, so that a stable DeepFM model cannot be learned or the accuracy of the trained DeepFM model's output suffers.
In an embodiment, as shown in fig. 6, step S205, namely, performing LR regularization screening based on the sample labels and the effective high-order combination features, and determining the target leaf node specifically includes the following steps:
S601: and dividing all effective high-order combination features into a training set and a verification set, performing LR modeling based on the effective high-order combination features in the training set, and adjusting an L2 regularization coefficient so that the AUC of the effective high-order combination features in the verification set is maximized, so as to obtain a target LR model.
As an example, all the effective high-order combination features may be divided into a training set and a verification set at a 7:3 ratio; LR modeling is then performed by taking the effective high-order combination features in the training set as input and the corresponding sample labels as output, and the L2 regularization coefficient is adjusted so that the target LR model determined by the LR modeling is smoother and the AUC formed by the effective high-order combination features of the verification set in the target LR model is maximized, completing the modeling process of the target LR model.
For example, the L2 regularization coefficient may be adjusted using an iterative grid search so that the AUC on the verification set is maximized: first try 1, 0.1, 0.01, 0.001, and 0.0001; if 0.001 turns out to be the best, design the next set of experiments around 0.001, e.g., [0.0005, 0.001, 0.002]; repeat this step several times until there is no significant improvement, at which point the modeling process of the target LR model is complete and the given L2 regularization coefficient is determined. Here, AUC (Area Under Curve) is defined as the area enclosed by the ROC curve and the coordinate axes; obviously the value of this area is not larger than 1. Since the ROC curve generally lies above the line y = x, the AUC ranges between 0.5 and 1. The closer the AUC is to 1.0, the more reliable the detection method; at 0.5 the reliability is lowest and the method has little application value. The L2-regularized model is called ridge regression, which can prevent model overfitting.
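A runnable sketch of this tuning loop using scikit-learn (the library choice, the grid values, and the fixed number of refinement rounds standing in for "until there is no significant improvement" are assumptions; scikit-learn exposes the L2 strength as its inverse, the parameter C):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def fit_target_lr(X_train, y_train, X_val, y_val, rounds=3):
    """Iterative grid search over the L2 regularization coefficient,
    keeping the value that maximizes AUC on the verification set."""
    grid = [1, 0.1, 0.01, 0.001, 0.0001]
    best_lam, best_auc = None, -np.inf
    for _ in range(rounds):
        for lam in grid:
            model = LogisticRegression(penalty="l2", C=1.0 / lam, max_iter=1000)
            model.fit(X_train, y_train)
            auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
            if auc > best_auc:
                best_auc, best_lam = auc, lam
        grid = [best_lam * 0.5, best_lam, best_lam * 2.0]  # refine around best
    target_lr = LogisticRegression(penalty="l2", C=1.0 / best_lam, max_iter=1000)
    return target_lr.fit(X_train, y_train), best_lam, best_auc
```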
S602: and acquiring an absolute value of an LR coefficient corresponding to each effective leaf node in the effective forest model based on the target LR model.
In this example, for the target LR model corresponding to the effective forest model and determined under the given L2 regularization coefficient, the LR coefficient corresponding to each effective leaf node in the effective forest model is determined from the target LR model, and the absolute value of each LR coefficient is obtained. An LR coefficient here is the coefficient of a variable in the target LR model determined after LR modeling, where each variable corresponds to one effective leaf node.
S603: and selecting a preset number of effective leaf nodes with the largest absolute values of the LR coefficients, and determining these effective leaf nodes as target leaf nodes.
The preset number is the preset number of target leaf nodes to be retained, denoted X. If the number of effective leaf nodes in the effective forest model is 1000, the absolute values of the LR coefficients corresponding to the effective leaf nodes are sorted, the top X effective leaf nodes are selected, and these are determined as target leaf nodes, so that the LR model formed by the selected X target leaf nodes has essentially the same AUC as the target LR model formed by all the effective leaf nodes, thereby identifying the leaf nodes most relevant to the sample label.
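A minimal sketch of this top-X selection from a fitted target LR model (all names are illustrative, and each model coefficient is assumed to correspond to one effective leaf node, in order):

```python
import numpy as np

def select_target_leaves(target_lr, leaf_ids, x):
    """Rank effective leaf nodes by the absolute value of their LR
    coefficients and keep the X largest."""
    abs_coefs = np.abs(target_lr.coef_.ravel())  # one coefficient per leaf
    top = np.argsort(abs_coefs)[::-1][:x]        # indices of the X largest
    return [leaf_ids[i] for i in top]
```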
For example, when the initial forest model constructed based on XGBOOST includes 300 feature trees with 10000 initial leaf nodes in total, stability screening may yield an effective forest model containing only 8000 effective leaf nodes. If there are 100000 effective high-order combination features in total, they are divided into a training set and a verification set at a 7:3 ratio. The 70000 effective high-order combination features in the training set are then used for LR modeling: they form a [70000, 8000] matrix fed to the target LR model, with a [70000, 1] 0/1 vector as the y dimension. The target LR model is run and the LR coefficients corresponding to the 8000 effective leaf nodes are inspected; since these coefficients can be positive or negative and large or small, the absolute values of all the LR coefficients are taken, and the top X effective leaf nodes (e.g., X = 3000 or 4000) are selected. Then the 30000 effective high-order combination features in the verification set are used to compare the top X effective leaf nodes against all 8000 effective leaf nodes; if their AUCs in the target LR model are essentially consistent, that is, the AUC similarity reaches a similarity threshold or the AUC difference is smaller than a preset difference, the X effective leaf nodes are determined as target leaf nodes. The number of target leaf nodes is smaller, yet essentially the same effect as all the effective leaf nodes is achieved, which guarantees that the dimensionality of the model training samples output based on the target forest model is smaller without affecting the accuracy of the model training purpose.
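The final consistency check could be sketched as follows, assuming two LR models have been fitted, one on all 8000 effective leaf features and one on the top-X subset (the 0.01 tolerance is an assumption standing in for the preset difference):

```python
from sklearn.metrics import roc_auc_score

def auc_consistent(lr_full, lr_topx, X_val_full, X_val_topx, y_val,
                   max_diff=0.01):
    """True if the top-X leaf subset loses almost no AUC on the
    verification set compared with the full set of effective leaves."""
    auc_full = roc_auc_score(y_val, lr_full.predict_proba(X_val_full)[:, 1])
    auc_topx = roc_auc_score(y_val, lr_topx.predict_proba(X_val_topx)[:, 1])
    return abs(auc_full - auc_topx) <= max_diff
```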
In the sample model training method provided by this embodiment, LR modeling is performed using the effective high-order combination features, and adjusting the L2 regularization coefficient makes the target LR model smoother and its results more accurate. The absolute value of the LR coefficient corresponding to each effective leaf node in the target LR model is then computed, and a preset number of effective leaf nodes with the largest absolute LR coefficients are selected as target leaf nodes. This further reduces the number of leaf nodes in the resulting target forest model while keeping the results obtained from model training samples output by the target forest model close to those obtained from samples output by the effective forest model, thereby ensuring that the dimensionality of the model training samples output based on the target forest model is smaller without affecting the accuracy of model training.
The sample generation method provided by the embodiment of the present invention may be applied to a computer device shown in fig. 1, where the computer device may be a server. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used to store data employed or generated during execution of the sample generation method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a sample generation method.
In an embodiment, as shown in fig. 7, a sample generation method is provided, which is described taking its application to the computer device shown in fig. 1 as an example; the sample generation method includes:
S701: and acquiring data to be processed, wherein the data to be processed comprises characteristic data corresponding to at least two sample characteristics.
Here, the data to be processed is unprocessed data used for generating model training samples. The data to be processed includes feature data corresponding to at least two sample features, which corresponds to the feature data of the at least two sample features in the original training data in the above embodiment; however, the data to be processed carries no sample label, so a model training sample needs to be formed first and then input into the DeepFM model for model training in order to learn the corresponding sample label.
S702: and inputting the characteristic data corresponding to the at least two sample characteristics into the target forest model determined by the sample model training method, and determining the target high-order combined characteristics in One-Hot coding form output by the target forest model as a model training sample of the DeepFM model.
In the sample generation method provided by this embodiment, the target forest model determined in the above embodiment is used to convert the feature data of at least two sample features of the data to be processed into high-order combination features in One-Hot coding form containing at least two data fields, forming model training samples that can be input into the DeepFM model for model training. Because the target forest model is determined by performing stability screening and LR regularization screening on the initial leaf nodes of the initial forest model, this twofold screening reduces the dimensionality of the formed high-order combination features, so that when the model training samples output by the target forest model are input into the DeepFM model for model training, system resources occupied by the training process are saved and training time is shortened. Moreover, the target leaf nodes in the target forest model match the model training purpose of the DeepFM model, and leaf nodes with low stability have been filtered out, which reduces overfitting of the model training samples output by the target forest model during DeepFM model training, saves system resources occupied in the model training process, and improves the accuracy of the model training samples used to train the DeepFM model.
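As an illustration of S702, assuming the target forest model is realized as an XGBoost booster whose leaf nodes have been pruned as described above (the encoder handling and all names are assumptions, not taken from this disclosure):

```python
import numpy as np
import xgboost as xgb
from sklearn.preprocessing import OneHotEncoder

def generate_training_samples(target_forest: xgb.Booster, X: np.ndarray):
    """Map each row of feature data to the leaf it reaches in every
    feature tree, then One-Hot encode the leaf indices to form the
    target high-order combination features fed to the DeepFM model."""
    leaf_idx = target_forest.predict(xgb.DMatrix(X), pred_leaf=True)
    # In practice the encoder would be fit once on training data and reused;
    # sparse_output=False requires scikit-learn >= 1.2.
    return OneHotEncoder(sparse_output=False).fit_transform(leaf_idx)
```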
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In an embodiment, a sample model training apparatus is provided, and the sample model training apparatus corresponds to the sample model training method in the above embodiment one to one. As shown in fig. 8, the sample model training apparatus includes the following functional modules, which are described in detail as follows:
an original training data obtaining module 801, configured to obtain original training data, where the original training data includes a sample label and feature data corresponding to at least two sample features.
The original high-order combined feature obtaining module 802 is configured to input original training data into an initial forest model constructed based on a tree model, and obtain original high-order combined features in an One-Hot coding form corresponding to the original training data, where the initial forest model includes at least two feature trees arranged in sequence, each feature tree corresponds to a sample feature and includes at least two initial leaf nodes.
And the effective forest model obtaining module 803 is used for performing stability screening based on the sample label and the original high-order combination characteristic, determining effective leaf nodes, and pruning the initial leaf nodes of the initial forest model based on the effective leaf nodes to obtain the effective forest model.
And the effective high-order combined feature obtaining module 804 is used for inputting the original training data into the effective forest model and obtaining effective high-order combined features of the One-Hot coding form corresponding to the original training data.
And the target forest model obtaining module 805 is used for performing LR regularized screening based on the sample labels and the effective high-order combination characteristics, determining target leaf nodes, and performing pruning on the effective leaf nodes in the effective forest model based on the target leaf nodes to obtain the target forest model.
Preferably, the original high-order combination feature obtaining module 802 includes:
and the initial characteristic acquisition unit is used for respectively inputting the characteristic data corresponding to at least two sample characteristics into the characteristic trees corresponding to the sample characteristics for processing to acquire the initial characteristics corresponding to the sample characteristics.
And the initial feature splicing unit is used for splicing the initial features corresponding to the at least two sample features according to the arrangement sequence of the at least two feature trees to obtain the original high-order combination features of the One-Hot coding form corresponding to the original training data.
Preferably, the raw training data further comprises a time tag. An effective forest model obtaining module 803, including:
and the saturation analysis result acquisition unit is used for carrying out saturation analysis based on the time labels and the original high-order combination characteristics and acquiring a saturation analysis result corresponding to each sample characteristic.
And the importance analysis result acquisition unit is used for carrying out importance analysis based on the sample label and the original high-order combination characteristic and acquiring an importance analysis result corresponding to each sample characteristic.
And the effective leaf node obtaining unit is used for determining the initial leaf node in the initial forest model corresponding to the sample characteristics as the effective leaf node if the saturation analysis result conforms to the saturation standard threshold and the importance analysis result conforms to the importance standard threshold.
Preferably, the saturation analysis result acquisition unit includes:
and the time feature group dividing subunit is used for grouping the original high-order combination features corresponding to the time labels based on the time grouping period to obtain at least two time feature groups.
The current saturation acquiring subunit is configured to count a first feature quantity of original high-order combination features in the time feature group, count a second feature quantity of the original high-order combination features in an initial leaf node corresponding to the same sample feature in the time feature group, and determine the current saturation of each initial leaf node based on the first feature quantity and the second feature quantity.
And the saturation analysis result acquisition subunit is used for calculating the standard deviation of the current saturation of the same initial leaf node in at least two time feature groups and acquiring the saturation analysis result corresponding to each sample feature.
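A hedged sketch of these saturation subunits, simplified to the leaves of one sample feature's tree (`groups` maps each time feature group to its original high-order combination features, `leaf_of` maps a feature to the initial leaf node it falls into; all names are illustrative):

```python
from collections import Counter
import numpy as np

def saturation_analysis(groups, leaf_of):
    """Per initial leaf node: current saturation = second feature quantity /
    first feature quantity within each time feature group, then the standard
    deviation of that saturation across groups."""
    leaves = {leaf_of[f] for feats in groups.values() for f in feats}
    sats = {leaf: [] for leaf in leaves}
    for feats in groups.values():
        first = len(feats)                       # first feature quantity
        counts = Counter(leaf_of[f] for f in feats)
        for leaf in leaves:                      # second feature quantity
            sats[leaf].append(counts.get(leaf, 0) / first if first else 0.0)
    return {leaf: float(np.std(v)) for leaf, v in sats.items()}
```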
Preferably, the importance analysis result acquisition unit includes:
and the standard characteristic value obtaining subunit is used for counting the third characteristic quantity of the original high-order combined characteristics in the initial leaf node corresponding to the same sample characteristic from the original high-order combined characteristics matched with the sample label and the model training purpose, and determining the sample characteristic value with the maximum third characteristic quantity as the standard characteristic value corresponding to the sample characteristic.
And the current correlation coefficient obtaining subunit is used for determining a current correlation coefficient of each sample characteristic according to the sample characteristic value and the standard characteristic value corresponding to each sample characteristic in the original high-order combined characteristic.
And the importance analysis result acquisition subunit is used for calculating the standard deviation of the current correlation coefficients corresponding to all the original high-order combination features and acquiring the importance analysis result corresponding to each sample feature.
Preferably, the target forest model obtaining module 805 includes:
and the target LR model acquisition unit is used for dividing all effective high-order combinations into a training set and a verification set, performing LR modeling based on the effective high-order combination characteristics in the training set, and adjusting an L2 regularization coefficient to enable the AUC of the effective high-order combination characteristics in the verification set to be maximum so as to acquire a target LR model.
And the coefficient absolute value acquisition unit is used for acquiring the absolute value of the LR coefficient corresponding to each effective leaf node in the effective forest model based on the target LR model.
And the target leaf node acquisition unit is used for selecting a preset number of effective leaf nodes with larger absolute values of the LR coefficients and determining the effective leaf nodes as the target leaf nodes.
For the specific definition of the sample model training device, reference may be made to the above definition of the sample model training method, which is not described herein again. The modules in the sample model training device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a sample generation apparatus is provided, which corresponds to the sample generation method in the above embodiments one to one. As shown in fig. 9, the sample generation apparatus includes the following functional blocks, each of which is described in detail as follows:
a to-be-processed data obtaining module 901, configured to obtain to-be-processed data, where the to-be-processed data includes feature data corresponding to at least two sample features.
A model training sample obtaining module 902, configured to input feature data corresponding to at least two sample features into the target forest model determined by the sample model training method, and determine a target high-order combination feature in a One-Hot coding form output by the target forest model as a model training sample of the deep fm model.
In an embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the sample model training method in the foregoing embodiments is implemented, for example, as shown in S201 to S205 in fig. 2, or as shown in fig. 3, 5, and 6, and is not described here again to avoid repetition. Alternatively, the processor implements the functions of each module/unit in the embodiment of the sample model training apparatus when executing the computer program, for example, the functions of each module/unit shown in fig. 8, and are not described herein again to avoid repetition. Alternatively, when the processor executes the computer program, the sample generation method in the foregoing embodiments is implemented, for example, in S701-S702 shown in fig. 7, and details are not repeated here to avoid repetition. Alternatively, the processor implements the functions of each module/unit in the embodiment of the sample generation apparatus when executing the computer program, for example, the functions of each module/unit shown in fig. 9, and are not described here again to avoid repetition.
In an embodiment, a computer-readable storage medium is provided, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the method for training a sample model in the foregoing embodiments is implemented, for example, S201 to S205 shown in fig. 2, or as shown in fig. 3, fig. 5, and fig. 6, which is not described herein again to avoid repetition. Alternatively, the computer program, when executed by the processor, implements the functions of the modules/units in the embodiment of the sample model training apparatus, such as the functions of fig. 8, which are not described herein again to avoid redundancy. Alternatively, the computer program is executed by the processor to implement the sample generation method in the above embodiments, for example, S701-S702 shown in fig. 7, and details are not repeated here to avoid repetition. Alternatively, the computer program, when executed by the processor, implements the functions of the modules/units in the embodiment of the sample generation apparatus, for example, the functions of the modules/units shown in fig. 9, and are not described herein again to avoid repetition.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing related hardware; the computer program can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A sample model training method is characterized by comprising the following steps:
acquiring original training data, wherein the original training data comprises a sample label and characteristic data corresponding to at least two sample characteristics;
inputting the original training data into an initial forest model constructed based on a tree model, and acquiring original high-order combination features of a One-Hot coding form corresponding to the original training data, wherein the initial forest model comprises at least two feature trees which are sequentially arranged, each feature tree corresponds to One sample feature and comprises at least two initial leaf nodes;
performing stability screening based on the sample label and the original high-order combination characteristics, determining effective leaf nodes, and pruning the initial leaf nodes of the initial forest model based on the effective leaf nodes to obtain an effective forest model;
inputting the original training data into the effective forest model, and acquiring effective high-order combination characteristics of One-Hot coding form corresponding to the original training data;
and performing LR regularized screening based on the sample label and the effective high-order combination characteristics, determining a target leaf node, and performing pruning on the effective leaf node in the effective forest model based on the target leaf node to obtain the target forest model.
2. The sample model training method of claim 1, wherein the raw training data further comprises a time tag;
the stability screening based on the sample label and the original high-order combination characteristics to determine effective leaf nodes comprises:
performing saturation analysis based on the time labels and the original high-order combined features to obtain a saturation analysis result corresponding to each sample feature;
performing importance analysis based on the sample label and the original high-order combined features to obtain an importance analysis result corresponding to each sample feature;
and if the saturation analysis result accords with a saturation standard threshold and the importance analysis result accords with an importance standard threshold, determining an initial leaf node in the initial forest model corresponding to the sample characteristics as an effective leaf node.
3. The method for training the sample model according to claim 2, wherein the performing saturation analysis based on the time tag and the original high-order combined features to obtain a saturation analysis result corresponding to each sample feature comprises:
grouping original high-order combination characteristics corresponding to the time labels based on a time grouping period to obtain at least two time characteristic groups;
counting a first feature quantity of original high-order combination features in the time feature group, counting a second feature quantity of the original high-order combination features in initial leaf nodes corresponding to the same sample feature in the time feature group, and determining the current saturation of each initial leaf node based on the first feature quantity and the second feature quantity;
and calculating the standard deviation of the current saturation of the same initial leaf node in at least two time feature groups to obtain a saturation analysis result corresponding to each sample feature.
4. The method for training the sample model according to claim 2, wherein the performing importance analysis based on the sample label and the original high-order combined feature to obtain an importance analysis result corresponding to each sample feature comprises:
counting a third feature quantity of original high-order combination features in an initial leaf node corresponding to the same sample feature from the original high-order combination features matched with the sample label and the model training purpose, and determining a sample feature value with the maximum third feature quantity as a standard feature value corresponding to the sample feature;
determining a current correlation coefficient of each sample characteristic according to a sample characteristic value corresponding to each sample characteristic in original high-order combined characteristics and the standard characteristic value;
and calculating the standard deviation of the current correlation coefficients corresponding to all the original high-order combination characteristics to obtain the importance analysis result corresponding to each sample characteristic.
5. The sample model training method of claim 1, wherein the performing LR regularization screening based on the sample labels and the valid high-order combination features to determine target leaf nodes comprises:
dividing all effective high-order combination features into a training set and a verification set, performing LR modeling based on the effective high-order combination features in the training set, and adjusting an L2 regularization coefficient so that the AUC of the effective high-order combination features in the verification set is maximized, so as to obtain a target LR model;
acquiring an absolute value of an LR coefficient corresponding to each effective leaf node in the effective forest model based on the target LR model;
and selecting a preset number of effective leaf nodes with the largest absolute values of the LR coefficients, and determining the selected effective leaf nodes as target leaf nodes.
6. A method of generating a sample, comprising:
acquiring data to be processed, wherein the data to be processed comprises characteristic data corresponding to at least two sample characteristics;
inputting characteristic data corresponding to at least two sample characteristics into a target forest model determined by the sample model training method of any one of claims 1 to 5, and determining target high-order combined characteristics in One-Hot coding form output by the target forest model as a model training sample of the DeepFM model.
7. A sample model training apparatus, comprising:
the system comprises an original training data acquisition module, a data processing module and a data processing module, wherein the original training data acquisition module is used for acquiring original training data which comprises sample labels and characteristic data corresponding to at least two sample characteristics;
the original high-order combined feature acquisition module is used for inputting the original training data into an initial forest model constructed based on a tree model, and acquiring original high-order combined features of an One-Hot coding form corresponding to the original training data, wherein the initial forest model comprises at least two feature trees which are sequentially arranged, each feature tree corresponds to One sample feature and comprises at least two initial leaf nodes;
the effective forest model obtaining module is used for performing stability screening based on the sample label and the original high-order combination characteristics, determining effective leaf nodes, and pruning the initial leaf nodes of the initial forest model based on the effective leaf nodes to obtain an effective forest model;
the effective high-order combined feature acquisition module is used for inputting the original training data into the effective forest model and acquiring effective high-order combined features of an One-Hot coding form corresponding to the original training data;
and the target forest model acquisition module is used for performing LR regularized screening based on the sample label and the effective high-order combination characteristics, determining target leaf nodes, and performing pruning on the effective leaf nodes in the effective forest model based on the target leaf nodes to acquire the target forest model.
8. A sample generation device, comprising:
the device comprises a to-be-processed data acquisition module, a to-be-processed data acquisition module and a to-be-processed data processing module, wherein the to-be-processed data acquisition module is used for acquiring to-be-processed data which comprises characteristic data corresponding to at least two sample characteristics;
a model training sample obtaining module, configured to input feature data corresponding to at least two sample features into a target forest model determined by the sample model training method according to any one of claims 1 to 5, and determine a target high-order combination feature in One-Hot coding form output by the target forest model as a model training sample of the DeepFM model.
9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the sample model training method of any one of claims 1 to 5; alternatively, the processor, when executing the computer program, implements the sample generation method of claim 6.
10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out a method of training a sample model according to any one of claims 1 to 5; alternatively, the computer program, when executed by a processor, implements the sample generation method of claim 6.