CN111581877A - Sample model training method, sample generation method, device, equipment and medium

Sample model training method, sample generation method, device, equipment and medium

Info

Publication number: CN111581877A
Application number: CN202010218666.XA
Authority: CN (China)
Other languages: Chinese (zh)
Inventor: 张跃 (Zhang Yue)
Assignee (current and original): Ping An Life Insurance Company of China Ltd
Application filed by: Ping An Life Insurance Company of China Ltd
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00 Computer-aided design [CAD]
    • G06F30/20 Design optimisation, verification or simulation
    • G06F30/27 Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods


Abstract

The invention discloses a sample model training method, a sample generation method, a device, equipment and a medium. The method comprises the following steps: acquiring original training data, wherein the original training data comprises a sample label and feature data corresponding to at least two sample features; inputting the original training data into an initial forest model constructed based on a tree model, and acquiring original high-order combined features; performing stability screening based on the sample label and the original high-order combined features, determining effective leaf nodes, and pruning the initial forest model based on the effective leaf nodes to obtain an effective forest model; inputting the original training data into the effective forest model to obtain effective high-order combined features; and performing LR regularized screening based on the sample label and the effective high-order combined features, determining target leaf nodes, and pruning the effective forest model based on the target leaf nodes to obtain the target forest model. The model training samples output by the target forest model have a low dimension, so the timeliness and accuracy of model training can be guaranteed.

Description

Sample model training method, sample generation method, device, equipment and medium
Technical Field
The invention relates to the technical field of data processing, and in particular to a sample model training method, a sample generation method, a device, equipment and a medium.
Background
Because the DeepFM algorithm effectively combines the advantages of a factorization machine and a neural network in feature learning, it can extract low-order combined features and high-order combined features simultaneously, and is therefore widely used in different fields. For example, user portrait data formed when users access a system, or in other scenarios, can be used as model training samples; the model training samples are input into a DeepFM model for model training, the model parameters of the DeepFM model are updated, and a DeepFM-based user portrait analysis model is constructed, so that the user portrait analysis model can extract low-order and high-order combined features simultaneously and produce more accurate analysis results.
In the DeepFM model training process, each model training sample comprises data fields corresponding to at least two sample features; the value in each data field adopts One-Hot coding, and the size of each data field is determined according to the feature data of the sample feature. As an example, for the age sample feature, binary conversion may be performed on the age value to obtain the corresponding One-Hot code; in this case, the size of the data field of the age sample feature is the length of the One-Hot code corresponding to the maximum age. Alternatively, the One-Hot code may be determined by dividing the age sample feature into preset age groups; in this case, the size of the data field of the age sample feature is the number of age groups. As another example, for the city sample feature, the preset feature data Beijing, Shanghai, Tianjin, Chongqing and Guangdong may be converted into 10000, 01000, 00100, 00010 and 00001 respectively; in this case, the size of the data field of the city sample feature is the number of preset feature data values.
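As an illustration of the data-field encoding described above, here is a minimal Python sketch; the age groups and city list are illustrative assumptions, not values from the patent:

```python
# Hypothetical field definitions: 5 preset age groups and 5 preset cities.
AGE_GROUPS = [(0, 18), (18, 30), (30, 45), (45, 60), (60, 200)]
CITIES = ["Beijing", "Shanghai", "Tianjin", "Chongqing", "Guangdong"]

def encode_age(age: int) -> list[int]:
    """One-Hot encode an age by its preset age group; field size = number of groups."""
    field = [0] * len(AGE_GROUPS)
    for i, (lo, hi) in enumerate(AGE_GROUPS):
        if lo <= age < hi:
            field[i] = 1
            break
    return field

def encode_city(city: str) -> list[int]:
    """One-Hot encode a city; field size = number of preset feature data values."""
    field = [0] * len(CITIES)
    field[CITIES.index(city)] = 1
    return field

# A model training sample is the concatenation of all data fields.
sample = encode_age(30) + encode_city("Shanghai")
print(sample)  # [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
```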
In the current DeepFM model training process, each model training sample comprises at least two data fields, and the size of each data field is determined according to the feature data of the sample feature. When the feature data corresponding to a sample feature has a large time span, a high degree of dispersion, or poor stability, the size of the data field of that sample feature is larger, so the dimension of the formed model training sample is higher; when such a model training sample is input into a DeepFM model for training, the training process requires more system resources and a longer training time. Moreover, because the dimension of the model training sample is high, overfitting is prone to occur, so that a stable DeepFM model cannot be learned, or the accuracy of the output of the trained DeepFM model is low.
Disclosure of Invention
The embodiment of the invention provides a sample model training method, a sample generation method, a device, equipment and a medium, and aims to solve the problems that the model training samples used in current DeepFM model training have a high dimension, so that model training requires more system resources, the training time is long, and the recognition accuracy of the trained model is low.
The embodiment of the invention provides a sample model training method, which comprises the following steps:
acquiring original training data, wherein the original training data comprises a sample label and characteristic data corresponding to at least two sample characteristics;
inputting the original training data into an initial forest model constructed based on a tree model, and acquiring original high-order combined features in One-Hot coded form corresponding to the original training data, wherein the initial forest model comprises at least two feature trees arranged in sequence, and each feature tree corresponds to one sample feature and comprises at least two initial leaf nodes;
performing stability screening based on the sample label and the original high-order combined features, determining effective leaf nodes, and pruning the initial leaf nodes of the initial forest model based on the effective leaf nodes to obtain an effective forest model;
inputting the original training data into the effective forest model, and acquiring effective high-order combined features in One-Hot coded form corresponding to the original training data;
and performing LR regularized screening based on the sample label and the effective high-order combined features, determining target leaf nodes, and pruning the effective leaf nodes in the effective forest model based on the target leaf nodes to obtain the target forest model.
The embodiment of the invention provides a sample model training device, which comprises:
an original training data acquisition module, configured to acquire original training data, wherein the original training data comprises a sample label and feature data corresponding to at least two sample features;
an original high-order combined feature acquisition module, configured to input the original training data into an initial forest model constructed based on a tree model and acquire original high-order combined features in One-Hot coded form corresponding to the original training data, wherein the initial forest model comprises at least two feature trees arranged in sequence, and each feature tree corresponds to one sample feature and comprises at least two initial leaf nodes;
an effective forest model acquisition module, configured to perform stability screening based on the sample label and the original high-order combined features, determine effective leaf nodes, and prune the initial leaf nodes of the initial forest model based on the effective leaf nodes to obtain an effective forest model;
an effective high-order combined feature acquisition module, configured to input the original training data into the effective forest model and acquire effective high-order combined features in One-Hot coded form corresponding to the original training data;
and a target forest model acquisition module, configured to perform LR regularized screening based on the sample label and the effective high-order combined features, determine target leaf nodes, and prune the effective leaf nodes in the effective forest model based on the target leaf nodes to obtain the target forest model.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the above sample model training method when executing the computer program.
A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the above-mentioned sample model training method.
The embodiment of the invention provides a sample generation method, which comprises the following steps:
acquiring data to be processed, wherein the data to be processed comprises characteristic data corresponding to at least two sample characteristics;
and inputting the feature data corresponding to the at least two sample features into the target forest model determined by the above sample model training method, and determining the target high-order combined features in One-Hot coded form output by the target forest model as model training samples of the DeepFM model.
An embodiment of the present invention provides a sample generation apparatus, including:
a to-be-processed data acquisition module, configured to acquire to-be-processed data, wherein the to-be-processed data comprises feature data corresponding to at least two sample features;
and a model training sample acquisition module, configured to input the feature data corresponding to the at least two sample features into the target forest model determined by the above sample model training method, and determine the target high-order combined features in One-Hot coded form output by the target forest model as model training samples of the DeepFM model.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the sample generation method when executing the computer program.
A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, implements the above-described sample generation method.
In the sample model training method, device, equipment and medium, the target forest model obtained by training can convert the feature data of at least two sample features in the original training data into high-order combined features in One-Hot coded form comprising at least two data fields, so that the high-order combined features can be input into a DeepFM model for model training. Moreover, because the target forest model is determined by performing stability screening and LR regularized screening on the initial leaf nodes of the initial forest model, dimension reduction is achieved through two rounds of screening, so the formed high-order combined features have a low dimension; when they are output to the DeepFM model for model training, system resources occupied during training are saved and the training time is shortened. In addition, the target leaf nodes in the target forest model match the model training purpose of the DeepFM model, and leaf nodes with low stability are filtered out, which reduces overfitting of the model training samples output by the target forest model during DeepFM model training, saves system resources occupied during model training, and improves the accuracy of the model training samples used to train the DeepFM model.
In the sample generation method, device, equipment and medium, the target forest model determined in the above embodiment is adopted to convert the feature data of at least two sample features of the data to be processed into high-order combined features in One-Hot coded form comprising at least two data fields, so that the high-order combined features form model training samples that can be input into a DeepFM model for model training. Moreover, because the target forest model is determined by performing stability screening and LR regularized screening on the initial leaf nodes of the initial forest model, dimension reduction is achieved through two rounds of screening, so the formed high-order combined features have a low dimension; when the model training samples output by the target forest model are input into the DeepFM model for model training, system resources occupied during training are saved and the training time is shortened. In addition, the target leaf nodes in the target forest model match the model training purpose of the DeepFM model, and leaf nodes with low stability are filtered out, which reduces overfitting of the model training samples output by the target forest model during DeepFM model training, saves system resources occupied during model training, and improves the accuracy of the model training samples used to train the DeepFM model.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a schematic diagram of a computer device in one embodiment of the invention;
FIG. 2 is a flow chart of a sample model training method according to an embodiment of the present invention;
FIG. 3 is another flow chart of a sample model training method in an embodiment of the invention;
FIG. 4 is a diagram of a feature tree in the initial forest model according to an embodiment of the present invention;
FIG. 5 is another flow chart of a sample model training method in an embodiment of the invention;
FIG. 6 is another flow chart of a sample model training method in an embodiment of the invention;
FIG. 7 is a flow chart of a sample generation method in an embodiment of the invention;
FIG. 8 is a schematic diagram of a sample model training apparatus according to an embodiment of the present invention;
FIG. 9 is a schematic view of a sample generation apparatus in accordance with an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The sample model training method provided by the embodiment of the invention can be applied to the computer equipment shown in fig. 1, and the computer equipment can be a server. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data adopted or generated in the process of executing the sample model training method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a sample model training method.
In an embodiment, as shown in fig. 2, a sample model training method is provided, which is described by taking the example that the sample model training method is applied to the computer device in fig. 1, and the sample model training method specifically includes the following steps:
s201: and acquiring original training data, wherein the original training data comprises sample labels and feature data corresponding to at least two sample features.
The original training data refers to unprocessed data used for training the sample model. The sample label is a pre-labeled label that reflects the purpose of sample model training. For example, when training a DeepFM-based user portrait analysis model for evaluating user access to a system: if the purpose of model training is to analyze whether a user accesses the system, the sample label corresponding to the original training data is marked as accessed or not accessed according to the actual situation, and set to 1 or 0 respectively; if the purpose of model training is to analyze whether a user intends to buy a specific product, the sample label is marked as buying or not buying according to the actual situation, and set to 1 or 0 respectively; if the purpose of model training is to analyze the user's performance, the sample label is marked as high performance or low performance according to the actual situation, and set to 1 or 0 respectively.
Sample features refer to feature dimensions in the original training data. Feature data refers to the specific values or specific information corresponding to the sample features in the original training data. As an example, when training the user portrait analysis model, the corresponding original training data may include sample features such as age, gender, income and education; accordingly, the feature data corresponding to these sample features are the specific values or specific information of age, gender, income, education, and so on. For example, at least two sample features and their corresponding feature data, such as age-30, gender-male, income-10000 and education-bachelor, may be stored in key-value form.
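For example, one original training data record might be stored in key-value form as in the following sketch (the field names and values are illustrative assumptions):

```python
raw_record = {
    "label": 1,  # sample label, e.g. 1 = accessed / purchased, 0 = not
    "features": {
        "age": 30,
        "gender": "male",
        "income": 10000,
        "education": "bachelor",
    },
}
```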
S202: inputting original training data into an initial forest model constructed based on a tree model, and acquiring original high-order combination characteristics of an One-Hot coding form corresponding to the original training data, wherein the initial forest model comprises at least two characteristic trees which are sequentially arranged, each characteristic tree corresponds to a sample characteristic and comprises at least two initial leaf nodes.
The initial forest model constructed based on the tree model is a forest model built using XGBoost, LightGBM or another tree model, and comprises at least two feature trees arranged in sequence. Each feature tree corresponds to one sample feature and is used to classify and divide the feature data corresponding to that sample feature. In the initial forest model, the leaf nodes of each feature tree are initial leaf nodes. The original high-order combined feature is the output of the initial forest model for the original training data.
Specifically, if the initial forest model constructed based on the tree model includes N feature trees and the i-th feature tree Y_i has n_i initial leaf nodes, then the number of initial leaf nodes formed based on the initial forest model is n_1 + n_2 + … + n_N = Σ_{i=1}^{N} n_i. Correspondingly, the feature data corresponding to at least two sample features of the original training data are input into the initial forest model for classification; the value of the initial leaf node on which the feature data corresponding to a sample feature falls is determined as 1, and the values of the other initial leaf nodes are determined as 0, so as to form the original high-order combined feature in One-Hot coded form corresponding to the original training data. The original high-order combined feature comprises N data fields, each data field corresponds to one sample feature, and the size of each data field matches the number of initial leaf nodes of the feature tree of the corresponding sample feature, i.e., the size of the i-th data field is n_i. The dimension of the formed One-Hot coded original high-order combined feature is therefore Σ_{i=1}^{N} n_i.
S203: And performing stability screening based on the sample label and the original high-order combined features, determining effective leaf nodes, and pruning the initial leaf nodes of the initial forest model based on the effective leaf nodes to obtain the effective forest model.
Because the distribution of the feature data corresponding to at least two sample features in the original training data may have a large time span, a high degree of dispersion, or poor stability, directly using the original high-order combined features output by the initial forest model as model training samples of the DeepFM model may affect the accuracy and timeliness of subsequent DeepFM model training. Stability screening therefore needs to be performed based on the sample labels and the original high-order combined features. Specifically, for each data field corresponding to the same sample feature in all original high-order combined features, stability screening is performed on the values of that data field together with the sample labels to determine a relatively stable value range of the data field, and the corresponding initial leaf nodes in the initial forest model are determined as effective leaf nodes based on that range. Then, the initial leaf nodes in the initial forest model are pruned based on the effective leaf nodes, i.e., the effective leaf nodes in the initial forest model are retained, and the other initial leaf nodes are deleted or combined to form the effective forest model. The effective forest model comprises at least two effective leaf nodes, so that the high-order combined features formed by inputting the original training data into the effective forest model have a lower dimension and better stability, which avoids overfitting in the subsequent DeepFM model training process.
As an example, take access time as a sample feature. The feature tree corresponding to access time in the initial forest model may divide the access time into the time intervals corresponding to 24 hours according to the feature judgment conditions, so that there are 24 initial leaf nodes corresponding to access time; the resulting original high-order combined feature has a longer dimension, which is not conducive to improving the accuracy and timeliness of subsequent model training. Moreover, for a particular user the access time is relatively fixed: the number of accesses during the 24:00-6:00 period is small, even 0, while the number of accesses in other periods is large, so the access-time sample feature has a large time span, a high degree of dispersion, and poor stability. Stability screening is performed on the sample labels and the contents of the access-time data field in the original high-order combined features: the initial leaf nodes corresponding to the 6:00-24:00 period in the initial forest model are determined as effective leaf nodes, while the five initial leaf nodes corresponding to the 24:00-6:00 period are deleted or combined into one effective leaf node, and the effective forest model is formed based on the effective leaf nodes. This reduces the dimension of the high-order combined features output by the effective forest model, guarantees the timeliness of the subsequent DeepFM model training process, saves the required system resources, and avoids overfitting.
In this example, if the initial forest model constructed based on the tree model includes N feature trees and the i-th feature tree Y_i has n_i initial leaf nodes, then after stability screening based on the sample labels and the original high-order combined features, the i-th feature tree Y_i has m_i effective leaf nodes, where m_i ≤ n_i, so the number of effective leaf nodes in the effective forest model is Σ_{i=1}^{N} m_i, with Σ_{i=1}^{N} m_i ≤ Σ_{i=1}^{N} n_i. This realizes a first dimension reduction of the initial forest model, so that the finally formed model training samples have a lower dimension, which helps guarantee the accuracy and timeliness of subsequent DeepFM model training and saves the system resources required for model training.
S204: and inputting the original training data into an effective forest model, and acquiring effective high-order combination characteristics of the One-Hot coding form corresponding to the original training data.
The effective high-order combined features are the output of the effective forest model for the original training data. As an example, the feature data corresponding to at least two sample features of the original training data are input into the effective forest model for classification; the value of the effective leaf node on which the feature data corresponding to a sample feature falls is determined as 1, and the values of the other effective leaf nodes are determined as 0, so as to form the effective high-order combined feature in One-Hot coded form corresponding to the original training data. The effective high-order combined feature comprises N data fields, the size of the i-th data field is m_i, and the dimension of the formed One-Hot coded effective high-order combined feature is Σ_{i=1}^{N} m_i, where Σ_{i=1}^{N} m_i ≤ Σ_{i=1}^{N} n_i. The original high-order combined features are thus reduced in dimension for subsequent analysis and processing, so that the finally formed model training samples have a lower dimension, which helps guarantee the accuracy and timeliness of subsequent DeepFM model training.
S205: And performing LR regularized screening based on the sample labels and the effective high-order combined features, determining target leaf nodes, and pruning the effective leaf nodes in the effective forest model based on the target leaf nodes to obtain the target forest model.
Because some effective leaf nodes in the effective forest model may be irrelevant to the model training purpose of the subsequent DeepFM model, and the model training purpose of the DeepFM model is associated with the sample label, LR regularized screening can be performed based on the sample labels and the effective high-order combined features, so that the leaf nodes highly relevant to the model training purpose are screened out from all effective leaf nodes and determined as target leaf nodes. Then, the effective forest model is pruned based on the target leaf nodes, i.e., the target leaf nodes in the effective forest model are retained, and the effective leaf nodes other than the target leaf nodes are deleted or combined to form the target forest model. The target forest model is the forest model used, after model training is completed, to generate model training samples for the DeepFM model. The target forest model is formed by further reducing the dimension of the effective forest model; the model training sample it outputs is data content formed based on all target leaf nodes and comprising at least two corresponding data fields. This guarantees the accuracy and timeliness of DeepFM model training while effectively reducing the dimension of the model training sample, so that fewer system resources are occupied during model training and overfitting is avoided.
The target forest model trained by the sample model training method provided in this embodiment can convert the feature data of at least two sample features in the original training data into high-order combined features in One-Hot coded form comprising at least two data fields, so that the high-order combined features can be input into a DeepFM model for model training. Moreover, because the target forest model is determined by performing stability screening and LR regularized screening on the initial leaf nodes of the initial forest model, dimension reduction is achieved through two rounds of screening, so the formed high-order combined features have a low dimension; when they are output to the DeepFM model for model training, system resources occupied during training are saved and the training time is shortened. In addition, the target leaf nodes in the target forest model match the model training purpose of the DeepFM model, and leaf nodes with low stability are filtered out, which reduces overfitting of the model training samples output by the target forest model during DeepFM model training, saves system resources occupied during model training, and improves the accuracy of the model training samples used to train the DeepFM model.
In an embodiment, as shown in fig. 3, inputting the original training data into an initial forest model constructed based on a tree model and acquiring the original high-order combined features in One-Hot coded form corresponding to the original training data specifically includes the following steps:
s301: and respectively inputting the feature data corresponding to at least two sample features into the feature trees corresponding to the sample features for processing to obtain initial features corresponding to the sample features.
Specifically, the initial forest model based on the tree model includes at least two feature trees arranged in sequence; each feature tree corresponds to one sample feature and can be used to analyze the feature data corresponding to that sample feature, so as to convert the feature data into One-Hot coded form and thereby conveniently form model training samples that can be input into the DeepFM model for model training. As an example, the feature tree may employ XGBoost, LightGBM or another tree model.
In this example, the feature tree corresponding to each sample feature is a tree formed based on at least one feature judgment condition. The feature data corresponding to the sample feature is input into the feature tree corresponding to that sample feature, the initial leaf node corresponding to the feature data is determined using the at least one feature judgment condition, the value of that initial leaf node is set to 1, and the values of the other initial leaf nodes are set to 0, so as to obtain the initial feature in One-Hot coded form.
For example, regarding a sample feature of gender, the feature determination condition corresponding thereto may be "whether gender is male" such that 10 indicates gender as male and 01 indicates gender as female.
For another example, as shown in fig. 4, denote the income sample feature as S. The corresponding feature judgment conditions include six conditions: A1: S > 5000, A2: S > 10000, A3: S > 15000, A4: S > 20000, A5: S > 25000 and A6: S > 30000. Based on these six feature judgment conditions, seven initial leaf nodes L1, L2, L3, L4, L5, L6 and L7 are formed, and the income range corresponding to each initial leaf node is: L1: S ≤ 5000; L2: 5000 < S ≤ 10000; L3: 10000 < S ≤ 15000; L4: 15000 < S ≤ 20000; L5: 20000 < S ≤ 25000; L6: 25000 < S ≤ 30000; L7: S > 30000. If the feature data corresponding to income in the original training data is 12000, it falls into the L3 initial leaf node; the value of the L3 initial leaf node is set to 1 and the values of the other initial leaf nodes are set to 0, forming the initial feature 0000100.
S302: and splicing the initial features corresponding to the at least two sample features according to the arrangement sequence of the at least two feature trees to obtain the original high-order combination features of the One-Hot coding form corresponding to the original training data.
In this example, the initial forest model based on the tree model includes at least two feature trees arranged in sequence, each feature tree corresponds to one sample feature, and the arrangement order of the at least two feature trees reflects the order of the data fields formed by the at least two sample features. Specifically, if the initial forest model constructed based on the tree model includes N feature trees and the i-th feature tree Y_i has n_i initial leaf nodes, the arrangement order of the feature trees is determined by i, and the high-order combined feature formed based on the initial forest model takes the form |S_1|S_2|S_3|…|S_N|, where | delimits a data field and S_i is the initial feature output by the i-th feature tree Y_i, i.e., the value of the data field of the corresponding sample feature.
In one example, if gender is the 1st sample feature and income is the 2nd sample feature, and in the original training data the gender is male and the income is 12000, then the initial feature S_1 formed by the 1st feature tree Y_1 is 10 and the initial feature S_2 formed by the 2nd feature tree Y_2 is 0000100. The initial features corresponding to the at least two sample features are spliced to form |10|0000100|S_3|…|S_N|, which is determined as the original high-order combined feature in One-Hot coded form corresponding to the original training data.
In the sample model training method provided in this embodiment, the feature data corresponding to at least two sample features can be respectively input into the feature trees corresponding to those sample features for processing, so as to convert the feature data into initial features in One-Hot coded form; all initial features are then spliced according to the arrangement order of the at least two feature trees, thereby quickly forming an original high-order combined feature comprising at least two data fields, each data field adopting a One-Hot coded value, which facilitates subsequent DeepFM model training.
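A minimal sketch of steps S301 and S302 under simplifying assumptions: each feature tree is modeled as a single-feature scikit-learn DecisionTreeClassifier (the patent equally allows XGBoost or LightGBM), DecisionTreeClassifier.apply() returns the leaf node each sample falls on, and the per-tree One-Hot initial features are spliced in tree order:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))            # 2 sample features, e.g. age and income
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # sample label

trees, leaf_ids = [], []
for i in range(X.shape[1]):              # one feature tree per sample feature
    tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X[:, [i]], y)
    trees.append(tree)
    leaf_ids.append(np.unique(tree.apply(X[:, [i]])))  # leaf node ids of tree i

def original_high_order_feature(x_row: np.ndarray) -> np.ndarray:
    """Splice the per-tree One-Hot initial features in tree order: |S_1|S_2|...|S_N|."""
    fields = []
    for i, (tree, ids) in enumerate(zip(trees, leaf_ids)):
        leaf = tree.apply(x_row[[i]].reshape(1, -1))[0]  # leaf the i-th feature falls on
        fields.append((ids == leaf).astype(int))         # data field size = number of leaves
    return np.concatenate(fields)

print(original_high_order_feature(X[0]))  # spliced One-Hot data fields
```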
In an embodiment, the original training data further comprises a time tag, i.e., information recording the time associated with the original training data. As shown in fig. 5, performing stability screening based on the sample labels and the original high-order combined features and determining effective leaf nodes specifically includes the following steps:
s501: and performing saturation analysis based on the time labels and the original high-order combined characteristics to obtain a saturation analysis result corresponding to each sample characteristic.
Wherein, the saturation analysis is a process of analyzing whether data is saturated on a time distribution from a time dimension to determine whether the data is stable. The saturation analysis result corresponding to each sample feature is used for reflecting whether the original high-order combination feature corresponding to a certain sample feature is saturated or not in time distribution.
As an example, all original high-order combined features may be grouped based on the time tag corresponding to each original high-order combined feature. Since each original high-order combined feature includes sample feature values corresponding to at least two sample features, and each sample feature value corresponds to one initial leaf node, the proportion of the number of sample feature values falling on each initial leaf node to the total number of sample feature values can be computed within each group, and the saturation analysis results of the different initial leaf nodes can then be determined based on the proportion values computed for the different groups. For example, suppose a sample feature corresponds to 4 initial leaf nodes, and the proportion values of the 1st to 4th initial leaf nodes are determined to be 0%, 40%, 40% and 20% in the 1st grouping, 5%, 40%, 35% and 20% in the 2nd grouping, and 20%, 35%, 30% and 15% in the 3rd grouping. The size of the fluctuation can be determined based on the maximum difference among the proportion values of each node across the groups: the proportion value of the 1st initial leaf node fluctuates strongly, while those of the 2nd to 4th initial leaf nodes fluctuate only slightly, thereby giving the saturation analysis result corresponding to the sample feature.
In a specific embodiment, the step S501, namely performing saturation analysis based on the time tag and the original high-order combined feature to obtain a saturation analysis result corresponding to each sample feature, specifically includes the following steps:
s5011: and grouping the original high-order combination characteristics corresponding to the time labels based on the time grouping period to obtain at least two time characteristic groups.
The time grouping period is a preset period for time division, and may be a day, a week, a month, a quarter or a year. A time feature group is a set storing all original high-order combined features whose time tags fall within the corresponding time grouping period. In this example, the original high-order combined features may be divided into T time feature groups based on the time grouping period, where T ≥ 2.
As an example, if the time grouping period is a month, all original high-order combined features may be divided into 12 time feature groups, and the time feature group of each original high-order combined feature is determined based on its time tag. For example, if a user accesses the system on January 10, the time tag recorded in the resulting user portrait data (i.e., the original training data) is January 10, and the formed original high-order combined feature is divided into the time feature group corresponding to January.
S5012: counting a first feature quantity of original high-order combination features in the time feature group, counting a second feature quantity of the original high-order combination features in initial leaf nodes corresponding to the same sample feature in the time feature group, and determining the current saturation of each initial leaf node based on the first feature quantity and the second feature quantity.
The first feature quantity is the number of all original high-order combined features in the t-th time feature group, denoted K_t; thus, in the t-th time feature group, the number of original high-order combined features corresponding to each sample feature (i.e., each data field) is K_t. Among the original high-order combined features of the t-th time feature group, each original high-order combined feature includes at least two data fields corresponding to sample features, and each data field corresponds to initial leaf nodes in the initial forest model. The number of original high-order combined features in the t-th time feature group that fall on the j-th initial leaf node is counted and determined as the second feature quantity K_{t,j}, where Σ_{j=1}^{g} K_{t,j} = K_t and g is the number of initial leaf nodes corresponding to the data field in question. Based on the first feature quantity K_t and the second feature quantity K_{t,j}, the current saturation of the j-th initial leaf node in the t-th time feature group is determined as P_{t,j} = K_{t,j} / K_t.
For example, if the time grouping period is a month, the number of time feature groups formed is T = 12, and the first feature quantity of the original high-order combined features in the time feature group corresponding to January is K_1. If the size of the 1st data field is 4, i.e., g = 4, corresponding to 4 initial leaf nodes with data field contents 1000, 0100, 0010 and 0001, then the numbers of original high-order combined features corresponding to 1000, 0100, 0010 and 0001 are counted as K_{1,1}, K_{1,2}, K_{1,3} and K_{1,4} respectively, and the current saturations of the 1st to 4th initial leaf nodes are P_{1,1} = K_{1,1}/K_1, P_{1,2} = K_{1,2}/K_1, P_{1,3} = K_{1,3}/K_1 and P_{1,4} = K_{1,4}/K_1.
S5013: and calculating the standard deviation of the current saturation of the same initial leaf node in at least two time feature groups to obtain a saturation analysis result corresponding to each sample feature.
In this example, a standard deviation calculation is performed on the current saturations corresponding to the same initial leaf node across the T time feature groups, i.e., the standard deviation formula is applied to the current saturations of that initial leaf node to determine the saturation standard deviation corresponding to the initial leaf node, which is determined as its saturation analysis result. Generally speaking, the smaller the saturation standard deviation, the more uniform and stable the initial leaf node's distribution over time across all original high-order combined features, and the smaller the fluctuation; the larger the saturation standard deviation, the less uniform and stable its distribution over time, and the larger the fluctuation.
For example, to compute the saturation analysis result of the 1st initial leaf node, the standard deviation of its current saturations in the 1st to T-th time feature groups is calculated, i.e., the standard deviation formula is applied to P_{1,1}, P_{2,1}, …, P_{T,1}, obtaining the saturation standard deviation of the 1st initial leaf node; and so on, the saturation standard deviations of all initial leaf nodes in the initial forest model are obtained, and thus the saturation analysis result corresponding to each sample feature.
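The saturation computation of steps S5011 to S5013 can be sketched as follows, under an assumed data layout: one row per original high-order combined feature, recording the month of its time tag (the time grouping period) and the initial leaf node it falls on within one data field:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "month": rng.integers(1, 13, size=1000),  # time feature group t (T = 12)
    "leaf":  rng.integers(0, 4, size=1000),   # initial leaf node j (g = 4)
})

# S5012: K_{t,j} counts per (group, leaf); current saturation P_{t,j} = K_{t,j} / K_t
counts = df.groupby(["month", "leaf"]).size().unstack(fill_value=0)  # K_{t,j}
saturation = counts.div(counts.sum(axis=1), axis=0)                  # P_{t,j}

# S5013: saturation standard deviation of each leaf across the T groups
sat_std = saturation.std(axis=0)
print(sat_std)  # saturation analysis result, one value per initial leaf node
```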
S502: and performing importance analysis based on the sample label and the original high-order combination characteristics to obtain an importance analysis result corresponding to each sample characteristic.
The importance analysis is a process of determining how strongly a given sample feature influences the model training purpose. The importance analysis result corresponding to each sample feature can be understood as the degree of influence of that sample feature on the model training purpose. For example, if the model training purpose is to analyze whether a user accesses the system at a characteristic time (e.g., early morning), the analysis may show that the user's occupation has a higher influence on system access than the user's gender; similarly, when analyzing whether a user purchases a certain product, the analysis may show that the user's gender has a higher influence on the purchase than the user's occupation. Generally, the higher a sample feature's influence on the model training purpose and the better its importance analysis result, the more the leaf nodes corresponding to that sample feature need to be retained. The importance analysis can therefore be performed on the original high-order combined features based on the model training purpose determined by the sample labels, to determine the importance analysis result corresponding to each sample feature.
In a specific embodiment, the step S502, namely performing importance analysis based on the sample label and the original high-order combined features and obtaining an importance analysis result corresponding to each sample feature, specifically includes the following steps:
S5021: From the original high-order combined features whose sample labels match the model training purpose, counting the third feature quantity of original high-order combined features falling on each initial leaf node corresponding to the same sample feature, and determining the sample feature value with the maximum third feature quantity as the standard feature value corresponding to that sample feature.
In this example, only the original high-order combined features whose sample labels match the model training purpose are analyzed, which ensures that the finally trained target forest model can more accurately extract, from the multiple sample features, the model training samples related to the model training purpose, and helps guarantee the accuracy and timeliness of subsequent model training. For example, if the model training purpose is to analyze whether a user intends to purchase a certain product, only the original high-order combined features whose sample label is purchased are extracted for subsequent analysis, and those whose sample label is not purchased are not extracted.
As an example, the original high-order combined feature includes N data fields, the size of the i-th data field is n_i, each data field corresponds to one sample feature, and n_i equals the number of initial leaf nodes corresponding to the i-th sample feature. Step S5021 may therefore proceed as follows. First, count the third feature quantity of original high-order combined features falling on each initial leaf node corresponding to the i-th sample feature, i.e., count the third feature quantity L_{i,j} of original high-order combined features classified into the j-th initial leaf node corresponding to the i-th sample feature. For example, if the size of the 1st data field is 4, corresponding to 4 initial leaf nodes with data field contents 1000, 0100, 0010 and 0001, count the third feature quantity L_{1,1} of features whose sample feature value in the first data field is 1000, the third feature quantity L_{1,2} for 0100, the third feature quantity L_{1,3} for 0010, and the third feature quantity L_{1,4} for 0001. Then, determine the sample feature value with the maximum third feature quantity as the standard feature value corresponding to the sample feature, i.e., the sample feature value corresponding to the maximum of L_{1,1}, L_{1,2}, L_{1,3} and L_{1,4} is determined as the standard feature value; for example, if L_{1,1} is the maximum, 1000 is determined as the standard feature value of the 1st sample feature. It can be understood that each sample feature corresponds to a standard feature value in 0/1 form.
S5022: and determining a current correlation coefficient of each sample characteristic by using a sample characteristic value and a standard characteristic value corresponding to each sample characteristic in the original high-order combined characteristics.
The current correlation coefficient is obtained by performing a correlation calculation between the sample feature value corresponding to the sample feature in each original high-order combined feature and the standard feature value. Since the sample feature value and the standard feature value corresponding to each sample feature in the original high-order combined features are binary 0/1 data, the correlation between them can be judged using the Jaccard coefficient J(A, B) = M11 / (M01 + M10 + M11), where J(A, B) is the current correlation coefficient; A and B are the standard feature value and the sample feature value to be correlated; M00 is the number of bit positions where both A and B are 0; M01 is the number of bit positions where A is 0 and B is 1; M10 is the number of bit positions where A is 1 and B is 0; and M11 is the number of bit positions where both A and B are 1 (M00 does not enter the Jaccard coefficient).
S5023: and calculating the standard deviation of the current correlation coefficients corresponding to all the original high-order combination characteristics to obtain the importance analysis result corresponding to each sample characteristic.
In this example, a standard deviation formula may be used to compute the standard deviation of the current correlation coefficients corresponding to the same sample feature across all original high-order combined features, determining the importance standard deviation corresponding to that sample feature, which is determined as its importance analysis result. Generally, the smaller the importance standard deviation of a sample feature, the more uniform and stable the distribution of its sample feature values across all original high-order combined features, and the smaller the fluctuation; conversely, the larger the importance standard deviation, the less uniform and stable the distribution, and the larger the fluctuation.
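A sketch of steps S5021 to S5023 under assumed inputs: field holds the One-Hot data field of one sample feature over the original high-order combined features whose sample label matches the model training purpose:

```python
import numpy as np

rng = np.random.default_rng(0)
field = np.eye(4, dtype=int)[rng.integers(0, 4, size=500)]  # (500, 4): g = 4 leaf nodes

# S5021: the standard feature value is the most frequent sample feature value.
standard = np.eye(4, dtype=int)[field.sum(axis=0).argmax()]

# S5022: Jaccard coefficient J(A, B) = M11 / (M01 + M10 + M11) for binary vectors.
def jaccard(a: np.ndarray, b: np.ndarray) -> float:
    m11 = np.sum((a == 1) & (b == 1))
    m01 = np.sum((a == 0) & (b == 1))
    m10 = np.sum((a == 1) & (b == 0))
    return m11 / (m01 + m10 + m11)

coeffs = np.array([jaccard(standard, row) for row in field])

# S5023: the importance standard deviation of the current correlation coefficients.
print(coeffs.std())  # importance analysis result for this sample feature
```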
S503: and if the saturation analysis result conforms to the saturation standard threshold and the importance analysis result conforms to the importance standard threshold, determining the initial leaf node in the initial forest model corresponding to the sample characteristics as an effective leaf node.
The saturation standard threshold is a preset threshold for evaluating whether the saturation meets the standard. The importance standard threshold is a preset threshold for evaluating whether the importance meets the standard. In this example, the saturation analysis result is compared with the saturation standard threshold, and the importance analysis result is compared with the importance standard threshold. If the importance analysis result corresponding to a sample feature meets the importance standard threshold, i.e., the importance standard deviation of the sample feature is smaller than the importance standard threshold, the sample feature values corresponding to that sample feature are stable and fluctuate little. If the saturation analysis result of an initial leaf node meets the saturation standard threshold, i.e., the saturation standard deviation of the initial leaf node is smaller than the saturation standard threshold, that initial leaf node is uniform and stable over time and fluctuates little. In this case, the initial leaf node in the initial forest model corresponding to the sample feature is determined as an effective leaf node, so that the effective leaf nodes can subsequently be used to prune the initial forest model into the effective forest model. An effective leaf node can thus be understood as an initial leaf node whose saturation analysis result meets the saturation standard threshold, within the feature tree of a sample feature whose importance analysis result meets the importance standard threshold.
In the sample model training method provided by this embodiment, saturation analysis is performed based on the time label and the original high-order combination features, so that the obtained saturation analysis result can determine, from the angle of time distribution, whether all the original high-order combination features are uniform and stable; importance analysis is performed based on the sample label and the original high-order combination features, so that the obtained importance analysis result can reflect how well the original high-order combination features match the model training purpose and determine whether their data distribution is uniform and stable; and the initial leaf nodes whose saturation analysis result meets the saturation standard threshold and whose importance analysis result meets the importance standard threshold are determined as effective leaf nodes. Considering the saturation and importance analysis results together removes the initial leaf nodes with large fluctuation, and avoids the problem that, in subsequent model training, the model training samples output based on the effective forest model cause overfitting, so that a stable DeepFM model cannot be learned or the accuracy of the trained DeepFM model's output suffers.
In an embodiment, as shown in fig. 6, step S205, namely, performing LR regularization screening based on the sample labels and the effective high-order combination features, and determining the target leaf node specifically includes the following steps:
S601: and dividing all effective high-order combination features into a training set and a verification set, performing LR modeling based on the effective high-order combination features in the training set, and adjusting an L2 regularization coefficient so that the AUC of the effective high-order combination features in the verification set is maximized, so as to obtain a target LR model.
As an example, all the effective high-order combination features may be divided into a training set and a verification set at a 7:3 ratio; LR modeling is then performed by taking the effective high-order combination features in the training set as input and the corresponding sample labels as output, and the L2 regularization coefficient is adjusted so that the target LR model determined by the LR modeling is smoother and the AUC formed by the effective high-order combination features of the verification set in the target LR model is maximized, completing the modeling process of the target LR model.
For example, the L2 regularization coefficient may be adjusted using an iterative grid search so that the AUC on the verification set is maximized: first try 1, 0.1, 0.01, 0.001, and 0.0001; if 0.001 turns out to be the best, design the next set of experiments around 0.001, e.g., [0.0005, 0.001, 0.002]; repeat this step several times until there is no significant improvement, at which point the modeling process of the target LR model is complete and the given L2 regularization coefficient is determined. Here, AUC (Area Under Curve) is defined as the area enclosed by the ROC curve and the coordinate axes; obviously the value of this area is not larger than 1. Since the ROC curve generally lies above the line y = x, the AUC ranges between 0.5 and 1. The closer the AUC is to 1.0, the more reliable the detection method; at 0.5 the reliability is lowest and the method has little application value. The L2-regularized model is called ridge regression, which can prevent model overfitting.
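A runnable sketch of this tuning loop using scikit-learn (the library choice, the grid values, and the fixed number of refinement rounds standing in for "until there is no significant improvement" are assumptions; scikit-learn exposes the L2 strength as its inverse, the parameter C):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def fit_target_lr(X_train, y_train, X_val, y_val, rounds=3):
    """Iterative grid search over the L2 regularization coefficient,
    keeping the value that maximizes AUC on the verification set."""
    grid = [1, 0.1, 0.01, 0.001, 0.0001]
    best_lam, best_auc = None, -np.inf
    for _ in range(rounds):
        for lam in grid:
            model = LogisticRegression(penalty="l2", C=1.0 / lam, max_iter=1000)
            model.fit(X_train, y_train)
            auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
            if auc > best_auc:
                best_auc, best_lam = auc, lam
        grid = [best_lam * 0.5, best_lam, best_lam * 2.0]  # refine around best
    target_lr = LogisticRegression(penalty="l2", C=1.0 / best_lam, max_iter=1000)
    return target_lr.fit(X_train, y_train), best_lam, best_auc
```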
S602: and acquiring an absolute value of an LR coefficient corresponding to each effective leaf node in the effective forest model based on the target LR model.
In this example, for the target LR model corresponding to the effective forest model and determined under the given L2 regularization coefficient, the LR coefficient corresponding to each effective leaf node in the effective forest model is determined from the target LR model, and the absolute value of each LR coefficient is obtained. An LR coefficient here is the coefficient of a variable in the target LR model determined after LR modeling, where each variable corresponds to one effective leaf node.
S603: and selecting a preset number of effective leaf nodes with the largest absolute values of the LR coefficients, and determining these effective leaf nodes as target leaf nodes.
The preset number is the preset number of target leaf nodes to be retained, denoted X. If the number of effective leaf nodes in the effective forest model is 1000, the absolute values of the LR coefficients corresponding to the effective leaf nodes are sorted, the top X effective leaf nodes are selected, and these are determined as target leaf nodes, so that the LR model formed by the selected X target leaf nodes has essentially the same AUC as the target LR model formed by all the effective leaf nodes, thereby identifying the leaf nodes most relevant to the sample label.
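A minimal sketch of this top-X selection from a fitted target LR model (all names are illustrative, and each model coefficient is assumed to correspond to one effective leaf node, in order):

```python
import numpy as np

def select_target_leaves(target_lr, leaf_ids, x):
    """Rank effective leaf nodes by the absolute value of their LR
    coefficients and keep the X largest."""
    abs_coefs = np.abs(target_lr.coef_.ravel())  # one coefficient per leaf
    top = np.argsort(abs_coefs)[::-1][:x]        # indices of the X largest
    return [leaf_ids[i] for i in top]
```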
For example, when the initial forest model constructed based on XGBOOST includes 300 feature trees with 10000 initial leaf nodes in total, stability screening may yield an effective forest model containing only 8000 effective leaf nodes. If there are 100000 effective high-order combination features in total, they are divided into a training set and a verification set at a 7:3 ratio. The 70000 effective high-order combination features in the training set are then used for LR modeling: they form a [70000, 8000] matrix fed to the target LR model, with a [70000, 1] 0/1 vector as the y dimension. The target LR model is run and the LR coefficients corresponding to the 8000 effective leaf nodes are inspected; since these coefficients can be positive or negative and large or small, the absolute values of all the LR coefficients are taken, and the top X effective leaf nodes (e.g., X = 3000 or 4000) are selected. Then the 30000 effective high-order combination features in the verification set are used to compare the top X effective leaf nodes against all 8000 effective leaf nodes; if their AUCs in the target LR model are essentially consistent, that is, the AUC similarity reaches a similarity threshold or the AUC difference is smaller than a preset difference, the X effective leaf nodes are determined as target leaf nodes. The number of target leaf nodes is smaller, yet essentially the same effect as all the effective leaf nodes is achieved, which guarantees that the dimensionality of the model training samples output based on the target forest model is smaller without affecting the accuracy of the model training purpose.
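The final consistency check could be sketched as follows, assuming two LR models have been fitted, one on all 8000 effective leaf features and one on the top-X subset (the 0.01 tolerance is an assumption standing in for the preset difference):

```python
from sklearn.metrics import roc_auc_score

def auc_consistent(lr_full, lr_topx, X_val_full, X_val_topx, y_val,
                   max_diff=0.01):
    """True if the top-X leaf subset loses almost no AUC on the
    verification set compared with the full set of effective leaves."""
    auc_full = roc_auc_score(y_val, lr_full.predict_proba(X_val_full)[:, 1])
    auc_topx = roc_auc_score(y_val, lr_topx.predict_proba(X_val_topx)[:, 1])
    return abs(auc_full - auc_topx) <= max_diff
```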
In the sample model training method provided by this embodiment, LR modeling is performed using the effective high-order combination features, and adjusting the L2 regularization coefficient makes the target LR model smoother and its results more accurate. The absolute value of the LR coefficient corresponding to each effective leaf node in the target LR model is then computed, and a preset number of effective leaf nodes with the largest absolute LR coefficients are selected as target leaf nodes. This further reduces the number of leaf nodes in the resulting target forest model while keeping the results obtained from model training samples output by the target forest model close to those obtained from samples output by the effective forest model, thereby ensuring that the dimensionality of the model training samples output based on the target forest model is smaller without affecting the accuracy of model training.
The sample generation method provided by the embodiment of the present invention may be applied to a computer device shown in fig. 1, where the computer device may be a server. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a nonvolatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used to store data employed or generated during execution of the sample generation method. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a sample generation method.
In an embodiment, as shown in fig. 7, a sample generation method is provided, which is described taking its application to the computer device shown in fig. 1 as an example; the sample generation method includes:
S701: and acquiring data to be processed, wherein the data to be processed comprises characteristic data corresponding to at least two sample characteristics.
Here, the data to be processed is unprocessed data used for generating model training samples. The data to be processed includes feature data corresponding to at least two sample features, which corresponds to the feature data of the at least two sample features in the original training data in the above embodiment; however, the data to be processed carries no sample label, so a model training sample needs to be formed first and then input into the DeepFM model for model training in order to learn the corresponding sample label.
S702: and inputting the characteristic data corresponding to the at least two sample characteristics into the target forest model determined by the sample model training method, and determining the target high-order combined characteristics in One-Hot coding form output by the target forest model as a model training sample of the DeepFM model.
In the sample generation method provided by this embodiment, the target forest model determined in the above embodiment is used to convert the feature data of at least two sample features of the data to be processed into high-order combination features in One-Hot coding form containing at least two data fields, forming model training samples that can be input into the DeepFM model for model training. Because the target forest model is determined by performing stability screening and LR regularization screening on the initial leaf nodes of the initial forest model, this twofold screening reduces the dimensionality of the formed high-order combination features, so that when the model training samples output by the target forest model are input into the DeepFM model for model training, system resources occupied by the training process are saved and training time is shortened. Moreover, the target leaf nodes in the target forest model match the model training purpose of the DeepFM model, and leaf nodes with low stability have been filtered out, which reduces overfitting of the model training samples output by the target forest model during DeepFM model training, saves system resources occupied in the model training process, and improves the accuracy of the model training samples used to train the DeepFM model.
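As an illustration of S702, assuming the target forest model is realized as an XGBoost booster whose leaf nodes have been pruned as described above (the encoder handling and all names are assumptions, not taken from this disclosure):

```python
import numpy as np
import xgboost as xgb
from sklearn.preprocessing import OneHotEncoder

def generate_training_samples(target_forest: xgb.Booster, X: np.ndarray):
    """Map each row of feature data to the leaf it reaches in every
    feature tree, then One-Hot encode the leaf indices to form the
    target high-order combination features fed to the DeepFM model."""
    leaf_idx = target_forest.predict(xgb.DMatrix(X), pred_leaf=True)
    # In practice the encoder would be fit once on training data and reused;
    # sparse_output=False requires scikit-learn >= 1.2.
    return OneHotEncoder(sparse_output=False).fit_transform(leaf_idx)
```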
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In an embodiment, a sample model training apparatus is provided, and the sample model training apparatus corresponds to the sample model training method in the above embodiment one to one. As shown in fig. 8, the sample model training apparatus includes the following functional modules, which are described in detail as follows:
an original training data obtaining module 801, configured to obtain original training data, where the original training data includes a sample label and feature data corresponding to at least two sample features.
The original high-order combined feature obtaining module 802 is configured to input original training data into an initial forest model constructed based on a tree model, and obtain original high-order combined features in an One-Hot coding form corresponding to the original training data, where the initial forest model includes at least two feature trees arranged in sequence, each feature tree corresponds to a sample feature and includes at least two initial leaf nodes.
And the effective forest model obtaining module 803 is used for performing stability screening based on the sample label and the original high-order combination characteristic, determining effective leaf nodes, and pruning the initial leaf nodes of the initial forest model based on the effective leaf nodes to obtain the effective forest model.
And the effective high-order combined feature obtaining module 804 is used for inputting the original training data into the effective forest model and obtaining effective high-order combined features of the One-Hot coding form corresponding to the original training data.
And the target forest model obtaining module 805 is used for performing LR regularized screening based on the sample labels and the effective high-order combination characteristics, determining target leaf nodes, and performing pruning on the effective leaf nodes in the effective forest model based on the target leaf nodes to obtain the target forest model.
Preferably, the original high-order combination feature obtaining module 802 includes:
and the initial characteristic acquisition unit is used for respectively inputting the characteristic data corresponding to at least two sample characteristics into the characteristic trees corresponding to the sample characteristics for processing to acquire the initial characteristics corresponding to the sample characteristics.
And the initial feature splicing unit is used for splicing the initial features corresponding to the at least two sample features according to the arrangement sequence of the at least two feature trees to obtain the original high-order combination features of the One-Hot coding form corresponding to the original training data.
Preferably, the raw training data further comprises a time tag. An effective forest model obtaining module 803, including:
and the saturation analysis result acquisition unit is used for carrying out saturation analysis based on the time labels and the original high-order combination characteristics and acquiring a saturation analysis result corresponding to each sample characteristic.
And the importance analysis result acquisition unit is used for carrying out importance analysis based on the sample label and the original high-order combination characteristic and acquiring an importance analysis result corresponding to each sample characteristic.
And the effective leaf node obtaining unit is used for determining the initial leaf node in the initial forest model corresponding to the sample characteristics as the effective leaf node if the saturation analysis result conforms to the saturation standard threshold and the importance analysis result conforms to the importance standard threshold.
Preferably, the saturation analysis result acquisition unit includes:
and the time feature group dividing subunit is used for grouping the original high-order combination features corresponding to the time labels based on the time grouping period to obtain at least two time feature groups.
The current saturation acquiring subunit is configured to count a first feature quantity of original high-order combination features in the time feature group, count a second feature quantity of the original high-order combination features in an initial leaf node corresponding to the same sample feature in the time feature group, and determine the current saturation of each initial leaf node based on the first feature quantity and the second feature quantity.
And the saturation analysis result acquisition subunit is used for calculating the standard deviation of the current saturation of the same initial leaf node in at least two time feature groups and acquiring the saturation analysis result corresponding to each sample feature.
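A hedged sketch of these saturation subunits, simplified to the leaves of one sample feature's tree (`groups` maps each time feature group to its original high-order combination features, `leaf_of` maps a feature to the initial leaf node it falls into; all names are illustrative):

```python
from collections import Counter
import numpy as np

def saturation_analysis(groups, leaf_of):
    """Per initial leaf node: current saturation = second feature quantity /
    first feature quantity within each time feature group, then the standard
    deviation of that saturation across groups."""
    leaves = {leaf_of[f] for feats in groups.values() for f in feats}
    sats = {leaf: [] for leaf in leaves}
    for feats in groups.values():
        first = len(feats)                       # first feature quantity
        counts = Counter(leaf_of[f] for f in feats)
        for leaf in leaves:                      # second feature quantity
            sats[leaf].append(counts.get(leaf, 0) / first if first else 0.0)
    return {leaf: float(np.std(v)) for leaf, v in sats.items()}
```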
Preferably, the importance analysis result acquisition unit includes:
and the standard characteristic value obtaining subunit is used for counting the third characteristic quantity of the original high-order combined characteristics in the initial leaf node corresponding to the same sample characteristic from the original high-order combined characteristics matched with the sample label and the model training purpose, and determining the sample characteristic value with the maximum third characteristic quantity as the standard characteristic value corresponding to the sample characteristic.
And the current correlation coefficient obtaining subunit is used for determining a current correlation coefficient of each sample characteristic according to the sample characteristic value and the standard characteristic value corresponding to each sample characteristic in the original high-order combined characteristic.
And the importance analysis result acquisition subunit is used for calculating the standard deviation of the current correlation coefficients corresponding to all the original high-order combination features and acquiring the importance analysis result corresponding to each sample feature.
Preferably, the target forest model obtaining module 805 includes:
and the target LR model acquisition unit is used for dividing all effective high-order combinations into a training set and a verification set, performing LR modeling based on the effective high-order combination characteristics in the training set, and adjusting an L2 regularization coefficient to enable the AUC of the effective high-order combination characteristics in the verification set to be maximum so as to acquire a target LR model.
And the coefficient absolute value acquisition unit is used for acquiring the absolute value of the LR coefficient corresponding to each effective leaf node in the effective forest model based on the target LR model.
And the target leaf node acquisition unit is used for selecting a preset number of effective leaf nodes with larger absolute values of the LR coefficients and determining the effective leaf nodes as the target leaf nodes.
For the specific definition of the sample model training device, reference may be made to the above definition of the sample model training method, which is not described herein again. The modules in the sample model training device can be wholly or partially realized by software, hardware and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a sample generation apparatus is provided, which corresponds to the sample generation method in the above embodiments one to one. As shown in fig. 9, the sample generation apparatus includes the following functional blocks, each of which is described in detail as follows:
a to-be-processed data obtaining module 901, configured to obtain to-be-processed data, where the to-be-processed data includes feature data corresponding to at least two sample features.
A model training sample obtaining module 902, configured to input feature data corresponding to at least two sample features into the target forest model determined by the sample model training method, and determine a target high-order combination feature in a One-Hot coding form output by the target forest model as a model training sample of the deep fm model.
In an embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the sample model training method in the foregoing embodiments is implemented, for example, as shown in S201 to S205 in fig. 2, or as shown in fig. 3, 5, and 6, and is not described here again to avoid repetition. Alternatively, the processor implements the functions of each module/unit in the embodiment of the sample model training apparatus when executing the computer program, for example, the functions of each module/unit shown in fig. 8, and are not described herein again to avoid repetition. Alternatively, when the processor executes the computer program, the sample generation method in the foregoing embodiments is implemented, for example, in S701-S702 shown in fig. 7, and details are not repeated here to avoid repetition. Alternatively, the processor implements the functions of each module/unit in the embodiment of the sample generation apparatus when executing the computer program, for example, the functions of each module/unit shown in fig. 9, and are not described here again to avoid repetition.
In an embodiment, a computer-readable storage medium is provided, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the method for training a sample model in the foregoing embodiments is implemented, for example, S201 to S205 shown in fig. 2, or as shown in fig. 3, fig. 5, and fig. 6, which is not described herein again to avoid repetition. Alternatively, the computer program, when executed by the processor, implements the functions of the modules/units in the embodiment of the sample model training apparatus, such as the functions of fig. 8, which are not described herein again to avoid redundancy. Alternatively, the computer program is executed by the processor to implement the sample generation method in the above embodiments, for example, S701-S702 shown in fig. 7, and details are not repeated here to avoid repetition. Alternatively, the computer program, when executed by the processor, implements the functions of the modules/units in the embodiment of the sample generation apparatus, for example, the functions of the modules/units shown in fig. 9, and are not described herein again to avoid repetition.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing related hardware; the computer program can be stored in a non-volatile computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A sample model training method is characterized by comprising the following steps:
acquiring original training data, wherein the original training data comprises a sample label and characteristic data corresponding to at least two sample characteristics;
inputting the original training data into an initial forest model constructed based on a tree model, and acquiring original high-order combination features of a One-Hot coding form corresponding to the original training data, wherein the initial forest model comprises at least two feature trees which are sequentially arranged, each feature tree corresponds to One sample feature and comprises at least two initial leaf nodes;
performing stability screening based on the sample label and the original high-order combination characteristics, determining effective leaf nodes, and pruning the initial leaf nodes of the initial forest model based on the effective leaf nodes to obtain an effective forest model;
inputting the original training data into the effective forest model, and acquiring effective high-order combination characteristics of One-Hot coding form corresponding to the original training data;
and performing LR regularized screening based on the sample label and the effective high-order combination characteristics, determining a target leaf node, and performing pruning on the effective leaf node in the effective forest model based on the target leaf node to obtain the target forest model.
2. The sample model training method of claim 1, wherein the raw training data further comprises a time tag;
the stability screening based on the sample label and the original high-order combination characteristics to determine effective leaf nodes comprises:
performing saturation analysis based on the time labels and the original high-order combined features to obtain a saturation analysis result corresponding to each sample feature;
performing importance analysis based on the sample label and the original high-order combined features to obtain an importance analysis result corresponding to each sample feature;
and if the saturation analysis result accords with a saturation standard threshold and the importance analysis result accords with an importance standard threshold, determining an initial leaf node in the initial forest model corresponding to the sample characteristics as an effective leaf node.
3. The method for training the sample model according to claim 2, wherein the performing saturation analysis based on the time tag and the original high-order combined features to obtain a saturation analysis result corresponding to each sample feature comprises:
grouping original high-order combination characteristics corresponding to the time labels based on a time grouping period to obtain at least two time characteristic groups;
counting a first feature quantity of original high-order combination features in the time feature group, counting a second feature quantity of the original high-order combination features in initial leaf nodes corresponding to the same sample feature in the time feature group, and determining the current saturation of each initial leaf node based on the first feature quantity and the second feature quantity;
and calculating the standard deviation of the current saturation of the same initial leaf node in at least two time feature groups to obtain a saturation analysis result corresponding to each sample feature.
4. The method for training the sample model according to claim 2, wherein the performing importance analysis based on the sample label and the original high-order combined feature to obtain an importance analysis result corresponding to each sample feature comprises:
counting a third feature quantity of original high-order combination features in an initial leaf node corresponding to the same sample feature from the original high-order combination features matched with the sample label and the model training purpose, and determining a sample feature value with the maximum third feature quantity as a standard feature value corresponding to the sample feature;
determining a current correlation coefficient of each sample characteristic according to a sample characteristic value corresponding to each sample characteristic in original high-order combined characteristics and the standard characteristic value;
and calculating the standard deviation of the current correlation coefficients corresponding to all the original high-order combination characteristics to obtain the importance analysis result corresponding to each sample characteristic.
5. The sample model training method of claim 1, wherein the performing LR regularization screening based on the sample labels and the valid high-order combination features to determine target leaf nodes comprises:
dividing all effective high-order combination features into a training set and a verification set, performing LR modeling based on the effective high-order combination features in the training set, and adjusting an L2 regularization coefficient so that the AUC of the effective high-order combination features in the verification set is maximized, so as to obtain a target LR model;
acquiring an absolute value of an LR coefficient corresponding to each effective leaf node in the effective forest model based on the target LR model;
and selecting a preset number of effective leaf nodes with the largest absolute values of the LR coefficients, and determining the selected effective leaf nodes as target leaf nodes.
6. A method of generating a sample, comprising:
acquiring data to be processed, wherein the data to be processed comprises characteristic data corresponding to at least two sample characteristics;
inputting characteristic data corresponding to at least two sample characteristics into a target forest model determined by the sample model training method of any one of claims 1 to 5, and determining target high-order combined characteristics in One-Hot coding form output by the target forest model as a model training sample of the DeepFM model.
7. A sample model training apparatus, comprising:
the system comprises an original training data acquisition module, a data processing module and a data processing module, wherein the original training data acquisition module is used for acquiring original training data which comprises sample labels and characteristic data corresponding to at least two sample characteristics;
the original high-order combined feature acquisition module is used for inputting the original training data into an initial forest model constructed based on a tree model, and acquiring original high-order combined features of an One-Hot coding form corresponding to the original training data, wherein the initial forest model comprises at least two feature trees which are sequentially arranged, each feature tree corresponds to One sample feature and comprises at least two initial leaf nodes;
the effective forest model obtaining module is used for performing stability screening based on the sample label and the original high-order combination characteristics, determining effective leaf nodes, and pruning the initial leaf nodes of the initial forest model based on the effective leaf nodes to obtain an effective forest model;
the effective high-order combined feature acquisition module is used for inputting the original training data into the effective forest model and acquiring effective high-order combined features of an One-Hot coding form corresponding to the original training data;
and the target forest model acquisition module is used for performing LR regularized screening based on the sample label and the effective high-order combination characteristics, determining target leaf nodes, and performing pruning on the effective leaf nodes in the effective forest model based on the target leaf nodes to acquire the target forest model.
8. A sample generation device, comprising:
the device comprises a to-be-processed data acquisition module, a to-be-processed data acquisition module and a to-be-processed data processing module, wherein the to-be-processed data acquisition module is used for acquiring to-be-processed data which comprises characteristic data corresponding to at least two sample characteristics;
a model training sample obtaining module, configured to input feature data corresponding to at least two sample features into a target forest model determined by the sample model training method according to any one of claims 1 to 5, and determine a target high-order combination feature in One-Hot coding form output by the target forest model as a model training sample of the DeepFM model.
9. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the sample model training method of any one of claims 1 to 5; alternatively, the processor, when executing the computer program, implements the sample generation method of claim 6.
10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out a method of training a sample model according to any one of claims 1 to 5; alternatively, the computer program, when executed by a processor, implements the sample generation method of claim 6.