CN112200271A - Training sample determination method and device, computer equipment and storage medium - Google Patents

Info

Publication number
CN112200271A
Authority
CN
China
Prior art keywords
target
training
sample
samples
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011288666.3A
Other languages
Chinese (zh)
Inventor
熊伟灼
杨青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Du Xiaoman Technology Beijing Co Ltd
Original Assignee
Shanghai Youyang New Media Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Youyang New Media Information Technology Co ltd filed Critical Shanghai Youyang New Media Information Technology Co ltd
Priority to CN202011288666.3A priority Critical patent/CN112200271A/en
Publication of CN112200271A publication Critical patent/CN112200271A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • G06F18/24137Distances to cluster centroïds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the application provide a training sample determination method and device, computer equipment, and a storage medium. The method determines, for each training sample in a training sample set, a target training sample obtained by dimension reduction; acquires at least one target reference sample cluster and proportion information, both obtained by clustering the target reference samples that result from dimension reduction of the reference samples in a reference sample set, where the reference samples are later in time than the training samples; clusters all target training samples according to the at least one target reference sample cluster to obtain a target training sample cluster corresponding to each target reference sample cluster; and then, according to the proportion information, determines the target training samples used for model training from each target training sample cluster. This addresses the attenuation of model performance caused by the time difference between training a model and using it, and the sampling also reduces the number of training samples, which speeds up model training.

Description

Training sample determination method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a training sample determination method and apparatus, a computer device, and a storage medium.
Background
Generally, the modeling process is as shown in FIG. 1 and mainly involves five steps: data acquisition, sample screening, data cleaning, feature engineering, and modeling.
In the prior art, sample screening selects all available training samples from the data according to business logic for machine learning modeling. The reasoning is that statistical modeling is essentially parameter estimation for a model, and for a parameter estimation problem, a larger training sample size tends to yield a smaller estimation error and thus more accurate parameter estimates.
However, this conclusion depends on the assumption that the training samples are independent and identically distributed. That assumption may not hold in practice, so model performance often degrades. In addition, as business data accumulates, the number of training samples used gradually increases, which lengthens the training time of the model.
Disclosure of Invention
In view of the above, the present invention provides a training sample determining method, apparatus, computer device and storage medium to reduce the attenuation of model performance and improve the training efficiency of the model. The technical scheme is as follows:
a training sample determination method, comprising:
determining a target training sample after dimension reduction processing of each training sample in a training sample set;
acquiring at least one target reference sample cluster and proportion information, wherein the target reference sample cluster is obtained by clustering target reference samples subjected to dimensionality reduction processing on reference samples in a reference sample set, the proportion information represents the proportion of the number of target reference samples in a first reference sample cluster and a second reference sample cluster in the at least one target reference sample cluster, and the reference samples are later than the training samples;
clustering all the target training samples according to the at least one target reference sample cluster to obtain target training sample clusters corresponding to the target reference sample clusters respectively;
and respectively determining target training samples for model training from each target training sample cluster according to the proportion information.
Preferably, the determining the target training sample after the dimension reduction processing of each training sample in the training sample set includes:
determining a training sample set composed of a plurality of training samples, the training samples being indicative of a plurality of dimensional features of a user;
performing feature classification on the multiple dimensional features indicated by the training sample to obtain at least one feature group indicated by the training sample, wherein different feature groups belong to different feature classes; one dimension feature belongs to only one feature group;
inputting the feature group into a pre-trained feature information determination model corresponding to the feature class to which the feature group belongs to obtain feature information for representing the feature group;
and the feature information of at least one feature group indicated by the training sample constitutes the target training sample after the dimension reduction processing of the training sample.
Preferably, the generation process of the feature information determination model corresponding to the target feature class includes:
determining a first sample for training the feature information determination model;
performing feature classification on a plurality of dimensional features of the first sample indication to obtain at least one feature group of the first sample indication;
generating a second sample from the group of features indicated by the first sample that belong to the target feature class;
and training the characteristic information determination model to be trained by using the second sample to generate a characteristic information determination model corresponding to the target characteristic category.
Preferably, the obtaining at least one target reference sample cluster and ratio information obtained by clustering the target reference samples subjected to the dimensionality reduction processing on the reference samples in the reference sample set includes:
determining a target reference sample after dimension reduction processing of each reference sample in a reference sample set;
clustering all the target reference samples to obtain at least one target reference sample cluster;
and generating proportion information according to the number of the target reference samples in each target reference sample cluster.
Preferably, the clustering all the target training samples according to the at least one target reference sample cluster to obtain target training sample clusters corresponding to each target reference sample cluster respectively includes:
calculating the distance between the target training sample and the central point of each target reference sample cluster in the at least one target reference sample cluster respectively;
determining a target reference sample cluster to which the target training sample belongs according to the distance between the target training sample and the central point of each target reference sample cluster in the at least one target reference sample cluster;
and determining all target training samples belonging to the same target reference sample cluster as a target training sample cluster corresponding to the target reference sample cluster.
Preferably, the determining the target training samples for model training from each target training sample cluster according to the ratio information includes:
determining the sampling number of each target training sample cluster according to the proportion information and the number of target training samples in each target training sample cluster;
and extracting, from each target training sample cluster, that cluster's sampling number of target training samples in order of increasing distance from the central point of the target training sample cluster.
Preferably, the determining the sampling number of each target training sample cluster according to the proportion information and the number of target training samples in each target training sample cluster includes:
determining the number of target training samples in each target training sample cluster;
respectively determining the sampling quantity of each target training sample cluster according to the quantity of target training samples in each target training sample cluster;
the ratio between the sampling number of the first target training sample cluster and the sampling number of the second target training sample cluster is the same as the ratio between the number of the target reference samples in the target reference sample cluster corresponding to the first target training sample cluster and the number of the target reference samples in the target reference sample cluster corresponding to the second target training sample cluster represented by the ratio information.
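As a minimal sketch of the sampling described above, the following Python derives per-cluster sampling numbers whose ratios follow the reference cluster sizes, then takes the samples nearest a cluster center first. The function names, the use of Euclidean distance, and the choice of the largest feasible sampling numbers are assumptions for illustration; the application does not prescribe them. When the sizes do not divide evenly, rounding down makes the ratios approximate.

```python
import math
from fractions import Fraction

def sampling_counts(train_sizes, ref_sizes):
    """Largest per-cluster sampling numbers whose ratios follow ref_sizes.

    Both arguments map a cluster id to a sample count; every reference
    cluster is assumed to be non-empty.
    """
    # The scarcest training cluster (relative to its reference share) caps the scale.
    scale = min(Fraction(train_sizes.get(c, 0), n) for c, n in ref_sizes.items())
    return {c: math.floor(scale * n) for c, n in ref_sizes.items()}

def sample_cluster(samples, center, count):
    """Return the `count` samples closest to `center`, nearest first."""
    return sorted(samples, key=lambda x: math.dist(x, center))[:count]

# Hypothetical sizes: reference clusters 30:10, training clusters of 90 and 50.
counts = sampling_counts({0: 90, 1: 50}, {0: 30, 1: 10})
print(counts)  # {0: 90, 1: 30}
```

Exact fractions are used for the scale so that the sampling numbers are not disturbed by floating-point rounding.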
A training sample determination apparatus comprising:
the target training sample determining unit is used for determining a target training sample after dimension reduction processing of each training sample in the training sample set;
the target reference sample cluster determining unit is used for acquiring at least one target reference sample cluster and proportion information, wherein the target reference sample cluster is obtained by clustering target reference samples subjected to dimensionality reduction processing on reference samples in a reference sample set, the proportion information represents the proportion of the number of target reference samples in a first reference sample cluster and a second reference sample cluster in the at least one target reference sample cluster, and the reference samples are later than the training samples;
a target training sample cluster determining unit, configured to perform clustering processing on all the target training samples according to the at least one target reference sample cluster to obtain target training sample clusters corresponding to each target reference sample cluster;
and the sample sampling unit is used for respectively determining target training samples for model training from each target training sample cluster according to the proportion information.
A computer device, comprising: the system comprises a processor and a memory, wherein the processor and the memory are connected through a communication bus; the processor is used for calling and executing the program stored in the memory; the memory is used for storing a program used for realizing the training sample determination method.
A computer-readable storage medium, having stored thereon a computer program which, when loaded and executed by a processor, carries out the steps of the training sample determination method.
The embodiments of the application provide a training sample determination method and device, computer equipment, and a storage medium. A target training sample is determined by dimension reduction for each training sample in a training sample set; at least one target reference sample cluster and proportion information are acquired, both obtained by clustering the target reference samples that result from dimension reduction of the reference samples in a reference sample set, where the reference samples are later in time than the training samples; all target training samples are clustered according to the at least one target reference sample cluster to obtain a target training sample cluster corresponding to each target reference sample cluster; and the target training samples used for model training are then determined from each target training sample cluster according to the proportion information. In this way, the similarity between the training sample set and the reference sample set is measured, and sampling the target training samples for model training from the training sample set keeps the customer group distribution consistent between model training and model application. This addresses the attenuation of model performance caused by the time difference between training a model and using it; at the same time, sampling reduces the size of the actual training set and thereby speeds up model training.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing the embodiments or the prior art are briefly introduced below. The drawings described below are only embodiments of the present invention; those skilled in the art can obtain other drawings from the provided drawings without creative effort.
FIG. 1 is a schematic diagram of a modeling process of a model provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of a simple and efficient modeling process provided by an embodiment of the present application;
FIG. 3 is a flowchart of a training sample determination method according to an embodiment of the present application;
FIG. 4 is a flowchart of a feature information determination model generation method according to an embodiment of the present application;
FIG. 5 is a schematic diagram of a Kmeans clustering sample provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of a training sample distribution of a training sample set according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a sampled target training sample according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a training sample determination apparatus according to an embodiment of the present application;
FIG. 9 is a block diagram of an implementation manner of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In the prior art, training samples are generally determined by screening all available training samples from the data according to business logic and using them for machine learning modeling to generate a model. Under the assumption that the training samples are independent and identically distributed, more training samples yield a better-trained model.
However, as the number of training samples increases, the training time of the model becomes longer. Moreover, the independent and identically distributed assumption may not hold in practice, which often degrades model performance.
This is especially true for a long-period model. Because the effect of such a model must be evaluated against actual user performance, there is a long time gap between the time range of the training samples used to build the model and the period in which the model is actually observed online. The customer group at the time the model is applied may therefore deviate from the customer group (the training samples) used to train it, which often causes model performance to attenuate.
For example, the long-period model may be a risk control model, for which data may need to be observed for at least one year before it can be used as a training sample. If the risk control model is trained on 2020.05.05, its training samples are data from no later than 2019.05.05, whereas the trained model is actually applied to risk control and risk indication on data after 2020.05.05. The data after 2020.05.05 deviates to some degree from the data before 2019.05.05, which often attenuates the model's risk control and risk indication performance on data after 2020.05.05.
Therefore, the embodiments of the application provide a training sample determination method and device, computer equipment, and a storage medium, which address the attenuation of long-period model performance caused by the time gap between building and applying a long-period model. By sampling the training sample set, the customer group used to train the long-period model is kept consistent in distribution with the customer group present when the model is actually applied, which reduces performance attenuation. Screening the training samples also reduces their number and thereby speeds up training of the long-period model.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
Before describing a training sample determination method provided in the embodiments of the present application in detail, technical terms to which the training sample determination method provided in the present application is applied will be described.
XGB (XGBoost): an ensemble learning algorithm.
Kmeans: an unsupervised clustering algorithm.
OOT (out-of-time) set: a data set that is not used for model training and whose samples are later in time than those of the training sample set used for model training.
The application provides a simple and efficient modeling process. Referring to FIG. 2, all features of a user (on the order of tens of thousands of dimensions) are first divided into several major classes, such as credit features, fraud features, and interest features. Machine learning modeling is then performed with XGB models, with one sub-model (a feature information determination model) built for each class of features. Next, the OOT set undergoes the same preprocessing (feature classification, then determination of feature information with the previously trained XGB sub-models), and an unsupervised clustering model is trained on the result. Finally, a new training sample set is formed by sampling from the training sample set in the same proportions as the clusters found in the OOT set.
A detailed description is given below to a flowchart of a training sample determination method provided in an embodiment of the present application, specifically referring to fig. 3.
As shown in fig. 3, the method includes:
S301, determining a target training sample after dimension reduction processing of each training sample in a training sample set;
In the embodiment of the application, the training sample set is composed of a plurality of training samples, and each training sample indicates a plurality of dimensional features of a user. Taking one training sample as an example, feature classification is performed on the dimensional features it indicates to obtain at least one feature group (different feature groups belong to different feature categories, and one dimensional feature belongs to only one feature group). Each feature group is then input into the pre-trained feature information determination model corresponding to the feature category to which the group belongs, to obtain feature information representing the group. The feature information of all feature groups indicated by the training sample constitutes the target training sample after dimension reduction.
Illustratively, a training sample indicates the features of one user, and the features of a user span tens of thousands of dimensions, denoted X:
$$X = \left(X^{(1)}, X^{(2)}, \ldots, X^{(m)}\right)$$
where X is the feature of a user, $X^{(m)}$ is the feature of the user in the m-th dimension, X consists of the features of the user in dimensions 1 through m, and m is a positive integer greater than or equal to 1.
A training sample includes features of a user in multiple dimensions, and the features of the user in each dimension can be considered as a dimension feature of the user indicated by the training sample, so that the training sample indicates multiple dimension features of the user.
A plurality of feature categories are preset, together with a correspondence between each feature category and dimensions: one feature category may correspond to one or more dimensions, while one dimension corresponds to only one feature category. Illustratively, the feature categories may be credit features, fraud features, interest features, and the like.
The plurality of dimensional features of the user indicated by a training sample are determined, and those dimensional features whose dimensions correspond to the same feature category are taken as one feature group. In this way, the dimensional features indicated by the training sample are classified into feature groups, which are referred to as the at least one feature group indicated by the training sample.
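The grouping of dimensional features by preset category can be sketched as follows. The mapping from dimension index to category, the category names, and the function name are hypothetical and serve only to illustrate that each dimension falls into exactly one feature group.

```python
from collections import defaultdict

# Hypothetical preset correspondence: dimension index -> feature category.
CATEGORY_OF_DIM = {0: "credit", 1: "credit", 2: "fraud", 3: "interest", 4: "fraud"}

def group_features(sample, category_of_dim):
    """Split one sample (a list of dimensional features) into feature groups."""
    groups = defaultdict(list)
    for dim, value in enumerate(sample):
        # Each dimension maps to exactly one category, hence one group.
        groups[category_of_dim[dim]].append(value)
    return dict(groups)

print(group_features([0.1, 0.2, 0.3, 0.4, 0.5], CATEGORY_OF_DIM))
# {'credit': [0.1, 0.2], 'fraud': [0.3, 0.5], 'interest': [0.4]}
```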
For example, for each of the at least one feature group indicated by the training sample, the feature category corresponding to the feature group is determined, and the feature group is input into the pre-trained feature information determination model corresponding to that category to obtain the feature information of the feature group. The feature information of the at least one feature group indicated by the training sample then constitutes the target training sample, which may be regarded as the result of dimension reduction of the training sample.
In the embodiment of the application, for each feature type, a feature information determination model corresponding to the feature type is preset, and different feature types correspond to different feature information determination models.
Fig. 4 is a flowchart of a feature information determination model generation method according to an embodiment of the present application.
As shown in fig. 4, the method includes:
S401, determining a first sample for training a feature information determination model;
In the embodiment of the application, first samples for training the feature information determination model are obtained. One first sample indicates the features of one user, and the dimensions of the features characterizing the user in the first sample are the same as those in the training sample. Accordingly, the first sample also indicates a plurality of dimensional features of the user.
S402, carrying out feature classification on the multiple dimensional features indicated by the first sample to obtain at least one feature group indicated by the first sample;
For example, the multiple dimensional features indicated by the first sample are classified to obtain at least one feature group indicated by the first sample. The classification is performed in the same way as the feature classification of the dimensional features indicated by a training sample described above and is not detailed again here.
S403, generating a second sample according to the feature group which is indicated by the first sample and belongs to the target feature category;
The feature category corresponding to the feature information determination model to be generated is determined; for ease of distinction, it is called the target feature category in the embodiments of the application. A second sample is then generated from the feature group, indicated by the first sample, that belongs to the target feature category.
S404, training the characteristic information determination model to be trained by using the second sample to generate a characteristic information determination model corresponding to the target characteristic category.
In the embodiment of the application, the feature information determination model to be trained is trained with the objective that its prediction of the feature information for the second sample approaches the target feature information carried by the second sample, thereby generating the feature information determination model corresponding to the target feature category.
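Steps S401 to S404 can be sketched as below. This is an illustration under stated assumptions, not the method of the application: a small logistic regression fitted by gradient descent is swapped in for the feature information determination model so that the example needs no third-party library, binary labels stand in for the target feature information carried by each second sample, and all function names and data are hypothetical.

```python
import math

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_second_samples(first_samples, labels, dims):
    """Keep only the dimensions of the target feature category, paired with the label."""
    return [([x[d] for d in dims], y) for x, y in zip(first_samples, labels)]

def train_scorer(second_samples, epochs=500, lr=0.5):
    """Fit a logistic-regression stand-in; returns a scorer with outputs in (0, 1)."""
    n = len(second_samples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, y in second_samples:
            # Gradient step that pulls the prediction toward the carried label.
            err = _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return lambda x: _sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

# Hypothetical first samples with two dimensions; only dimension 0 belongs
# to the target feature category.
first = [[0.0, 9.0], [0.1, 3.0], [0.9, 7.0], [1.0, 1.0]]
second = make_second_samples(first, [0, 0, 1, 1], dims=[0])
scorer = train_scorer(second)
```

One scorer of this kind would be trained per feature category, each on second samples built from the dimensions of that category.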
For example, the feature information determination model may be an XGB model; inputting a feature group into the model ultimately outputs a value between 0 and 1. Each feature category thus yields one value in the range 0 to 1, which reduces the training sample X (the features of a user in tens of thousands of dimensions) to a low-dimensional vector whose dimension is determined by the number of feature categories, for example $X_{new}$ with n dimensions (n << m). $X_{new}$ is the target training sample obtained by dimension reduction of the training sample X:
$$X_{new} = \left(X_{new}^{(1)}, X_{new}^{(2)}, \ldots, X_{new}^{(n)}\right), \quad n \ll m$$
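The reduction from m dimensions to one value per feature category can be sketched as follows. The per-category scorers here are hypothetical stand-ins (a plain mean over features assumed to lie in [0, 1]) for the trained feature information determination models; the dimension layout and function name are likewise assumptions.

```python
def reduce_sample(sample, dims_of_category, scorers):
    """Map a high-dimensional sample to one score per feature category."""
    return [
        scorers[category]([sample[d] for d in dims])
        for category, dims in dims_of_category.items()
    ]

# Hypothetical layout: m = 5 dimensions, n = 3 feature categories.
DIMS_OF_CATEGORY = {"credit": [0, 1], "fraud": [2, 4], "interest": [3]}
# Stand-in scorers; a real pipeline would use one trained model per category.
SCORERS = {c: (lambda group: sum(group) / len(group)) for c in DIMS_OF_CATEGORY}

x_new = reduce_sample([0.25, 0.25, 1.0, 0.5, 0.0], DIMS_OF_CATEGORY, SCORERS)
print(x_new)  # [0.25, 0.5, 0.5]
```

The length of the output equals the number of feature categories, regardless of how many dimensions the input sample has.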
S302, at least one target reference sample cluster and proportion information obtained by clustering target reference samples subjected to dimensionality reduction processing on reference samples in a reference sample set are obtained, the proportion information represents the proportion of the number of the target reference samples in different target reference sample clusters, and the reference samples are later than training samples;
In the embodiment of the present application, a reference sample set is determined. The reference sample set may be regarded as the OOT set described above, and the samples in it are referred to as reference samples. The reference samples are not used for model training; their generation time is later than that of the training samples and earlier than the time at which the trained model is actually applied.
One reference sample indicates the features of one user, and the dimensions of the features in the reference sample for characterizing the user are the same as the dimensions of the features in the training sample for characterizing the user. Accordingly, the reference sample also indicates a plurality of dimensional features of the user.
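The timing constraint on reference samples can be sketched as a split by generation time. The record layout, the function name, and the cutoff dates are hypothetical and chosen only for illustration.

```python
from datetime import date

def split_by_time(records, train_end, apply_start):
    """Split (generation_date, sample) records into training and reference samples.

    Training samples are generated no later than `train_end`; reference
    samples are generated after `train_end` and before `apply_start`.
    """
    training = [s for t, s in records if t <= train_end]
    reference = [s for t, s in records if train_end < t < apply_start]
    return training, reference

records = [
    (date(2019, 1, 1), "a"),
    (date(2019, 12, 1), "b"),
    (date(2020, 6, 1), "c"),  # generated after application starts: in neither set
]
training, reference = split_by_time(records, date(2019, 5, 5), date(2020, 5, 5))
print(training, reference)  # ['a'] ['b']
```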
Dimension reduction is performed on each reference sample in the reference sample set, yielding one target reference sample per reference sample. All target reference samples are then clustered to obtain at least one target reference sample cluster and the proportion information, which represents the ratio between the numbers of target reference samples in a first reference sample cluster and a second reference sample cluster of the at least one target reference sample cluster. The first reference sample cluster is any one of the at least one target reference sample cluster, and the second reference sample cluster is any other target reference sample cluster different from the first.
Take the case where the at least one target reference sample cluster consists of three clusters: target reference sample cluster 1, target reference sample cluster 2, and target reference sample cluster 3. The number of target reference samples in each cluster is determined (target reference sample number 1, target reference sample number 2, and target reference sample number 3, respectively), and the proportion information may then be target reference sample number 1 : target reference sample number 2 : target reference sample number 3.
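The three-cluster example can be sketched in a few lines. The cluster labels below are hypothetical, and the function name is an assumption; the sketch only counts target reference samples per cluster and formats the counts as a ratio.

```python
from collections import Counter

def proportion_info(cluster_labels):
    """Count target reference samples per cluster and format the ratio."""
    counts = dict(sorted(Counter(cluster_labels).items()))
    ratio = ":".join(str(n) for n in counts.values())
    return counts, ratio

# Hypothetical cluster label of each target reference sample.
counts, ratio = proportion_info([1, 1, 2, 3, 3, 3])
print(counts)  # {1: 2, 2: 1, 3: 3}
print(ratio)   # 2:1:3
```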
Illustratively, after dimension reduction is performed on each training sample in the training sample set, the target training sample X_new corresponding to each training sample X is obtained. An OOT set is then selected, dimension reduction is performed on the reference samples in the OOT set to obtain target reference samples, and the target reference samples in the OOT set are clustered with Kmeans to obtain a clustering result, which can be regarded as a Kmeans model. The goal of clustering is to assign all target reference samples to k clusters C = {C_1, C_2, ..., C_k}, each of which may be referred to as a target reference sample cluster. The center point of each cluster is μ_i.
Figure BDA0002783201150000101
Here x is the vector representation of each reference sample, and μ_i is the vector representation of the center of each target reference sample cluster.
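The formula image referenced above is not reproduced in this text. For orientation only, the standard Kmeans objective and center definition are given below; that the image shows this form is an assumption, not a transcription, and the symbol E is introduced here for illustration.

```latex
E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert_2^2,
\qquad
\mu_i = \frac{1}{\lvert C_i \rvert} \sum_{x \in C_i} x
```

Minimizing E places each center μ_i at the mean of the samples assigned to cluster C_i.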
FIG. 5 shows an example of Kmeans clustering in which all target reference samples are divided into four target reference sample clusters C_1, C_2, C_3, and C_4, whose center points are μ_1 = (-1, -1), μ_2 = (0, 0), μ_3 = (1, 1), and μ_4 = (2, 2), respectively. It can be seen that the number of target reference samples assigned to C_3 and C_4 is significantly greater than the number assigned to the other two target reference sample clusters.
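A clustering step of this kind can be sketched with plain Lloyd iterations. The code below is an illustrative stand-in for whatever Kmeans implementation the embodiment uses: seeding the centers with the first k points, the fixed iteration count, and the sample coordinates are all assumptions made for the example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain Lloyd iterations; the first k points seed the centers (an assumption)."""
    centers = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster ends up empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return centers, labels


# Hypothetical two-dimensional target reference samples forming two groups.
reference = [(-1.0, -1.1), (1.0, 1.1), (-1.2, -0.9), (0.9, 1.0), (1.1, 0.9)]
centers, labels = kmeans(reference, k=2)
print(centers, labels)
```

The returned centers play the role of the center points μ_i, and counting the labels gives the cluster sizes from which the proportion information is derived.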
S303, clustering all target training samples according to at least one target reference sample cluster to obtain target training sample clusters corresponding to each target reference sample cluster respectively;
Illustratively, the Kmeans model trained in the previous step is used to predict on the target training samples: every target training sample, such as x_i, is assigned to one of the k target reference sample clusters by calculating the distance between x_i and each target reference sample cluster, and the target reference sample cluster at the minimum distance is the target reference sample cluster to which the target training sample belongs.
prediction(x_i) = min[dist(x_i, C_1), dist(x_i, C_2), ..., dist(x_i, C_k)], where dist(x_i, C_k) is the distance from target training sample x_i to the k-th target reference sample cluster, and prediction(x_i) denotes selecting, from all target reference sample clusters, the one closest to target training sample x_i. The selected target reference sample cluster can be regarded as the target reference sample cluster to which the target training sample belongs.
It should be noted that the distance between a target training sample and the center point of a target reference sample cluster may be regarded as the distance between the target training sample and that target reference sample cluster.
In this way, for each target reference sample cluster, a target training sample cluster corresponding to the target reference sample cluster may be formed from all target training samples belonging to the target reference sample cluster.
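The assignment described above can be sketched as a nearest-center lookup. The function name and the training points below are hypothetical; the four center points are the ones given in the FIG. 5 example, and squared Euclidean distance is assumed as the distance measure.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_to_reference_clusters(train_samples, centers):
    """Group target training samples by the index of their nearest reference center."""
    clusters = {j: [] for j in range(len(centers))}
    for x in train_samples:
        nearest = min(range(len(centers)), key=lambda j: sq_dist(x, centers[j]))
        clusters[nearest].append(x)
    return clusters


# Center points of the four target reference sample clusters (FIG. 5 example).
centers = [(-1.0, -1.0), (0.0, 0.0), (1.0, 1.0), (2.0, 2.0)]
# Hypothetical target training samples.
train = [(-0.9, -1.2), (0.1, 0.2), (1.8, 2.1), (0.8, 0.9), (2.4, 1.9)]
clusters = assign_to_reference_clusters(train, centers)
print(clusters)
```

Each dictionary entry is one target training sample cluster, keyed by the index of the target reference sample cluster it corresponds to.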
S304, determining target training samples for model training from each target training sample cluster according to the proportion information.
In the embodiment of the application, target training samples for the long-period model training may be determined from each target training sample cluster according to the proportion information as follows: the sampling number of each target training sample cluster is determined according to the proportion information and the number of target training samples in each target training sample cluster; then, from each target training sample cluster, that cluster's sampling number of target training samples is extracted in order of distance from the center point of the target training sample cluster, from near to far.
Illustratively, determining the sampling number of each target training sample cluster according to the proportion information and the number of target training samples in each target training sample cluster includes: determining the number of target training samples in each target training sample cluster; and determining the sampling number of each target training sample cluster according to the number of target training samples in each target training sample cluster. The ratio between the sampling number of a first target training sample cluster and the sampling number of a second target training sample cluster is the same as the ratio, represented by the proportion information, between the number of target reference samples in the target reference sample cluster corresponding to the first target training sample cluster and the number of target reference samples in the target reference sample cluster corresponding to the second target training sample cluster.
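The ratio constraint can be written compactly. The symbols are introduced here for illustration and do not appear in the original: s_a and s_b denote the sampling numbers of two target training sample clusters, and m_a and m_b denote the numbers of target reference samples in the corresponding target reference sample clusters.

```latex
\frac{s_a}{s_b} = \frac{m_a}{m_b}
\quad \text{for any two clusters } a \neq b,
\qquad \text{e.g.}\quad
m_1 : m_2 : m_3 = 1 : 2 : 3
\;\Rightarrow\;
s_1 : s_2 : s_3 = 1 : 2 : 3
```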
The training sample distribution of the training sample set is shown in FIG. 6. The four target training sample clusters obtained after clustering the target training samples of the training sample set are distributed more uniformly, which differs considerably from the OOT set. Because the proportion of each target reference sample cluster in the OOT set is not consistent with the proportion of training samples assigned to each target training sample cluster, the proportions of target training samples in the different target training sample clusters of the training sample set must be made consistent with those of the OOT set. This is done by sampling from all target training samples in the training sample set according to the same proportions, with sampling priority running from near to far from the center point of a target reference sample cluster.
Taking target training sample cluster C_1 as an example, the target training sample cluster may be represented as:
Figure BDA0002783201150000111
where
Figure BDA0002783201150000112
can be regarded as the distance between a target training sample x_n belonging to target training sample cluster C_1 and target training sample cluster C_1, so that when target training samples are extracted from target training sample cluster C_1, the target training samples closest to target training sample cluster C_1 are extracted preferentially.
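The near-to-far extraction can be sketched as a sort by distance followed by a cut at the sampling number. The function name, the `quota` parameter, and the coordinates are hypothetical, and squared Euclidean distance to the cluster center point is assumed as the distance.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_to_center(cluster_samples, center, quota):
    """Return up to `quota` samples, ordered from nearest to farthest from `center`."""
    ranked = sorted(cluster_samples, key=lambda x: sq_dist(x, center))
    return ranked[:quota]


# Hypothetical target training samples assigned to cluster C_1, center at (-1, -1).
cluster_1 = [(0.5, 0.5), (-1.1, -1.0), (-2.0, -2.0), (-0.8, -1.0)]
picked = nearest_to_center(cluster_1, center=(-1.0, -1.0), quota=2)
print(picked)
```

Sorting by the squared distance gives the same order as sorting by the distance itself, so the square root can be skipped.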
When the number of the target training samples in a certain target training sample cluster is less than the required sampling number, the absolute value of the sampling number in other target training sample clusters is reduced, so that the proportion of the number of the target training samples extracted from each target training sample cluster is consistent with the OOT set. The sampled target training samples can be shown in fig. 7, and it can be seen from fig. 7 that the proportion of the number of target training samples in each target training sample cluster formed by the sampled target training samples is basically consistent with the OOT set.
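One way to compute sampling numbers that respect both the proportions and the shortfall rule is sketched below. It is an illustration under assumptions, not the embodiment's stated procedure: the function name and the counts are hypothetical, and rounding down to whole samples means the ratios hold only approximately for small clusters.

```python
from fractions import Fraction


def sampling_quotas(reference_counts, available_counts):
    """Largest per-cluster quotas that follow the reference proportions
    without exceeding what any target training sample cluster can supply."""
    # Samples obtainable per reference sample, limited by the scarcest cluster.
    scale = min(
        Fraction(available_counts[c], reference_counts[c])
        for c in reference_counts
        if reference_counts[c] > 0
    )
    # Exact rational arithmetic avoids float rounding before truncating to int.
    return {c: int(scale * reference_counts[c]) for c in reference_counts}


# Hypothetical counts: the reference (OOT) set is skewed toward clusters 3 and 4,
# while the training set is spread evenly over the four clusters.
reference_counts = {1: 10, 2: 10, 3: 40, 4: 40}
available_counts = {1: 50, 2: 50, 3: 50, 4: 50}
quotas = sampling_quotas(reference_counts, available_counts)
print(quotas)
```

Here clusters 3 and 4 are the scarce ones relative to their reference share, so they are used in full and the quotas of clusters 1 and 2 are reduced to keep the proportions.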
It should be noted that other feature information determination models and other unsupervised modeling methods may also be used, so the approach has high universality: any supervised classifier, such as logistic regression, a decision tree, or a random forest, can be used to train the feature information determination model, and the unsupervised clustering can use t-SNE, DBSCAN, and the like.
According to the embodiment of the application, after the target training samples for model training are determined, model training can be performed on them to generate a model, and the generated model is then applied to actual prediction. If the model generated in this way is a risk control model, the training sample determination method provided by the embodiment of the application can address the problem that the behavior of the population modeled in big-data risk control drifts over time. The method can measure the degree of similarity between the training sample set and the reference sample set and, through sampling, ensures that the customer group distribution of the real training set (namely, the set formed by the determined target training samples for model training) is consistent with the customer group distribution when the model is applied. This addresses the attenuation of model effect caused by the time difference between training the model and applying the model, and the sampling also reduces the size of the real training set, thereby accelerating model training.
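The text does not state how the degree of similarity is quantified. One simple possibility, shown purely as an assumption, is to compare the per-cluster shares of the two sets using half the L1 distance between the share vectors (the total variation distance). The function names and the counts are hypothetical.

```python
def cluster_shares(counts):
    """Convert per-cluster counts into per-cluster shares that sum to 1."""
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}


def share_gap(train_counts, reference_counts):
    """Half the L1 distance between share vectors: 0 means identical proportions."""
    p = cluster_shares(train_counts)
    q = cluster_shares(reference_counts)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in set(p) | set(q))


reference_counts = {1: 10, 2: 10, 3: 40, 4: 40}
# Hypothetical cluster sizes of the training set before and after sampling.
before = share_gap({1: 25, 2: 25, 3: 25, 4: 25}, reference_counts)
after = share_gap({1: 12, 2: 12, 3: 50, 4: 50}, reference_counts)
print(before, after)
```

A gap close to zero after sampling indicates that the sampled training set follows the cluster proportions of the reference set.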
Fig. 8 is a schematic structural diagram of a training sample determination apparatus according to an embodiment of the present application. As shown in fig. 8, the apparatus includes:
a target training sample determining unit 81, configured to determine a target training sample after dimension reduction processing of each training sample in a training sample set;
a target reference sample cluster determining unit 82, configured to obtain at least one target reference sample cluster and proportion information, where the at least one target reference sample cluster is obtained by clustering the target reference samples produced by dimensionality reduction of the reference samples in a reference sample set, the proportion information represents the proportion between the numbers of target reference samples in a first reference sample cluster and a second reference sample cluster of the at least one target reference sample cluster, and the reference samples are later than the training samples;
a target training sample cluster determining unit 83, configured to perform clustering processing on all target training samples according to at least one target reference sample cluster to obtain target training sample clusters corresponding to each target reference sample cluster;
and the sample sampling unit 84 is used for respectively determining target training samples for model training from each target training sample cluster according to the proportion information.
In this embodiment, preferably, the target training sample determining unit includes:
the training sample set determining unit is used for determining a training sample set formed by a plurality of training samples, and the training samples indicate a plurality of dimensional characteristics of a user;
the characteristic classification unit is used for carrying out characteristic classification on a plurality of dimensional characteristics indicated by the training samples to obtain at least one characteristic group indicated by the training samples, and different characteristic groups belong to different characteristic categories; one dimension feature belongs to only one feature group;
the characteristic information determining unit is used for inputting the characteristic group into a pre-trained characteristic information determining model corresponding to the characteristic category to which the characteristic group belongs to obtain characteristic information used for representing the characteristic group;
and the feature information of at least one feature group indicated by the training sample forms a target training sample after the dimension reduction processing of the training sample.
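The grouping-based dimension reduction performed by these units can be sketched as follows. The feature names, the category names, and the stand-in models (simple averages rather than trained feature information determination models) are all hypothetical and serve only to show the data flow.

```python
def reduce_sample(sample, feature_groups, group_models):
    """Map one raw sample to one value per feature group.

    sample:         {feature_name: value}
    feature_groups: {category: [feature_name, ...]}, each feature in exactly one group
    group_models:   {category: callable taking the group's values, returning a number}
    """
    seen = [name for names in feature_groups.values() for name in names]
    if len(seen) != len(set(seen)):
        raise ValueError("a feature may belong to only one feature group")
    return [
        group_models[category]([sample[name] for name in names])
        for category, names in feature_groups.items()
    ]


# Hypothetical feature names, grouped into two feature categories.
groups = {
    "credit": ["overdue_count", "credit_lines"],
    "behavior": ["logins", "sessions"],
}
# Stand-in models: plain averages, NOT trained models.
models = {
    "credit": lambda values: sum(values) / len(values),
    "behavior": lambda values: sum(values) / len(values),
}
user = {"overdue_count": 1, "credit_lines": 3, "logins": 10, "sessions": 20}
target_sample = reduce_sample(user, groups, models)
print(target_sample)
```

A sample with four dimensional features is reduced to a two-dimensional target training sample, one value per feature group.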
The training sample determination device provided in the embodiment of the present application further includes a feature information determination model generation unit, where the feature information determination model generation unit includes:
a first sample determination unit for determining a first sample for training the feature information determination model;
the characteristic group determining unit is used for carrying out characteristic classification on the plurality of dimensional characteristics indicated by the first sample to obtain at least one characteristic group indicated by the first sample;
a second sample determination unit, configured to generate a second sample according to the feature group indicated by the first sample and belonging to the target feature class;
and the model training unit is used for training the characteristic information determination model to be trained by utilizing the second sample to generate a characteristic information determination model corresponding to the target characteristic category.
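The flow through these units can be sketched with logistic regression, one of the supervised classifiers the description names as usable. The sample layout (a `features` dictionary plus a binary `label`), the feature names, and the hand-written batch gradient descent are assumptions made to keep the example self-contained; they are not the embodiment's stated implementation.

```python
import math


def make_second_samples(first_samples, group_features):
    """Keep only the features of the target feature category, plus the label."""
    return [
        ([s["features"][name] for name in group_features], s["label"])
        for s in first_samples
    ]


def train_logistic(second_samples, epochs=500, lr=0.5):
    """Batch gradient descent on the log loss; returns a scoring function."""
    dim = len(second_samples[0][0])
    w, b = [0.0] * dim, 0.0
    n = len(second_samples)

    def score(x):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        return 1.0 / (1.0 + math.exp(-z))

    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in second_samples:
            err = score(x) - y                 # derivative of the log loss w.r.t. z
            for i, xi in enumerate(x):
                grad_w[i] += err * xi
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return score


# Hypothetical first samples; "logins" belongs to a different feature category.
first_samples = [
    {"features": {"overdue_count": 0.0, "credit_lines": 1.0, "logins": 5.0}, "label": 0},
    {"features": {"overdue_count": 0.0, "credit_lines": 0.8, "logins": 2.0}, "label": 0},
    {"features": {"overdue_count": 1.0, "credit_lines": 0.2, "logins": 4.0}, "label": 1},
    {"features": {"overdue_count": 1.0, "credit_lines": 0.0, "logins": 3.0}, "label": 1},
]
second = make_second_samples(first_samples, ["overdue_count", "credit_lines"])
credit_model = train_logistic(second)
print(credit_model([1.0, 0.1]), credit_model([0.0, 0.9]))
```

The returned scoring function maps a feature group to a single number, which is the kind of feature information used as one dimension of the reduced sample.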
In this embodiment, preferably, the target reference sample cluster determining unit includes:
the target reference sample determining unit is used for determining a target reference sample after dimension reduction processing of each reference sample in the reference sample set;
the clustering unit is used for clustering all the target reference samples to obtain at least one target reference sample cluster;
and the generating unit is used for generating the proportion information according to the number of the target reference samples in each target reference sample cluster.
In this embodiment of the application, preferably, the target training sample cluster determining unit includes:
the calculating unit is used for calculating the distance between the target training sample and the central point of each target reference sample cluster in the at least one target reference sample cluster;
the first determining unit is used for determining a target reference sample cluster to which the target training sample belongs according to the distance between the target training sample and the central point of each target reference sample cluster in at least one target reference sample cluster;
and the second determining unit is used for determining all target training samples belonging to the same target reference sample cluster as a target training sample cluster corresponding to the target reference sample cluster.
In this embodiment, preferably, the sample sampling unit includes:
the third determining unit is used for determining the sampling number of each target training sample cluster according to the proportion information and the number of target training samples in each target training sample cluster;
and the sampling unit is used for extracting the target training samples of the sampling number of the target training sample cluster from the target training sample cluster according to the sequence of the distance from the central point of the target training sample cluster from near to far.
In this embodiment of the application, preferably, the third determining unit includes:
the first determining subunit is used for determining the number of the target training samples in each target training sample cluster;
the second determining subunit is used for respectively determining the sampling quantity of each target training sample cluster according to the quantity of the target training samples in each target training sample cluster;
the ratio between the sampling number of the first target training sample cluster and the sampling number of the second target training sample cluster is the same as the ratio between the number of the target reference samples in the target reference sample cluster corresponding to the first target training sample cluster and the number of the target reference samples in the target reference sample cluster corresponding to the second target training sample cluster represented by the ratio information.
As shown in fig. 9, a block diagram of an implementation manner of a computer device provided in an embodiment of the present application is shown, where the computer device includes:
a memory 901 for storing a program;
a processor 902 for executing a program, the program specifically for:
determining a target training sample after dimension reduction processing of each training sample in a training sample set;
acquiring at least one target reference sample cluster and proportion information obtained by clustering target reference samples subjected to dimensionality reduction processing on reference samples in a reference sample set, wherein the proportion information represents the proportion of the number of the target reference samples in a first reference sample cluster and a second reference sample cluster in the at least one target reference sample cluster, and the reference samples are later than training samples;
clustering all target training samples according to at least one target reference sample cluster to obtain target training sample clusters corresponding to each target reference sample cluster respectively;
and respectively determining target training samples for model training from each target training sample cluster according to the proportion information.
The processor 902 may be a central processing unit (CPU) or an application-specific integrated circuit (ASIC).
The computer device may further comprise a communication interface 903 and a communication bus 904, wherein the memory 901, the processor 902 and the communication interface 903 communicate with each other via the communication bus 904.
The embodiment of the present application further provides a readable storage medium, where a computer program is stored, and the computer program is loaded and executed by a processor to implement each step of the training sample determination method, where a specific implementation process may refer to descriptions of corresponding parts in the foregoing embodiment, and details are not repeated in this embodiment.
The embodiment of the application provides a training sample determination method, a training sample determination device, computer equipment and a storage medium. A target training sample is determined for each training sample in a training sample set through dimension reduction processing; at least one target reference sample cluster and proportion information are acquired, the at least one target reference sample cluster being obtained by clustering the target reference samples produced by dimensionality reduction of the reference samples in a reference sample set, where the reference samples are later than the training samples; all target training samples are clustered according to the at least one target reference sample cluster to obtain the target training sample cluster corresponding to each target reference sample cluster; and target training samples for model training are then determined from each target training sample cluster according to the proportion information. This measures the degree of similarity between the training sample set and the reference sample set and, by sampling the target training samples for model training from the training sample set, ensures that the customer group distribution is consistent between model training and model application. It thereby addresses the attenuation of model effect caused by the time difference between training the model and applying the model, and the sampling also reduces the size of the real training set and so accelerates model training.
The training sample determination method, the training sample determination device, the computer device and the storage medium provided by the present invention are described in detail above, and a specific example is applied in the present document to illustrate the principle and the implementation manner of the present invention, and the description of the above embodiment is only used to help understanding the method and the core idea of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, there may be variations in the specific embodiments and the application scope, and in summary, the content of the present specification should not be construed as a limitation to the present invention.
It should be noted that, in the present specification, the embodiments are all described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments may be referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
It is further noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for determining training samples, comprising:
determining a target training sample after dimension reduction processing of each training sample in a training sample set;
acquiring at least one target reference sample cluster and proportion information, wherein the target reference sample cluster is obtained by clustering target reference samples subjected to dimensionality reduction processing on reference samples in a reference sample set, the proportion information represents the proportion of the number of target reference samples in a first reference sample cluster and a second reference sample cluster in the at least one target reference sample cluster, and the reference samples are later than the training samples;
clustering all the target training samples according to the at least one target reference sample cluster to obtain target training sample clusters corresponding to the target reference sample clusters respectively;
and respectively determining target training samples for model training from each target training sample cluster according to the proportion information.
2. The method of claim 1, wherein the determining the target training sample after the dimension reduction processing of each training sample in the training sample set comprises:
determining a training sample set composed of a plurality of training samples, the training samples being indicative of a plurality of dimensional features of a user;
performing feature classification on the multiple dimensional features indicated by the training sample to obtain at least one feature group indicated by the training sample, wherein different feature groups belong to different feature classes; one dimension feature belongs to only one feature group;
inputting the feature group into a pre-trained feature information determination model corresponding to the feature class to which the feature group belongs to obtain feature information for representing the feature group;
and the feature information of at least one feature group indicated by the training sample constitutes the target training sample after the dimension reduction processing of the training sample.
3. The method according to claim 2, wherein the generating of the feature information determination model corresponding to the target feature class includes:
determining a first sample for training the feature information determination model;
performing feature classification on a plurality of dimensional features indicated by the first sample to obtain at least one feature group indicated by the first sample;
generating a second sample from the group of features indicated by the first sample that belong to the target feature class;
and training the characteristic information determination model to be trained by using the second sample to generate a characteristic information determination model corresponding to the target characteristic category.
4. The method according to claim 1, wherein the obtaining of at least one target reference sample cluster and proportion information obtained by clustering target reference samples subjected to the dimensionality reduction processing on the reference samples in the reference sample set comprises:
determining a target reference sample after dimension reduction processing of each reference sample in a reference sample set;
clustering all the target reference samples to obtain at least one target reference sample cluster;
and generating proportion information according to the number of the target reference samples in each target reference sample cluster.
5. The method according to claim 1, wherein the clustering all the target training samples according to the at least one target reference sample cluster to obtain a target training sample cluster corresponding to each target reference sample cluster respectively comprises:
calculating the distance between the target training sample and the central point of each target reference sample cluster in the at least one target reference sample cluster respectively;
determining a target reference sample cluster to which the target training sample belongs according to the distance between the target training sample and the central point of each target reference sample cluster in the at least one target reference sample cluster;
and determining all target training samples belonging to the same target reference sample cluster as a target training sample cluster corresponding to the target reference sample cluster.
6. The method of claim 5, wherein the determining target training samples for model training from each of the target training sample clusters according to the proportion information comprises:
determining the sampling number of each target training sample cluster according to the proportion information and the number of target training samples in each target training sample cluster;
and extracting, from each target training sample cluster, that cluster's sampling number of target training samples in order of distance from the central point of the target training sample cluster, from near to far.
7. The method according to claim 6, wherein the determining the sampling number of each target training sample cluster according to the proportion information and the number of target training samples in each target training sample cluster comprises:
determining the number of target training samples in each target training sample cluster;
respectively determining the sampling number of each target training sample cluster according to the number of target training samples in each target training sample cluster;
wherein the ratio between the sampling number of a first target training sample cluster and the sampling number of a second target training sample cluster is the same as the ratio, represented by the proportion information, between the number of target reference samples in the target reference sample cluster corresponding to the first target training sample cluster and the number of target reference samples in the target reference sample cluster corresponding to the second target training sample cluster.
8. A training sample determination apparatus, comprising:
the target training sample determining unit is used for determining a target training sample after dimension reduction processing of each training sample in the training sample set;
the target reference sample cluster determining unit is used for acquiring at least one target reference sample cluster and proportion information, wherein the target reference sample cluster is obtained by clustering target reference samples subjected to dimensionality reduction processing on reference samples in a reference sample set, the proportion information represents the proportion of the number of target reference samples in a first reference sample cluster and a second reference sample cluster in the at least one target reference sample cluster, and the reference samples are later than the training samples;
a target training sample cluster determining unit, configured to perform clustering processing on all the target training samples according to the at least one target reference sample cluster to obtain target training sample clusters corresponding to each target reference sample cluster;
and the sample sampling unit is used for respectively determining target training samples for model training from each target training sample cluster according to the proportion information.
9. A computer device, comprising a processor and a memory, wherein the processor and the memory are connected through a communication bus; the processor is configured to call and execute a program stored in the memory; and the memory is configured to store a program for implementing the training sample determination method as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium, having stored thereon a computer program which, when being loaded and executed by a processor, carries out the steps of the training sample determination method according to any one of claims 1 to 7.
CN202011288666.3A 2020-11-17 2020-11-17 Training sample determination method and device, computer equipment and storage medium Pending CN112200271A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011288666.3A CN112200271A (en) 2020-11-17 2020-11-17 Training sample determination method and device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011288666.3A CN112200271A (en) 2020-11-17 2020-11-17 Training sample determination method and device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN112200271A true CN112200271A (en) 2021-01-08

Family

ID=74033620

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011288666.3A Pending CN112200271A (en) 2020-11-17 2020-11-17 Training sample determination method and device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN112200271A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114418752A (en) * 2022-03-28 2022-04-29 北京芯盾时代科技有限公司 Method and device for processing user data without type label, electronic equipment and medium
CN116821724A (en) * 2023-08-22 2023-09-29 腾讯科技(深圳)有限公司 Multimedia processing network generation method, multimedia processing method and device
CN116821724B (en) * 2023-08-22 2023-12-12 腾讯科技(深圳)有限公司 Multimedia processing network generation method, multimedia processing method and device

CN110913033A (en) IDC IP address allocation method based on CNN convolutional neural network learning
CN114330924B (en) Complex product change strength prediction method based on generative adversarial network
CN113515383B (en) System resource data distribution method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: B7-7-2, Yuxing Plaza, No.5, Huangyang Road, Yubei District, Chongqing

Applicant after: Chongqing Duxiaoman Youyang Technology Co., Ltd.

Address before: 201800 Room J1328, 3/F, Building 8, 55 Huiyuan Road, Jiading District, Shanghai

Applicant before: Shanghai Youyang New Media Information Technology Co., Ltd.

TA01 Transfer of patent application right

Effective date of registration: 20211220

Address after: 100193 Room 606, 6/F, Building 4, West District, Courtyard 10, Xibeiwang East Road, Haidian District, Beijing

Applicant after: Du Xiaoman Technology (Beijing) Co., Ltd.

Address before: B7-7-2, Yuxing Plaza, No.5, Huangyang Road, Yubei District, Chongqing

Applicant before: Chongqing Duxiaoman Youyang Technology Co., Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20210108