CN111325291A - Entity object classification method for selectively integrating heterogeneous models and related equipment - Google Patents


Info

Publication number
CN111325291A
Authority
CN
China
Prior art keywords
base classifier
base
combination
weight
classifiers
Prior art date
Legal status
Granted
Application number
CN202010409750.XA
Other languages
Chinese (zh)
Other versions
CN111325291B (en)
Inventor
Zhang Yalin (张雅淋)
Current Assignee
Alipay Hangzhou Information Technology Co Ltd
Original Assignee
Alipay Hangzhou Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Alipay Hangzhou Information Technology Co Ltd filed Critical Alipay Hangzhou Information Technology Co Ltd
Priority to CN202010409750.XA priority Critical patent/CN111325291B/en
Publication of CN111325291A publication Critical patent/CN111325291A/en
Application granted granted Critical
Publication of CN111325291B publication Critical patent/CN111325291B/en

Classifications

    • G06F18/285 Selection of pattern recognition techniques, e.g. of classifiers in a multi-classifier system
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques


Abstract

The entity object classification system for selectively integrating heterogeneous models provided by one or more embodiments of the present specification offers a solution for the selective integration of heterogeneous models: heterogeneous base classifiers are included in ensemble learning; in a learning stage, each type of base classifier is trained under different parameter combinations to obtain a plurality of models; and in a selection stage, one or more of those models are selected as components of the final model. In this way, the respective strengths of the different models can be fully exploited and made complementary, the robustness and effectiveness of the overall model are improved, and entity object classification can be completed well.

Description

Entity object classification method for selectively integrating heterogeneous models and related equipment
Technical Field
The present disclosure relates to the field of data processing technologies, and in particular, to an entity object classification method and related devices for selectively integrating heterogeneous models.
Background
In Internet application scenarios, a large amount of data needs to be analyzed every day, and machine learning, as a technical means, is playing a role in more and more scenarios. For a given task, ensemble learning is often a good choice for achieving good deployment results, and it is often feasible to improve overall generalization performance by integrating multiple different models.
However, conventional model integration simply averages the outputs of all trained base classifiers to obtain the final prediction result, which often fails to achieve a good effect and suffers from large storage overhead and long prediction time. Selective integration is a way to alleviate this problem: by selecting from and reasonably combining all candidate models, a better overall effect can often be achieved while model storage overhead and prediction time are greatly reduced. There is therefore a need for a faster or more reliable model integration scheme.
Disclosure of Invention
In view of the above, an object of one or more embodiments of the present disclosure is to provide a method and related apparatus for entity object classification with selective integration of heterogeneous models, so as to solve the above problems.
In view of the above, one or more embodiments of the present specification provide an entity object classification method for selectively integrating heterogeneous models, including:
acquiring a training data set and a verification data set; the training dataset and the validation dataset comprise entity object data;
training to obtain at least two groups of heterogeneous base classifiers by using the training data set;
and cyclically executing, for a specified number of rounds, the following steps of generating and scoring base classifier combinations:
generating a plurality of base classifier combinations; each base classifier combination is obtained by assigning a weight to each base classifier through an evolutionary algorithm, using the previous round's base classifier combinations and their scores together with the weights assigned to the base classifiers they include, and then selecting at least one base classifier from each group of base classifiers in descending order of weight; in the first round, the weights are assigned by random generation;
predicting the data in the verification data set by using each base classifier combination together with the weights assigned to the base classifiers it includes, and calculating the score of each base classifier combination according to the prediction result;
and determining the highest-scoring base classifier combination across all rounds, and combining it with the weights corresponding to the base classifiers it includes to obtain the selectively integrated heterogeneous model for entity object classification prediction.
One or more embodiments of the present specification further provide an entity object classification apparatus selectively integrating heterogeneous models, including:
an acquisition module for acquiring a training data set and a verification data set; the training dataset and the validation dataset comprise entity object data;
the training module is used for training to obtain at least two groups of heterogeneous base classifiers by utilizing the training data set;
the base classifier combination generation and scoring module is used for cyclically executing, for a specified number of rounds, the following steps of generating and scoring base classifier combinations:
generating a plurality of base classifier combinations; each base classifier combination is obtained by assigning a weight to each base classifier through an evolutionary algorithm, using the previous round's base classifier combinations and their scores together with the weights assigned to the base classifiers they include, and then selecting at least one base classifier from each group of base classifiers in descending order of weight;
predicting the data in the verification data set by using each base classifier combination together with the weights assigned to the base classifiers it includes, and calculating the score of each base classifier combination according to the prediction result;
and the classification module is used for determining the highest-scoring base classifier combination across all rounds and combining it with the weights corresponding to the base classifiers it includes to obtain the selectively integrated heterogeneous model for entity object classification prediction.
One or more embodiments of the present specification also provide an electronic device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the method when executing the program.
One or more embodiments of the present specification also provide a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method.
As can be seen from the above, the entity object classification method for selectively integrating heterogeneous models and the related apparatus provided in one or more embodiments of the present specification propose a solution for the selective integration of heterogeneous models: heterogeneous base classifiers are included in ensemble learning; in a learning stage, each type of base classifier is trained under different parameter combinations to obtain multiple models; and in a selection stage, for each type of model, one or more of the models are selected as components of the final model. In this way, the respective strengths of the different models can be fully exploited and made complementary, the robustness and effectiveness of the overall model are improved, and entity object classification can be completed better.
Drawings
In order to more clearly illustrate one or more embodiments or prior art solutions of the present specification, the drawings that are needed in the description of the embodiments or prior art will be briefly described below, and it is obvious that the drawings in the following description are only one or more embodiments of the present specification, and that other drawings may be obtained by those skilled in the art without inventive effort from these drawings.
FIG. 1 is a schematic diagram of an entity object classification system for selectively integrating heterogeneous models provided in one or more embodiments of the present description;
FIG. 2 is a flowchart of a method for entity object classification of selectively integrated heterogeneous models according to one or more embodiments of the present disclosure;
FIG. 3 is another schematic flow diagram of a method for entity object classification for selectively integrating heterogeneous models according to one or more embodiments of the present disclosure;
FIG. 4 is a block diagram of an entity object classification apparatus selectively integrating heterogeneous models according to one or more embodiments of the present disclosure;
fig. 5 is a schematic diagram of a hardware structure of an electronic device according to one or more embodiments of the present disclosure.
Detailed Description
For the purpose of promoting a better understanding of the objects, aspects and advantages of the present disclosure, reference is made to the following detailed description taken in conjunction with the accompanying drawings.
It is to be noted that unless otherwise defined, technical or scientific terms used in one or more embodiments of the present specification should have the ordinary meaning as understood by those of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in one or more embodiments of the specification is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect.
Supervised learning: a research area of machine learning. Given data comprising a large number of labeled samples, models are built based on such training data to predict test samples. Each sample is represented as a feature vector describing its features, and all samples carry labeling information (e.g., labeled as positive or negative) representing their attributes.
Ensemble learning: a research area of machine learning that combines multiple base learners in an attempt to achieve generalization performance superior to that of a single learner.
Homogeneous model: when a plurality of base classifiers in ensemble learning belong to the same type of classifier (such as neural network models), the models are said to be homogeneous.
Heterogeneous model: when a plurality of base classifiers in ensemble learning belong to different types of classifiers (such as a support vector machine, a neural network, a random forest, and the like), the models are said to be heterogeneous.
As an embodiment of ensemble learning, a plurality of homogeneous base classifiers (e.g., 5) may be trained based on the same learning algorithm (e.g., a neural network), and the average of their prediction results is used as the final prediction result of the model. However, integration schemes based on homogeneous models, limited by the characteristics of that model type, may not be advantageous on certain tasks. Meanwhile, simply averaging the prediction results of the various models without screening them may lead to an undesirable overall effect because some individual models perform poorly.
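As a concrete illustration of this baseline (and of its limitation of treating all members equally), the following is a minimal sketch in Python with scikit-learn; the dataset, model type, and hyperparameters are illustrative assumptions, not taken from this specification:

```python
# Homogeneous-ensemble baseline: train several models of the same type
# and average their predictions without any screening or weighting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, random_state=0)
X_train, y_train, X_test = X[:500], y[:500], X[500:]

# Five base classifiers of the same type (a homogeneous ensemble).
models = [MLPClassifier(max_iter=300, random_state=seed).fit(X_train, y_train)
          for seed in range(5)]

# The final prediction is the simple average of all members' outputs.
avg_proba = np.mean([m.predict_proba(X_test)[:, 1] for m in models], axis=0)
```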
FIG. 1 illustrates a schematic diagram of an entity object classification system for selectively integrating heterogeneous models provided by one or more embodiments of the present specification.
As shown in FIG. 1, the entity object classification system for selectively integrating heterogeneous models trains at least two groups of heterogeneous base classifiers based on different learning algorithms, using the training data in a training data set. Here, heterogeneity may mean that at least one of the at least two groups of base classifiers is of a different type from the other groups; that is, heterogeneous base classifiers exist among the at least two groups. For example, if three groups of base classifiers are obtained through training and at least one group (e.g., neural networks) is of a different type from the other two groups (e.g., decision trees), then in the finally obtained target classification model (the selectively integrated heterogeneous model) the different types of base classifiers can each contribute their own characteristics, so that the target classification model as a whole is suitable for more application scenarios.
For example, it is assumed that three groups of base classifiers (the types of the base classifiers in the same group of base classifiers are the same) are respectively trained based on three learning algorithms (e.g., support vector machine, neural network, random forest), and the groups in the three groups of base classifiers are heterogeneous to each other. Therefore, the target classification model obtained after final selective integration can have the characteristics of three types of base classifiers, and is suitable for more scenes.
In one or more embodiments of the present description, after at least two groups of base classifiers are obtained through training, the following steps of generating base classifier combinations and scoring them are performed for a specified number of rounds:
generating a plurality of base classifier combinations; each base classifier combination is obtained by assigning a weight to each base classifier through an evolutionary algorithm, using the previous round's base classifier combinations and their scores together with the weights assigned to the base classifiers they include, and then selecting at least one base classifier from each group of base classifiers in descending order of weight;
and predicting the data in the verification data set by using each base classifier combination together with the weights assigned to the base classifiers it includes, and calculating the score of each base classifier combination according to the prediction result.
The specified number of rounds is set as required and may be, for example, 10, 15, or 20.
A base classifier combination is obtained by selecting a certain number of base classifiers from each group according to a certain rule and combining them. For one base classifier combination, the generation process may include the following (see the sketch after this paragraph). First, each base classifier in the at least two groups of trained heterogeneous base classifiers is assigned a weight; the weights may be assigned through an evolutionary algorithm using the previous round's base classifier combinations and their scores together with the weights assigned to the base classifiers they include, while for the combinations of the first round the weights are obtained by random generation. Second, at least one base classifier is selected from each group in descending order of weight. For example, suppose a group contains 4 base classifiers with weights 0.1, 0.2, 0.3, and 0.4; if one base classifier is to be selected from the group in descending order of weight, the classifier with weight 0.4 is selected; if two are to be selected, those with weights 0.3 and 0.4 are selected, and so on. Finally, the base classifiers selected from each group are collected into the base classifier combination.
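The per-group selection just described can be sketched as follows; build_combination is a hypothetical helper, and the group contents and weights are illustrative:

```python
import numpy as np

def build_combination(groups, weights, n_select=1):
    """For each group of base classifiers, keep the n_select classifiers
    with the largest assigned weights; return (classifier, weight) pairs."""
    combination = []
    for group, w in zip(groups, weights):
        top = np.argsort(w)[::-1][:n_select]  # indices of the largest weights
        combination.extend((group[j], w[j]) for j in top)
    return combination

# Example: a group of 4 classifiers with weights 0.1, 0.2, 0.3, 0.4
# (strings stand in for trained models); the 0.4 classifier is kept.
combo = build_combination([["clf_a", "clf_b", "clf_c", "clf_d"]],
                          [np.array([0.1, 0.2, 0.3, 0.4])], n_select=1)
```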
In one or more embodiments of the present description, when a plurality of base classifier combinations are generated, the weights assigned to the base classifiers may differ across combinations. For example, when weights are assigned to the base classifiers (by an evolutionary algorithm or by random generation), several groups of weights are generated, e.g., 10 groups; for each group of weights a base classifier combination is generated according to the foregoing method, so that the several groups of weights finally yield several base classifier combinations, in which both the selected base classifiers and their assigned weights may differ.
In one or more embodiments of the present disclosure, after the plurality of base classifier combinations are generated, each combination, together with the weights assigned to the base classifiers it includes (a weighted base classifier combination at this point already corresponds to a selectively integrated heterogeneous model), is used to predict the data in the verification data set, and the score of each combination is calculated from the prediction results. After the scores are obtained, in the next round of generation and scoring, the previous round's combinations and their scores, together with the weights assigned to the base classifiers they include, are used to assign new weights to each base classifier through an evolutionary algorithm (again possibly several groups of weights); at least one base classifier is then selected from each group in descending order of weight to obtain new combinations, which are again scored by predicting the data in the verification data set. The generation and scoring steps are executed cyclically until the specified number of rounds is reached.
Finally, the base classifier combinations of all rounds each have a corresponding score. The highest-scoring combination, together with the weights corresponding to the base classifiers it includes, yields the final selectively integrated heterogeneous model, which can be used for classification prediction.
In one or more embodiments of the present description, the aforementioned weights assigned to the base classifiers may be regarded as weight vectors. For example, the entity object classification system of the selectively integrated heterogeneous model may determine a first predetermined number (e.g., 10) of weight vectors (each including the weight assigned to every base classifier) for the at least two groups of base classifiers according to a combination strategy commonly used in ensemble learning. Here, a weight vector is the vector formed by combining the weights of all base classifiers in the at least two groups. For one group of base classifiers, the corresponding part of the vector may be called a sub-weight vector. For example, if three groups of base classifiers are obtained by training, with the first group corresponding to the first sub-weight vector, the second group to the second, and the third group to the third, then the weight vector is the combination of the three sub-weight vectors. More than one weight vector may be determined: a first predetermined number of weight vectors, e.g., 10 groups, is determined according to a preset value. Optionally, in the first round, the first predetermined number of weight vectors is obtained by random generation.
After the first predetermined number of weight vectors is obtained, the entity object classification system for selectively integrating heterogeneous models selectively integrates the base classifiers in the at least two groups according to the weight vectors. Specifically, for each weight vector, a second predetermined number (e.g., 1) of weights is selected from each sub-weight vector according to the weight values, and the values of the remaining weights in the sub-weight vector are set to 0, yielding a first predetermined number of corrected weight vectors. The second predetermined number is a value chosen as needed; for example, if it is 1, one base classifier is selected from each group. The selection may, for example, keep the weight with the largest value in each sub-weight vector and set the remaining weights to 0, where a weight set to zero means the corresponding base classifier is not selected and the base classifiers whose weights are not zeroed are the selected ones; the corrected sub-weight vectors are then combined into a corrected weight vector. Processing every weight vector in this way yields a first predetermined number (e.g., 10 groups) of corrected weight vectors.
Then, the entity object classification system predicts the data in the verification data set using the first predetermined number of corrected weight vectors in combination with the at least two groups of base classifiers, and calculates the score of each corrected weight vector according to the prediction results. The scoring may employ a model performance evaluation method commonly used in machine learning, for example the receiver operating characteristic (ROC) curve or the area under the ROC curve (AUC). Optionally, the corrected weight vectors may first be normalized and then used for prediction, so that the performance of the different corrected weight vectors is more comparable. A minimal sketch of this scoring step follows.
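The sketch assumes AUC as the evaluation index and scikit-learn-style models that expose predict_proba (an assumption of this sketch, not a requirement of the specification):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def score_combination(classifiers, weights, X_val, y_val):
    """Score one weighted base classifier combination on the validation
    set by AUC; the weights are normalized first for comparability."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()  # the normalization mentioned above
    proba = sum(wi * clf.predict_proba(X_val)[:, 1]
                for wi, clf in zip(w, classifiers))
    return roc_auc_score(y_val, proba)
```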
Then, the entity object classification system of the selectively integrated heterogeneous model may regenerate a first predetermined number of new weight vectors using an evolutionary algorithm in combination with the corrected weight vectors and their scores, and repeat the steps from computing the corrected weight vectors through computing their scores with the regenerated weight vectors, obtaining a new round of corrected weight vectors and scores; the previous steps are repeated until the specified number of rounds is reached, finally yielding every corrected weight vector and its score across all rounds.
After the specified number of scoring rounds is completed, the corrected weight vector with the highest score among all rounds is determined, and a second predetermined number of base classifiers is selected from each group of base classifiers based on that highest-scoring corrected weight vector.
Finally, the entity object classification system of the selectively integrated heterogeneous model combines the selected base classifiers according to the corresponding corrected weight vector to obtain the target classification model (the selectively integrated heterogeneous model) for classification prediction. The target classification model comprises the selected base classifiers and their weights; when classifying data, the weighted average of the base classifiers' classification prediction results directly gives the final classification prediction result.
Optionally, the evolutionary algorithm employs at least one of genetic algorithms, genetic programming, evolution strategies, and evolutionary programming.
In the entity object classification system for selectively integrating heterogeneous models provided in one or more embodiments of the present specification, a solution for the selective integration of heterogeneous models is proposed: heterogeneous base classifiers (such as support vector machines, neural networks, random forests, gradient boosting decision trees (GBDTs), etc.) are included in ensemble learning; each type of base classifier is trained under different parameter combinations in a learning stage to obtain multiple models; and in a selection stage one or more of them are selected as components of the final model. In this way, the respective strengths of the different models are fully exploited and made complementary, and the robustness and effectiveness of the overall model are improved. Moreover, by training multiple models for each type of base classifier, the best effect of each type of model under different parameter settings can be fully explored, further improving overall performance.
In one or more embodiments of the present description, the entity object classification system that selectively integrates heterogeneous models can be used to classify various entity objects. The entity object in one or more embodiments of the present specification may be, for example, any one of a user, a device, or an account of a user (which may also be referred to simply as an account).
For example, for a user, user properties (e.g., legal or illegal), user status (e.g., risky or non-risky), and the like may be classified. Similarly, account properties (e.g., legal or illegal), account status (e.g., risky or no risk), etc. may also be classified for the user's account, and device properties (e.g., legal or illegal), device status (e.g., risky or no risk), etc. may also be classified for the device.
In one or more embodiments of the present description, the entity object classification system for selectively integrating heterogeneous models can be used to classify user properties (e.g., classify users as legitimate or illegitimate); the training data set and the verification data set comprise at least one of user basic information, user dynamic information, and user relationship information. The user basic information comprises at least one of gender, age, and educational background; the user dynamic information comprises at least one of the user's browsing records and consumption records within a preset period; the user relationship information comprises at least one of the number of friends and the basic information of friends, where the basic information of a friend comprises at least one of the friend's gender, age, and educational background.
It can be seen that the user basic information, user dynamic information, and user relationship information contain features of different types in the training data. Data such as age and consumption information are usually continuous features, while data such as gender and educational background are usually discrete features, and different feature types suit different base classifiers. For example, continuous features are better trained with tree models (e.g., GBDT, random forest), while discrete features are better trained with neural network models. Therefore, with heterogeneous models selectively integrated for the different feature types in the training data, the finally obtained target classification model can complete the task better.
Similarly, for the account and the device, the training data set and the verification data set may be obtained by collecting the account/device basic information, the account/device dynamic information, and the account/device relationship information, which are not described herein again.
Fig. 2 is a flowchart illustrating an entity object classification method for selectively integrating heterogeneous models according to one or more embodiments of the present disclosure.
As shown in fig. 2, the entity object classification method for selectively integrating heterogeneous models includes:
step 102: a training dataset and a validation dataset are acquired.
Optionally, the data in the training data set and the validation data set are both provided with classification labels. For example, if the entity object classification method of the selective integration heterogeneous model is used for classifying user properties, the classification labels are user property labels, such as legal users or illegal users.
Step 104: training to obtain at least two groups of heterogeneous base classifiers by using the training data set; wherein at least one of the at least two sets of base classifiers has a type different from the other sets of base classifiers; that is, there are heterogeneous basis classifiers in the at least two sets of basis classifiers.
Optionally, the base classifier comprises at least one of a logistic regression model, a support vector machine model, a decision tree model, a gradient boosting decision tree (GBDT) model, a random forest model, and a neural network model.
For example, if three groups of base classifiers are obtained through training and at least one group (for example, neural networks) is of a different type from the other two groups (for example, decision trees), then in the finally obtained target classification model the different types of base classifiers can each contribute their own characteristics, so that the target classification model as a whole is suitable for more application scenarios.
Optionally, each of the at least two groups of base classifiers is of a different type from the base classifiers of the other groups.
For example, it is assumed that three groups of base classifiers (the base classifiers within one group being of the same type) are respectively trained based on three learning algorithms (e.g., support vector machine, neural network, random forest), and the groups are heterogeneous to each other. The target classification model obtained after the final selective integration can then have the characteristics of all three types of base classifiers and is suitable for more scenarios; a minimal sketch of training such heterogeneous groups follows.
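The sketch assumes scikit-learn models; the two model types (GBDT and neural network) and the parameter values are illustrative choices, not requirements of the method:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

def train_heterogeneous_groups(X_train, y_train):
    """Train two mutually heterogeneous groups of base classifiers:
    a GBDT (tree-based) group and a neural-network group."""
    gbdt_group = [GradientBoostingClassifier(n_estimators=n).fit(X_train, y_train)
                  for n in (50, 100, 200)]
    mlp_group = [MLPClassifier(hidden_layer_sizes=(h,), max_iter=300,
                               random_state=0).fit(X_train, y_train)
                 for h in (16, 64, 128)]
    return [gbdt_group, mlp_group]
```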
The following base classifier combination generation step 106 and scoring step 108 are executed in a loop for a specified number of rounds. The specified number of rounds is set as required and may be, for example, 10, 15, or 20 rounds.
Step 106: several combinations of base classifiers are generated.
In this step, each base classifier combination is obtained by assigning a weight to each base classifier through an evolutionary algorithm, using the previous round's base classifier combinations and their scores together with the weights assigned to the base classifiers they include, and then selecting at least one base classifier from each group of base classifiers in descending order of weight.
A base classifier combination is obtained by selecting a certain number of base classifiers from each group according to a certain rule and combining them. For one base classifier combination, the generation process may include the following. First, each base classifier in the at least two groups of trained heterogeneous base classifiers is assigned a weight; the weights may be assigned through an evolutionary algorithm using the previous round's base classifier combinations and their scores together with the weights assigned to the base classifiers they include, while for the combinations of the first round the weights are obtained by random generation. Second, at least one base classifier is selected from each group in descending order of weight. For example, suppose a group contains 4 base classifiers with weights 0.1, 0.2, 0.3, and 0.4; if one base classifier is to be selected from the group in descending order of weight, the classifier with weight 0.4 is selected; if two are to be selected, those with weights 0.3 and 0.4 are selected, and so on. Finally, the base classifiers selected from each group are collected into the base classifier combination.
In one or more embodiments of the present description, when a plurality of base classifier combinations are generated, the weights assigned to the base classifiers may differ across combinations. For example, when weights are assigned to the base classifiers (by an evolutionary algorithm or by random generation), several groups of weights are generated, e.g., 10 groups; for each group of weights a base classifier combination is generated according to the foregoing method, so that the several groups of weights finally yield several base classifier combinations, in which both the selected base classifiers and their assigned weights may differ.
In one or more embodiments of the present description, the aforementioned weights assigned to the base classifiers may be regarded as weight vectors. Optionally, a first predetermined number (e.g., 10) of weight vectors (each including the weight assigned to every base classifier) may be determined for the at least two groups of base classifiers according to a combination strategy commonly used in ensemble learning.
In this step, a weight vector is the vector formed by combining the weights of all base classifiers in the at least two groups. For one group of base classifiers, the corresponding part of the vector may be called a sub-weight vector. For example, if three groups of base classifiers are obtained by training, with the first group corresponding to the first sub-weight vector, the second group to the second, and the third group to the third, then the weight vector is the combination of the three sub-weight vectors. More than one weight vector may be determined: a first predetermined number, e.g., 10 groups, is determined according to a preset value. Optionally, in the first round, the first predetermined number of weight vectors is obtained by random generation.
After the first predetermined number of weight vectors is obtained, the entity object classification system selectively integrates the base classifiers in the at least two groups according to the weight vectors. Specifically, for each weight vector, a second predetermined number (e.g., 1) of weights is selected from each sub-weight vector according to the weight values, and the values of the remaining weights in the sub-weight vector are set to 0, yielding a first predetermined number of corrected weight vectors. The second predetermined number is a value chosen as needed; for example, if it is 1, one base classifier is selected from each group. The selection may, for example, keep the weight with the largest value in each sub-weight vector and set the remaining weights to 0, where a weight set to zero means the corresponding base classifier is not selected and the base classifiers whose weights are not zeroed are the selected ones; the corrected sub-weight vectors are then combined into a corrected weight vector. Processing every weight vector in this way yields a first predetermined number (e.g., 10 groups) of corrected weight vectors.
Step 108: predict the data in the verification data set by using each base classifier combination together with the weights assigned to the base classifiers it includes, and calculate the score of each base classifier combination according to the prediction result.
In one or more embodiments of the present disclosure, after the plurality of base classifier combinations are generated, each combination, together with the weights assigned to the base classifiers it includes (a weighted base classifier combination at this point already corresponds to a selectively integrated heterogeneous model), is used to predict the data in the verification data set, and the score of each combination is calculated from the prediction results. After the scores are obtained, in the next round of generation and scoring, the previous round's combinations and their scores, together with the weights assigned to the base classifiers they include, are used to assign new weights to each base classifier through an evolutionary algorithm (again possibly several groups of weights); at least one base classifier is then selected from each group in descending order of weight to obtain new combinations, which are again scored by predicting the data in the verification data set. The generation and scoring steps are executed cyclically until the specified number of rounds is reached.
Alternatively, the scoring may employ a model performance evaluation method commonly used in machine learning, for example the receiver operating characteristic (ROC) curve or the area under the ROC curve (AUC).
In one or more embodiments of the present specification, when the weights assigned to the base classifiers are regarded as weight vectors, the data in the verification data set may be predicted using the first predetermined number of corrected weight vectors in combination with the at least two groups of base classifiers, and the score of each corrected weight vector calculated from the prediction results. Optionally, the corrected weight vectors used for prediction may first be normalized, so that the performance of the different corrected weight vectors is more comparable.
After steps 106 and 108 have been performed for the specified number of rounds, the following step may be performed.
Step 110: determine the highest-scoring base classifier combination across all rounds, and combine it with the weights corresponding to the base classifiers it includes to obtain the selectively integrated heterogeneous model for classification prediction.
In this step, the base classifier combinations of all rounds each have a corresponding score; the highest-scoring combination, together with the weights corresponding to its base classifiers, yields the final selectively integrated heterogeneous model, which can be used for classification prediction.
In this step, the selectively integrated heterogeneous model comprises the selected base classifiers and their weights (the weights assigned to the base classifiers when the combination was generated); when classifying data, the weighted average of the base classifiers' classification prediction results directly gives the final classification prediction result.
In the entity object classification method for selectively integrating heterogeneous models provided in one or more embodiments of the present specification, a solution for the selective integration of heterogeneous models is proposed: heterogeneous base classifiers are included in ensemble learning; each type of base classifier is trained under different parameter combinations in a learning stage to obtain a plurality of models; and in a selection stage one or more of the models are selected as components of the final model. In this way, the respective strengths of the different models are fully exploited and made complementary, and the robustness and effectiveness of the overall model are improved. Moreover, by training multiple models for each type of base classifier, the best effect of each type of model under different parameter settings can be fully explored, further improving overall performance.
In one or more embodiments of the present description, the entity object classification method of the selectively integrated heterogeneous model may be used to classify various entity objects. The entity object in one or more embodiments of the present specification may be, for example, any one of a user, a device, or an account of a user (which may also be referred to simply as an account).
For example, for a user, user properties (e.g., legal or illegal), user status (e.g., risky or non-risky), and the like may be classified. Similarly, account properties (e.g., legal or illegal), account status (e.g., risky or no risk), etc. may also be classified for the user's account, and device properties (e.g., legal or illegal), device status (e.g., risky or no risk), etc. may also be classified for the device.
In one or more embodiments of the present description, the entity object classification method of the selectively integrated heterogeneous model can be used to classify user properties (e.g., classify a user as a legitimate user or an illegitimate user); the training data set and the verification data set comprise at least one of user basic information, user dynamic information, and user relationship information. The user basic information comprises at least one of gender, age, and educational background; the user dynamic information comprises at least one of the user's browsing records and consumption records within a preset period; the user relationship information comprises at least one of the number of friends and the basic information of friends, where the basic information of a friend comprises at least one of the friend's gender, age, and educational background.
It can be seen that the user basic information, user dynamic information, and user relationship information contain features of different types in the training data. Data such as age and consumption information are usually continuous features, while data such as gender and educational background are usually discrete features, and different feature types suit different base classifiers. For example, continuous features are better trained with tree models (e.g., GBDT, random forest), while discrete features are better trained with neural network models. Therefore, with heterogeneous models selectively integrated for the different feature types in the training data, the finally obtained target classification model can complete the task better.
Similarly, for the account and the device, the training data set and the verification data set may be obtained by collecting the account/device basic information, the account/device dynamic information, and the account/device relationship information, which are not described herein again.
Fig. 3 illustrates another flow diagram of an entity object classification method for selectively integrating heterogeneous models according to one or more embodiments of the present disclosure.
As shown in fig. 3, the entity object classification method for selectively integrating heterogeneous models includes:
step 202: obtaining a training data set DTAnd validating the data set DV
Optionally, the data in the training data set and the validation data set are both provided with classification labels.
Step 204: training by using the training data set to obtain at least two groups of base classifiers; wherein at least one of the at least two groups of base classifiers has a type different from the other groups of base classifiers.
Optionally, the base classifier comprises at least one of a logistic regression model (LR), a decision tree model, a random forest model, a neural network model.
Optionally, each of the at least two groups of base classifiers is of a different type from the base classifiers of the other groups.
For example, select n types of base classifiers M_1, M_2, …, M_n. Referring to FIG. 1, M_1 represents a logistic regression model, M_2 a random forest model, and M_n a neural network model. The parameter k is the number of base classifiers trained in each group of base classifiers.
For example, for each group of base classifiers, k candidate models are trained based on different parameters, denoted M_11, M_12, …, M_1k, M_21, M_22, …, M_2k, …, M_n1, M_n2, …, M_nk, where M_ij denotes the sub-model of base classifier M_i obtained under the j-th group of parameters, giving n × k candidate models in total.
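The n × k grid of candidate models can be laid out so that groups[i][j] corresponds to M_ij; the model types and parameter grids in this sketch are assumptions, not taken from the specification:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

# n base classifier types, each trained under k parameter settings,
# yielding the n * k candidate models M_ij.
PARAM_GRIDS = [
    (LogisticRegression, [{"C": 0.1, "max_iter": 1000},
                          {"C": 1.0, "max_iter": 1000}]),
    (RandomForestClassifier, [{"n_estimators": 50}, {"n_estimators": 200}]),
    (MLPClassifier, [{"hidden_layer_sizes": (32,), "max_iter": 300},
                     {"hidden_layer_sizes": (128,), "max_iter": 300}]),
]

def train_candidates(X_train, y_train):
    """groups[i][j] = M_ij, the sub-model of base classifier type i
    trained under the j-th group of parameters."""
    return [[cls(**params).fit(X_train, y_train) for params in grid]
            for cls, grid in PARAM_GRIDS]
```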
Step 206: determine a first predetermined number (e.g., 10) of weight vectors ω for the at least two groups of base classifiers, where each term in the weight vector ω is a number between 0 and 1, and each weight vector comprises, for each group of base classifiers i, a sub-weight vector ω_i1, ω_i2, …, ω_ik.
Step 208: for each of said weight vectors, according to each sub-weight vector ωi1i2,…,ωikSelecting a second predetermined number of weights and setting the values of the rest weights in the sub-weight vector as 0 to obtain a first predetermined number of corrected weight vectors.
For example, for each sub-weight vector the single weight with the largest value may be selected. Suppose the largest term in the first sub-weight vector is ω_11; after the processing of step 208, the sub-weight vector ω_11, ω_12, …, ω_1k becomes ω_11, 0, …, 0, in which only one term is non-zero (meaning that only one base classifier is selected from each class). Suppose the largest term in the second sub-weight vector is ω_22, so that ω_21, ω_22, …, ω_2k becomes 0, ω_22, 0, …, 0. By analogy, suppose the largest term in the n-th sub-weight vector is ω_nk, so that ω_n1, ω_n2, …, ω_nk becomes 0, …, 0, ω_nk. The resulting corrected weight vector is then: ω_11, 0, …, 0, 0, ω_22, 0, …, 0, …, 0, …, 0, ω_nk.
As another example, for each sub-weight vector the two weights with the largest values may be selected. Suppose the two largest terms in the first sub-weight vector are ω_11 and ω_12; after the processing of step 208, the sub-weight vector ω_11, ω_12, …, ω_1k becomes ω_11, ω_12, 0, …, 0, in which only two terms are non-zero (meaning that two base classifiers are selected from each class). Suppose the two largest terms in the second sub-weight vector are ω_22 and ω_23, so that ω_21, ω_22, …, ω_2k becomes 0, ω_22, ω_23, 0, …, 0. And so on; suppose the two largest terms in the n-th sub-weight vector are ω_n(k-1) and ω_nk, so that ω_n1, ω_n2, …, ω_nk becomes 0, …, 0, ω_n(k-1), ω_nk. The resulting corrected weight vector is then: ω_11, ω_12, 0, …, 0, 0, ω_22, ω_23, 0, …, 0, …, 0, …, 0, ω_n(k-1), ω_nk.
Other examples refer to the principles of the previous embodiments and are not described in detail herein.
Step 210: normalize the corrected weight vector so that the sum of all weight values in the corrected weight vector is 1.
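Steps 208 and 210 together can be sketched as follows; correct_and_normalize is a hypothetical helper operating on a flat weight vector whose per-group layout is given by group_sizes:

```python
import numpy as np

def correct_and_normalize(weight_vector, group_sizes, n_keep=1):
    """Step 208: within each sub-weight vector, keep the n_keep largest
    weights and zero the rest; step 210: normalize so the sum is 1."""
    corrected = np.zeros_like(weight_vector, dtype=float)
    start = 0
    for size in group_sizes:                  # one sub-weight vector per group
        sub = weight_vector[start:start + size]
        top = np.argsort(sub)[::-1][:n_keep]  # the n_keep largest weights
        corrected[start + top] = sub[top]
        start += size
    return corrected / corrected.sum()

# Example: two groups of 3 weights each, keeping the largest per group.
w = np.array([0.2, 0.7, 0.1, 0.3, 0.1, 0.6])
print(correct_and_normalize(w, group_sizes=[3, 3]))  # [0, 0.538.., 0, 0, 0, 0.461..]
```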
Step 212: predict the data in the verification data set using the first predetermined number of normalized corrected weight vectors in combination with the at least two groups of base classifiers, and calculate the score of each corrected weight vector according to the prediction results.
For example, let y denote the labels of the validation data set D_V (a vector whose length equals the number of prediction samples), and let P denote the matrix obtained by concatenating the prediction results p_ij of the individual models (a matrix whose length equals the number of prediction samples and whose width equals the number n × k of candidate models). With ω denoting a corrected weight vector, the product Pω, which can be calculated for different values of ω, represents the final result of integrating the prediction results of the different models according to the weight vector ω. Based on y and Pω, various evaluation indexes can be calculated to evaluate the quality of the current weight vector. The optimization goal of this embodiment is to obtain a suitable ω that yields a better evaluation index.
Optionally, predicting the data in the verification data set using the first predetermined number of normalized corrected weight vectors in combination with the at least two groups of base classifiers, and calculating the score of each corrected weight vector according to the prediction results, includes the following (see the sketch after this list):
selecting, from the at least two groups of base classifiers, the base classifiers corresponding to the non-zero weights in the normalized corrected weight vector, so as to form a classification model;
inputting the data in the verification data set into the classification model and performing weighted prediction according to the normalized corrected weight vector to obtain a prediction result;
and obtaining the score corresponding to the corrected weight vector from the prediction result according to a preset model performance evaluation method.
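The sketch below assumes AUC as the preset model performance evaluation method; P stacks each candidate model's validation predictions column-wise, so the integrated prediction is the product Pω as described above:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def score_weight_vector(candidates, omega, X_val, y_val):
    """candidates: flat list of the n*k models; omega: a normalized
    corrected weight vector. Only non-zero-weight models are consulted."""
    P = np.zeros((len(y_val), len(candidates)))  # samples x candidate models
    for j, (clf, w) in enumerate(zip(candidates, omega)):
        if w > 0:                                # zero weight: not selected
            P[:, j] = clf.predict_proba(X_val)[:, 1]
    integrated = P @ omega                       # P times omega, the ensemble output
    return roc_auc_score(y_val, integrated)
```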
Step 214: regenerate a first predetermined number of weight vectors using an evolutionary algorithm in combination with the corrected weight vectors and their scores, and repeat the steps from computing the corrected weight vectors through computing their scores with the regenerated weight vectors, obtaining a new round of corrected weight vectors and scores.
In this step, the corrected weight vectors and scores calculated in the preceding steps are combined through an evolutionary algorithm to regenerate the first predetermined number of weight vectors, so that the evolutionary algorithm can optimize the weight vectors. Optionally, when the weight vectors are to be regenerated by an evolutionary algorithm, in the foregoing step 206, i.e., in the first round, the first predetermined number of weight vectors of the at least two groups of base classifiers may be determined by random generation.
Optionally, the evolutionary algorithm employs at least one of Genetic Algorithms, Genetic Programming, Evolution Strategies, and Evolutionary Programming.
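One possible regeneration step, sketched as a toy genetic algorithm with fitness-proportional selection, uniform crossover, and Gaussian mutation, is shown below; this is only one of the evolutionary algorithms the embodiment permits, and it assumes the scores are non-negative (as AUC values would be).

```python
import numpy as np

rng = np.random.default_rng(0)

def regenerate(population, scores, mutation_scale=0.05):
    """Regenerate a population of weight vectors from the previous
    round's corrected vectors and their scores."""
    pop = np.asarray(population, dtype=float)
    fit = np.asarray(scores, dtype=float)
    probs = fit / fit.sum()  # fitness-proportional selection
    children = []
    for _ in range(len(pop)):
        pa, pb = pop[rng.choice(len(pop), size=2, p=probs)]
        mask = rng.random(pa.shape) < 0.5             # uniform crossover
        child = np.where(mask, pa, pb)
        child += rng.normal(0.0, mutation_scale, child.shape)  # mutation
        children.append(np.clip(child, 0.0, None))    # keep weights non-negative
    return np.stack(children)
```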
Step 216: the previous steps are repeated until a specified number of rounds (e.g. 10 rounds) is reached.
Step 218: determine the correction weight vector with the highest score among all the correction weight vectors obtained over all rounds, and select a second predetermined number of base classifiers from each group of base classifiers based on that highest-scoring correction weight vector.
Step 220: combine the selected base classifiers to obtain a target classification model for classification prediction. For a new sample to be classified, a weighted prediction is made simply by combining the prediction results of the models in the target classification model with their respective weights.
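Tying steps 206 to 220 together, a minimal end-to-end sketch of the selection stage might read as follows. It reuses the helper functions sketched above (correct_weights, score_weight_vector, regenerate), and the population size, round count, and other names are illustrative assumptions.

```python
import numpy as np

def selective_ensemble(classifiers, n_groups, k, X_val, y_val,
                       pop_size=20, rounds=10, keep=2):
    """Score corrected weight vectors on the verification data, evolve
    them for a fixed number of rounds, and keep the base classifiers
    with non-zero weight in the best vector found."""
    # prediction matrix P: one column per candidate model
    P = np.column_stack([c.predict_proba(X_val)[:, 1] for c in classifiers])
    population = np.random.default_rng(1).random((pop_size, n_groups * k))
    best_omega, best_score = None, -np.inf
    for _ in range(rounds):
        corrected = np.stack([correct_weights(w, n_groups, k, keep)
                              for w in population])
        scores = np.array([score_weight_vector(P, y_val, w) for w in corrected])
        if scores.max() > best_score:                 # track the best so far
            best_score, best_omega = scores.max(), corrected[scores.argmax()]
        population = regenerate(corrected, scores)    # step 214
    # the target classification model: base classifiers with non-zero weight
    return [(c, w) for c, w in zip(classifiers, best_omega) if w > 0]
```

For a new sample, the returned (classifier, weight) pairs are combined exactly as described in step 220.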
As an embodiment, the entity object classification method of the selectively integrated heterogeneous model is used for classifying user properties. The training data set and the verification data set include at least one of user basic information, user dynamic information, and user relationship information. The user basic information includes at least one of gender, age, and educational background; the user dynamic information includes at least one of the user's browsing records and consumption records within a preset period; and the user relationship information includes at least one of the number of friends and the friends' basic information, where a friend's basic information includes at least one of the friend's gender, age, and educational background.
In the entity object classification method for selectively integrating heterogeneous models provided by one or more embodiments of the present specification, multiple heterogeneous base classifiers are introduced, so that the strengths of each base classifier are fully exploited and the overall model is made more robust. For each type of base classifier, a selective-integration scheme is adopted, which better exploits the performance of each base classifier, improves the overall effect, and reduces the strong dependence of the data on any specific model. Moreover, the method is simple to implement: in the candidate-model selection process, an evolutionary algorithm with weight-vector correction is adopted, which can efficiently obtain a good solution to the complex, heavily constrained optimization problem of this embodiment.
It should be noted that the method of one or more embodiments of the present disclosure may be performed by a single device, such as a computer or server. The method of the embodiment can also be applied to a distributed scene and completed by the mutual cooperation of a plurality of devices. In such a distributed scenario, one of the devices may perform only one or more steps of the method of one or more embodiments of the present disclosure, and the devices may interact with each other to complete the method.
Fig. 4 is a block diagram illustrating a classification apparatus for selectively integrating heterogeneous models according to one or more embodiments of the present disclosure.
As shown in fig. 4, the classification apparatus for selectively integrating heterogeneous models includes:
an obtaining module 301, configured to obtain a training data set and a verification data set;
a training module 302, configured to train to obtain at least two sets of heterogeneous base classifiers by using the training data set;
a base classifier combination generation and scoring module 303, configured to cyclically execute the following steps of generating and scoring a base classifier combination according to a specified number of rounds:
generating a plurality of base classifier combinations, wherein each base classifier combination is obtained by assigning a weight to each base classifier through an evolutionary algorithm that uses the previous round's base classifier combinations, their scores, and the weights assigned to the base classifiers they include, and then selecting at least one base classifier from each group of base classifiers in descending order of weight;
predicting the data in the verification data set by using the base classifier combination and the weight given to the base classifier included in the base classifier combination, and calculating the score of each base classifier combination according to the prediction result;
the classification module 304 is configured to determine a combination of the base classifiers with the highest scores in all the rounds, and obtain the selective integration heterogeneous model based on the combination of the base classifiers with the highest scores in combination with weights corresponding to the base classifiers included in the combination of the base classifiers, so as to perform classification prediction.
The classification apparatus for selectively integrating heterogeneous models provided by one or more embodiments of the present specification implements the selective integration of heterogeneous models: heterogeneous base classifiers are included in the ensemble learning, each type of base classifier is trained under different parameter combinations in the learning stage to obtain a plurality of models, and in the selection stage one or more of these models are selected as components of the final model. In this way, the strengths of the different models are fully utilized and complement one another, improving the robustness and effectiveness of the overall model.
Optionally, the base classifier combination generation and scoring module is configured to:
generating a plurality of base classifier combinations, wherein each base classifier combination is obtained by assigning each base classifier a randomly generated weight and selecting at least one base classifier from each group of base classifiers in descending order of weight;
and predicting the data in the verification data set by using the base classifier combination and the weight given to the base classifier included in the base classifier combination, and calculating the score of each base classifier combination according to the prediction result.
Optionally, the evolutionary algorithm employs at least one of a genetic algorithm, genetic programming, an evolution strategy, and evolutionary programming.
Optionally, the data in the training data set and the validation data set are both provided with classification labels.
Optionally, the base classifier comprises at least one of a logistic regression model, a support vector machine model, a decision tree model, a gradient boosting decision tree model, a random forest model, and a neural network model.
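As an illustration of the learning stage that produces these groups of candidate models, the following sketch trains three assumed model types under several parameter combinations each; the chosen types and parameter grids are examples only, not a prescription of this apparatus.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

def train_candidate_pool(X_train, y_train):
    """Train n groups of k candidate models (here n = 3 groups of k = 3),
    returning a flat, group-major list aligned with the weight vectors."""
    grids = [
        (LogisticRegression, [{"C": c, "max_iter": 1000} for c in (0.1, 1.0, 10.0)]),
        (RandomForestClassifier, [{"n_estimators": n} for n in (50, 100, 200)]),
        (GradientBoostingClassifier, [{"learning_rate": r} for r in (0.05, 0.1, 0.2)]),
    ]
    pool = []
    for model_cls, params in grids:
        for p in params:
            pool.append(model_cls(**p).fit(X_train, y_train))
    return pool
```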
Optionally, the classification apparatus of the selectively integrated heterogeneous model is used for classifying user properties. The training data set and the verification data set include at least one of user basic information, user dynamic information, and user relationship information. The user basic information includes at least one of gender, age, and educational background; the user dynamic information includes at least one of the user's browsing records and consumption records within a preset period; and the user relationship information includes at least one of the number of friends and the friends' basic information, where a friend's basic information includes at least one of the friend's gender, age, and educational background.
For convenience of description, the above apparatus is described as being divided into various modules by function, which are described separately. Of course, when implementing one or more embodiments of the present specification, the functionality of these modules may be implemented in one or more of the same software and/or hardware components.
The apparatus of the foregoing embodiment is used to implement the corresponding method in the foregoing embodiment, and has the beneficial effects of the corresponding method embodiment, which are not described herein again.
Fig. 5 is a schematic diagram illustrating a more specific hardware structure of an electronic device according to this embodiment, where the electronic device may include: a processor 401, a memory 402, an input/output interface 403, a communication interface 404, and a bus 405. Wherein the processor 401, the memory 402, the input/output interface 403 and the communication interface 404 are communicatively connected to each other within the device by a bus 405.
The processor 401 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute relevant programs to implement the technical solutions provided by the embodiments of the present specification.
The memory 402 may be implemented in the form of a ROM (Read-Only Memory), a RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 402 may store an operating system and other application programs; when the technical solutions provided by the embodiments of the present specification are implemented by software or firmware, the relevant program code is stored in the memory 402 and is called and executed by the processor 401.
The input/output interface 403 is used for connecting an input/output module to realize information input and output. The i/o module may be configured as a component in a device (not shown) or may be external to the device to provide a corresponding function. The input devices may include a keyboard, a mouse, a touch screen, a microphone, various sensors, etc., and the output devices may include a display, a speaker, a vibrator, an indicator light, etc.
The communication interface 404 is used to connect a communication module (not shown in the figure) to implement communication interaction between the present device and other devices. The communication module can realize communication in a wired mode (such as USB, network cable and the like) and also can realize communication in a wireless mode (such as mobile network, WIFI, Bluetooth and the like).
The bus 405 includes a path that transfers information between the various components of the device, such as the processor 401, memory 402, input/output interface 403, and communication interface 404.
It should be noted that although the above-mentioned device only shows the processor 401, the memory 402, the input/output interface 403, the communication interface 404 and the bus 405, in a specific implementation, the device may also include other components necessary for normal operation. In addition, those skilled in the art will appreciate that the above-described apparatus may also include only those components necessary to implement the embodiments of the present description, and not necessarily all of the components shown in the figures.
Computer-readable media of the present embodiments, including permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Those of ordinary skill in the art will understand that: the discussion of any embodiment above is meant to be exemplary only, and is not intended to intimate that the scope of the disclosure, including the claims, is limited to these examples; within the spirit of the present disclosure, features from the above embodiments or from different embodiments may also be combined, steps may be implemented in any order, and there are many other variations of different aspects of one or more embodiments of the present description as described above, which are not provided in detail for the sake of brevity.
In addition, well-known power/ground connections to Integrated Circuit (IC) chips and other components may or may not be shown in the provided figures, for simplicity of illustration and discussion, and so as not to obscure one or more embodiments of the disclosure. Furthermore, devices may be shown in block diagram form in order to avoid obscuring the understanding of one or more embodiments of the present description, and this also takes into account the fact that specifics with respect to implementation of such block diagram devices are highly dependent upon the platform within which the one or more embodiments of the present description are to be implemented (i.e., specifics should be well within purview of one skilled in the art). Where specific details (e.g., circuits) are set forth in order to describe example embodiments of the disclosure, it should be apparent to one skilled in the art that one or more embodiments of the disclosure can be practiced without, or with variation of, these specific details. Accordingly, the description is to be regarded as illustrative instead of restrictive.
While the present disclosure has been described in conjunction with specific embodiments thereof, many alternatives, modifications, and variations of these embodiments will be apparent to those of ordinary skill in the art in light of the foregoing description. For example, other memory architectures (e.g., dynamic RAM (DRAM)) may use the discussed embodiments.
It is intended that the one or more embodiments of the present specification embrace all such alternatives, modifications and variations as fall within the broad scope of the appended claims. Therefore, any omissions, modifications, substitutions, improvements, and the like that may be made without departing from the spirit and principles of one or more embodiments of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (14)

1. An entity object classification method for selectively integrating heterogeneous models comprises the following steps:
acquiring a training data set and a verification data set, the training data set and the verification data set comprising entity object data;
training to obtain at least two groups of heterogeneous base classifiers by using the training data set;
cyclically executing, for a specified number of rounds, the following steps of generating and scoring base classifier combinations:
generating a plurality of base classifier combinations, wherein each base classifier combination is obtained by assigning a weight to each base classifier through an evolutionary algorithm that uses the previous round's base classifier combinations, their scores, and the weights assigned to the base classifiers they include, and then selecting at least one base classifier from each group of base classifiers in descending order of weight;
predicting the data in the verification data set by using the base classifier combination and the weight given to the base classifier included in the base classifier combination, and calculating the score of each base classifier combination according to the prediction result;
and determining the highest-scoring base classifier combination over all the rounds, and obtaining the selectively integrated heterogeneous model based on that combination together with the weights corresponding to the base classifiers it includes, for performing entity object classification prediction.
2. The method of claim 1, wherein the first round of the steps of generating and scoring base classifier combinations specifically comprises:
generating a plurality of base classifier combinations, wherein each base classifier combination is obtained by assigning each base classifier a randomly generated weight and selecting at least one base classifier from each group of base classifiers in descending order of weight;
and predicting the data in the verification data set by using the base classifier combination and the weight given to the base classifier included in the base classifier combination, and calculating the score of each base classifier combination according to the prediction result.
3. The method of claim 1, wherein the evolutionary algorithm employs at least one of a genetic algorithm, genetic programming, an evolution strategy, and evolutionary programming.
4. The method of claim 1, wherein the data in the training dataset and the validation dataset are each labeled with a classification.
5. The method of claim 1, wherein the base classifier comprises at least one of a logistic regression model, a support vector machine model, a decision tree model, a gradient boosting decision tree model, a random forest model, and a neural network model.
6. The method according to any one of claims 1-5, wherein the method is used for classifying user properties; the training data set and the verification data set comprise at least one of user basic information, user dynamic information, and user relationship information; the user basic information comprises at least one of gender, age, and educational background; the user dynamic information comprises at least one of the user's browsing records and consumption records within a preset period; and the user relationship information comprises at least one of the number of friends and the friends' basic information, wherein a friend's basic information comprises at least one of the friend's gender, age, and educational background.
7. An entity object classification apparatus selectively integrating heterogeneous models, comprising:
an acquisition module for acquiring a training data set and a verification data set, the training data set and the verification data set comprising entity object data;
the training module is used for training to obtain at least two groups of heterogeneous base classifiers by utilizing the training data set;
the base classifier combination generation and scoring module is used for circularly executing the following steps of generating and scoring the base classifier combination according to the specified number of rounds:
generating a plurality of base classifier combinations, wherein each base classifier combination is obtained by assigning a weight to each base classifier through an evolutionary algorithm that uses the previous round's base classifier combinations, their scores, and the weights assigned to the base classifiers they include, and then selecting at least one base classifier from each group of base classifiers in descending order of weight;
predicting the data in the verification data set by using the base classifier combination and the weight given to the base classifier included in the base classifier combination, and calculating the score of each base classifier combination according to the prediction result;
and the classification module is used for determining the highest-scoring base classifier combination over all the rounds and obtaining the selectively integrated heterogeneous model based on that combination together with the weights corresponding to the base classifiers it includes, for performing entity object classification prediction.
8. The apparatus of claim 7, wherein the base classifier combination generation and scoring module is to:
generating a plurality of base classifier combinations, wherein each base classifier combination is obtained by assigning each base classifier a randomly generated weight and selecting at least one base classifier from each group of base classifiers in descending order of weight;
and predicting the data in the verification data set by using the base classifier combination and the weight given to the base classifier included in the base classifier combination, and calculating the score of each base classifier combination according to the prediction result.
9. The apparatus of claim 7, wherein the evolutionary algorithm employs at least one of a genetic algorithm, genetic programming, an evolution strategy, and evolutionary programming.
10. The apparatus of claim 7, wherein the data in the training data set and the validation data set are each labeled with a classification.
11. The apparatus of claim 7, wherein the base classifier comprises at least one of a logistic regression model, a support vector machine model, a decision tree model, a gradient boosting decision tree model, a random forest model, and a neural network model.
12. The apparatus according to any one of claims 7-11, wherein the apparatus is configured to classify user properties; the training data set and the verification data set comprise at least one of user basic information, user dynamic information, and user relationship information; the user basic information comprises at least one of gender, age, and educational background; the user dynamic information comprises at least one of the user's browsing records and consumption records within a preset period; and the user relationship information comprises at least one of the number of friends and the friends' basic information, wherein a friend's basic information comprises at least one of the friend's gender, age, and educational background.
13. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the method of any one of claims 1 to 6 when executing the program.
14. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1 to 6.