CN112561082A - Method, device, equipment and storage medium for generating model - Google Patents

Method, device, equipment and storage medium for generating model

Info

Publication number
CN112561082A
Authority
CN
China
Prior art keywords
resource
sample
model
features
determining
Prior art date
Legal status
Pending
Application number
CN202011530270.5A
Other languages
Chinese (zh)
Inventor
刘昊骋
Current Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011530270.5A (CN112561082A)
Publication of CN112561082A
Priority to EP21180612.0A (EP3893169A3)
Priority to US17/304,686 (US20210319366A1)
Priority to JP2021105345A (JP7304384B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F 7/22 Arrangements for sorting or merging computer data on continuous record carriers, e.g. tape, drum, disc
    • G06F 7/24 Sorting, i.e. extracting data from one or more carriers, rearranging the data in numerical or other ordered sequence, and rerecording the sorted data on the original carrier or on a different carrier or set of carriers sorting methods in general
    • G06F 7/32 Merging, i.e. combining data contained in ordered sequence on at least two record carriers to produce a single carrier or set of carriers having all the original data in the ordered sequence merging methods in general
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 5/00 Computing arrangements using knowledge-based models
    • G06N 5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • G06N 7/00 Computing arrangements based on specific mathematical models
    • G06N 7/01 Probabilistic graphical models, e.g. probabilistic networks
    • G06N 20/00 Machine learning
    • G06N 20/20 Ensemble learning


Abstract

The application discloses a method, a device, equipment and a storage medium for generating a model, and relates to the technical field of artificial intelligence such as machine learning and big data processing. One embodiment of the method comprises: acquiring sample resource characteristics and sample labels corresponding to the sample resource characteristics; determining a first screening factor according to the sample resource characteristics and the sample labels, and determining first resource characteristics from the sample resource characteristics according to the first screening factor; determining a second screening factor according to parameters associated with a pre-trained logistic regression LR model, determining second resource features from the first resource features based on the second screening factor, and obtaining the features of the target model based on the second resource features; and taking the features of the target model as the input of the target model, taking the sample labels corresponding to the features of the target model as the output of the target model, and training the machine learning model to obtain the target model. This application can reduce the consumption of time and manpower.

Description

Method, device, equipment and storage medium for generating model
Technical Field
The embodiment of the application relates to the technical field of computers, in particular to the technical field of artificial intelligence such as machine learning and big data processing, and particularly relates to a method, a device, equipment and a storage medium for generating a model.
Background
In recent years, as the most basic algorithm of machine learning, a Logistic Regression (LR) model plays an important role in establishing a target model.
At present, the in-model features of a target model are screened with an LR model through feature engineering and feature screening, and the target model is then trained and generated from these in-model features.
Disclosure of Invention
The embodiment of the application provides a method, a device, equipment and a storage medium for generating a model.
In a first aspect, an embodiment of the present application provides a method for generating a model, including: acquiring sample resource characteristics and sample labels corresponding to the sample resource characteristics; determining a first screening factor according to the sample resource characteristics and the sample label, and determining first resource characteristics from the sample resource characteristics according to the first screening factor; determining a second screening factor according to parameters associated with a pre-trained logistic regression LR model, determining a second resource feature from the first resource feature based on the second screening factor, and obtaining the feature of the target model based on the second resource feature; and taking the characteristics of the target model as the input of the target model, taking the sample label corresponding to the characteristics of the target model as the output of the target model, and training the machine learning model to obtain the target model.
In a second aspect, an embodiment of the present application provides an apparatus for generating a model, including: a sample acquisition module configured to acquire sample resource features and sample labels corresponding to the sample resource features; a first determination module configured to determine a first screening factor based on the sample resource characteristics and the sample tags, and determine a first resource characteristic from the sample resource characteristics based on the first screening factor; a second determination module configured to determine a second screening factor according to a parameter associated with the pre-trained logistic regression LR model, and determine a second resource feature from the first resource features based on the second screening factor, obtain a feature of the target model based on the second resource feature; and the model training module is configured to train the machine learning model by taking the characteristics of the target model as the input of the target model and taking the sample labels corresponding to the characteristics of the target model as the output of the target model to obtain the target model.
In a third aspect, an embodiment of the present application provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method as described in the first aspect.
In a fourth aspect, embodiments of the present application propose a non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method as described in the first aspect.
In a fifth aspect, embodiments of the present application propose a computer program product comprising a computer program that, when executed by a processor, implements the method as described in the first aspect.
According to the method, the device, the equipment and the storage medium for generating the model, sample resource features and the sample labels corresponding to them are first obtained; a first screening factor is then determined according to the sample resource features and the sample labels, and first resource features are determined from the sample resource features according to the first screening factor; a second screening factor is then determined according to parameters associated with a pre-trained logistic regression LR model, second resource features are determined from the first resource features based on the second screening factor, and the features of the target model are obtained based on the second resource features; finally, the machine learning model is trained with the features of the target model as its input and the sample labels corresponding to those features as its expected output, to obtain the target model. The method thus avoids the large amount of feature engineering, feature screening and model interpretability work that would otherwise be required when determining the in-model features of the target model with an LR model alone, thereby reducing the consumption of time and labor.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings. The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is an exemplary system architecture to which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a method of generating a model according to the present application;
FIG. 3 is a flow diagram of another embodiment of a method of generating a model according to the present application;
FIG. 4 is a flow diagram of yet another embodiment of a method of generating a model according to the present application;
FIG. 5 is a scene diagram of a method of generating a model that may implement an embodiment of the present application;
FIG. 6 is a schematic block diagram of one embodiment of an apparatus for generating a model according to the present application;
FIG. 7 is a block diagram of an electronic device for implementing a method of generating a model according to an embodiment of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of those embodiments to aid understanding, and these details are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
It should be noted that the embodiments and the features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments and the attached drawings.
FIG. 1 illustrates an exemplary system architecture 100 to which embodiments of the present method of generating a model or apparatus for generating a model may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. Various applications, such as various client applications, multi-party interactive applications, artificial intelligence applications, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be various electronic devices that support document processing applications, including but not limited to smart terminals, tablets, laptop and desktop computers, and the like. When the terminal devices 101, 102, 103 are software, they can be installed in the electronic devices listed above, and they may be implemented, for example, as multiple pieces of software or software modules to provide distributed services, or as a single piece of software or software module. This is not particularly limited herein.
The server 105 may be a server providing various services, such as a background server providing support for the terminal devices 101, 102, 103. The background server can analyze and process the received data such as the request and feed back the processing result to the terminal equipment.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules, for example, to provide distributed services, or as a single piece of software or software module. This is not particularly limited herein.
In practice, the method for generating a model provided in the embodiments of the present application may be performed by the terminal devices 101, 102, 103 or the server 105, and the apparatus for generating a model may also be disposed in the terminal devices 101, 102, 103 or the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of a method of generating a model according to the present application is shown. The method for generating the model comprises the following steps:
step 201, obtaining sample resource characteristics and sample labels corresponding to the sample resource characteristics.
In this embodiment, an executing entity (e.g., the server 105 shown in fig. 1) of the method for generating a model may obtain a plurality of sample resource features and a plurality of sample tags corresponding to the sample resource features from a local or terminal device (e.g., the terminal devices 101, 102, 103 shown in fig. 1), where each sample resource feature corresponds to one sample tag. The sample resource characteristics may be resource-related characteristics. For example: resource exchange frequency characteristics and transaction resource characteristics; resource usage (number of uses of a given resource/number of total receptions of a given resource); payment characteristics such as payment amount characteristics, payment frequency and the like extracted through the payment information; user preference characteristics for resources, user operation characteristics for resources (clicking, attention, purchasing, etc.), user information characteristics (age, gender, etc.).
The sample labels can be obtained through manual labeling, rule labeling or clustering labeling. Rule labeling means labeling part of the data by setting some filtering conditions (rules) to obtain a training set. Clustering labeling means obtaining different types of labels through a clustering method after feature engineering.
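As a minimal, non-authoritative sketch of the rule-labeling and clustering-labeling routes (the column names, rule threshold and number of clusters below are illustrative assumptions, not taken from the disclosure):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical sample resource features; the column names are illustrative only.
samples = pd.DataFrame({
    "payment_amount":    [120.0, 5.0, 300.0, 8.0, 260.0],
    "payment_frequency": [10, 1, 25, 2, 18],
})

# Rule labeling: a filtering condition (rule) labels part of the data to form a training set.
rule_labels = (samples["payment_amount"] > 100).astype(int)

# Clustering labeling: after feature engineering, a clustering method yields different label groups.
cluster_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(samples)
```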
It should be noted that the present embodiment may be applied to binary classification scenarios. For example, whether a loan can be granted may be determined by analyzing the user's income, deposit, occupation, age and the like; whether a mail is junk mail may be judged according to the mail content; and a product or advertisement may be recommended to a user based on the user's preference for the product or advertisement, the user's operations on the product or advertisement, user information, and the like.
Step 202, determining a first screening factor according to the sample resource characteristics and the sample label, and determining the first resource characteristics from the sample resource characteristics according to the first screening factor.
In this embodiment, the execution subject may match a corresponding screening factor according to the sample resource features and/or the sample labels; alternatively, the first screening factor may be determined from parameters associated with training a model on the sample resource features and the sample labels. The first screening factor may be used to screen first resource features from a plurality of sample resource features.
Before matching the corresponding screening factor according to the sample resource features and/or the sample labels, the method for generating the model may further include: pre-establishing a correspondence between sample resource features and/or sample labels and screening factors.
In this embodiment, the executing entity may screen the first resource feature from the plurality of sample resource features according to the first screening factor; alternatively, the execution subject may sort the plurality of sample resource features, and then screen out the first resource feature from a preset number of sample resource features in the sorted plurality of sample resource features.
Step 203, determining a second screening factor according to parameters associated with a pre-trained logistic regression LR model; and determining a second resource characteristic from the first resource characteristic based on the second screening factor, and obtaining the characteristic of the target model based on the second resource characteristic.
In this embodiment, the executing entity may determine the second filtering factor according to a parameter associated with a pre-trained logistic regression LR model; and determining a second resource characteristic from the first resource characteristic based on the second screening factor, and obtaining the characteristic of the target model based on the second resource characteristic. The parameters associated with the pre-trained LR model may be parameters involved in the pre-trained LR model process, such as variable coefficients, information values, model component stability evaluation indexes, and variance expansion coefficients.
In this embodiment, the execution subject may screen the second resource feature from the first resource features according to the second screening factor. The second filtering factor may be used to filter the second resource characteristic from the first resource characteristic.
In this embodiment, the executing entity may use the second resource characteristic as a characteristic of the target model; or processing the second resource characteristics to obtain processed resource characteristics; then, the processed resource features are used as features of the target model, for example, the second resource features are subjected to binning processing to obtain binned features.
Step 204, taking the characteristics of the target model as the input of the target model, taking the sample label corresponding to the characteristics of the target model as the output of the target model, and training the machine learning model to obtain the target model.
In this embodiment, after obtaining the features of the target model and the sample labels corresponding to those features, the executing entity may train the machine learning model using them to obtain the target model. During training, the executing entity may use the features of the target model as inputs and the sample labels corresponding to those features as expected outputs. The machine learning model may be a probability model, a classification model, or another existing or future classifier; for example, it may include any one of the following: a decision tree model (XGBoost), a logistic regression model (LR), a deep neural network model (DNN), or a gradient boosting decision tree model (GBDT).
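Purely as an illustrative sketch of this training step, assuming for concreteness a scikit-learn gradient boosting (GBDT) classifier as the machine learning model; the variable names and placeholder data are hypothetical, not part of the original disclosure:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Features of the target model (inputs) and their corresponding sample labels (expected outputs).
target_features = np.random.rand(100, 5)           # placeholder in-model features
target_labels = np.random.randint(0, 2, size=100)  # placeholder binary sample labels

# Train the machine learning model with the features as input and the labels as expected output.
target_model = GradientBoostingClassifier().fit(target_features, target_labels)
predicted = target_model.predict_proba(target_features)[:, 1]  # probability of the positive class
```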
According to the method for generating a model, sample resource features and the sample labels corresponding to them are first obtained; a first screening factor is then determined according to the sample resource features and the sample labels, and first resource features are determined from the sample resource features according to the first screening factor; a second screening factor is then determined according to parameters associated with a pre-trained logistic regression LR model, second resource features are determined from the first resource features based on the second screening factor, and the features of the target model are obtained based on the second resource features; finally, the machine learning model is trained with the features of the target model as its input and the sample labels corresponding to those features as its expected output, to obtain the target model. The method thus avoids the large amount of feature engineering, feature screening and model interpretability work that would otherwise be required when determining the in-model features of the target model with an LR model alone, thereby reducing the consumption of time and labor.
In some optional implementations of this embodiment, determining the first screening factor according to the sample resource feature and the sample label includes: training the decision tree XGboost model based on the sample resource characteristics and the sample labels corresponding to the sample resource characteristics to obtain the XGboost model; a first screening factor is determined based on a parameter associated with the XGBoost model.
In this implementation manner, the execution subject may determine the first screening factor according to the sample resource feature and a parameter involved in training a decision tree (XGBoost) model by the sample label. The parameters associated with the XGBoost model may be parameters involved in training the XGBoost model, such as coverage, correlation coefficients, and the like.
In this implementation, after obtaining the sample resource features and the sample labels corresponding to them, the execution subject may train the XGBoost model using these features and labels. During training, the execution subject may use the sample resource features as inputs of the XGBoost model and the sample labels corresponding to those features as expected outputs to obtain the XGBoost model.
In this implementation, the first screening factor is determined from the parameters associated with training the XGBoost model on the sample resource features and the sample labels.
In some optional implementations of the present embodiment, the parameters associated with the pre-trained XGBoost model include: coverage and correlation coefficient; and determining a first screening factor based on parameters associated with the pre-trained XGBoost model, comprising: a first screening factor is determined based on the coverage and/or correlation coefficient.
In this implementation manner, the execution subject may determine a first filtering factor according to the coverage (coverage) and the correlation coefficient; or, the executing body can determine a first screening factor according to the coverage; alternatively, the execution subject may determine the first filtering factor based on the correlation coefficient. The coverage is (number of samples-number of samples with missing features)/number of samples, the number of samples may be the number of all samples involved in the process of training the XGBoost model, and the number of samples with missing features may be the number of samples with missing features in all samples. The correlation may be a correlation coefficient of the sample resource feature and the corresponding sample label.
It should be noted that, when determining the first filtering factor based on the coverage and the correlation coefficient, the user may also set weights corresponding to the coverage and the correlation coefficient according to the feature filtering requirement, and then perform weighted summation to obtain the first filtering factor.
In this implementation, the determination of the first screening factor is achieved by means of a coverage (coverage) and/or a correlation coefficient (cor).
In some optional implementations of this embodiment, the method for generating a model further includes: sequencing the sample resource characteristics to obtain sequenced sample resource characteristics; determining a first resource feature from the sample resource features, comprising: and determining a first resource characteristic from a preset number of sample resource characteristics in the sorted sample resource characteristics according to the first screening factor.
In this implementation, the executing entity may also rank the plurality of sample resource features, for example by feature importance, before determining the first resource features from them; the first resource features are then determined from a preset number of the ranked sample resource features. The first resource features may be a subset of the sample resource features screened out based on the first screening factor. The preset number may be set according to the performance of the target model or by the user, for example the top 10 ranked sample resource features.
In this implementation, the first resource feature may be determined from a preset number of sample resource features in the sorted sample resource features based on the first filtering factor.
In some optional implementation manners of this embodiment, the sorting the sample resource features to obtain sorted sample resource features includes: and sequencing the sample resource features according to the feature importance of the sample resource features to obtain the sequenced sample resource features.
In this implementation manner, the feature importance of each sample resource feature in the plurality of sample resource features may be calculated first, and then the plurality of sample resource features may be ranked based on the feature importance of each sample resource feature. The feature importance may be calculated from weight gain.
In one specific example, a first resource feature is screened from a plurality of sample resource features according to weight gain >10, coverage > 5%, and cor < 0.7.
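A minimal sketch of this first screening stage, assuming xgboost and pandas are available and reusing the example thresholds above; the function and variable names are illustrative and not part of the original disclosure:

```python
import pandas as pd
import xgboost as xgb

def screen_first_resource_features(X: pd.DataFrame, y: pd.Series, top_n: int = 10) -> list:
    """Screen first resource features from the sample resource features with an XGBoost model."""
    model = xgb.XGBClassifier(n_estimators=100, max_depth=4)
    model.fit(X, y)

    # Feature importance measured by gain ("weight gain" in the text above).
    gain = model.get_booster().get_score(importance_type="gain")

    # Coverage = (number of samples - number of samples with the feature missing) / number of samples.
    coverage = 1.0 - X.isna().mean()

    # Correlation coefficient between each sample resource feature and the sample label.
    cor = X.apply(lambda col: col.corr(y)).abs()

    candidates = [
        f for f in X.columns
        if gain.get(f, 0.0) > 10 and coverage[f] > 0.05 and cor[f] < 0.7
    ]
    # Sort by feature importance and keep a preset number of top-ranked features.
    candidates.sort(key=lambda f: gain.get(f, 0.0), reverse=True)
    return candidates[:top_n]
```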
In the implementation manner, the sample resource features are sorted according to the feature importance of the sample resource features, and a preset number of sample resource features in the sorted sample resource features are used as candidate features of the first resource features.
In some optional implementations of this embodiment, before determining the second screening factor according to a parameter associated with the pre-trained logistic regression LR model, the method for generating a model further includes: and training the logistic regression LR model based on the first resource characteristics and the sample labels corresponding to the first resource characteristics to obtain the LR model.
In this implementation, after obtaining the first resource feature and the sample label corresponding to the first resource feature, the executing entity may train the LR model using the first resource feature and the sample label corresponding to the first resource feature to obtain the LR model. During training, the executive agent may take the first resource feature as an input of the LR model and a sample label corresponding to the first resource feature corresponding to the input as an expected output, resulting in the LR model.
In this implementation, the LR model may be obtained through training of the first resource feature and the sample label corresponding to the first resource feature.
In some optional implementations of this embodiment, the parameters associated with the LR model include at least one of: a variable coefficient (coef), a P value, an Information Value (IV), a Population Stability Index (PSI, an index for evaluating model stability), and a Variance Inflation Factor (VIF), wherein the P value is a parameter for judging the test result of the pre-trained LR model.
In this implementation, the executing entity may determine the second screening factor according to parameters associated with the LR model including at least one of: a variable coefficient (coef), a P value, an Information Value (IV), a Population Stability Index (PSI) and a Variance Inflation Factor (VIF), wherein the P value is a parameter for judging the test result of the pre-trained LR model. The PSI can be used to measure the difference between the score distributions of the test samples and the model development samples. The VIF can be used to measure the severity of multicollinearity in a multivariate linear LR model. The IV can be used to measure the predictive power of an independent variable; the screening range of the IV can be based on experience or set by the user as needed.
In one specific example, the second screening factor is determined according to coef ≠ 0, P value < 0.05, IV > 0.5, PSI < 0.05, and VIF < 5.
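The following is a hedged sketch of this second screening stage using statsmodels; the IV and PSI computations are deliberately simplified (equal-frequency bins, binary 0/1 labels assumed), the thresholds are those of the example above, and none of the helper names come from the original disclosure:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

def information_value(feature: pd.Series, label: pd.Series, bins: int = 10) -> float:
    """Simplified IV: equal-frequency bins, small constant avoids log(0)."""
    binned = pd.qcut(feature, q=bins, duplicates="drop")
    tab = pd.crosstab(binned, label)
    good = tab[0] / tab[0].sum() + 1e-6
    bad = tab[1] / tab[1].sum() + 1e-6
    return float(((good - bad) * np.log(good / bad)).sum())

def population_stability_index(expected: pd.Series, actual: pd.Series, bins: int = 10) -> float:
    """Simplified PSI between development-sample and test-sample distributions of one feature."""
    cuts = np.unique(np.quantile(expected.dropna(), np.linspace(0, 1, bins + 1)))
    e = pd.cut(expected, cuts, include_lowest=True).value_counts(normalize=True) + 1e-6
    a = pd.cut(actual, cuts, include_lowest=True).value_counts(normalize=True) + 1e-6
    a = a.reindex(e.index).fillna(1e-6)
    return float(((e - a) * np.log(e / a)).sum())

def screen_second_resource_features(X_dev: pd.DataFrame, y_dev: pd.Series, X_test: pd.DataFrame) -> list:
    exog = sm.add_constant(X_dev)
    lr = sm.Logit(y_dev, exog).fit(disp=0)   # the pre-trained LR model
    selected = []
    for i, f in enumerate(X_dev.columns):
        coef = lr.params[f]
        p_value = lr.pvalues[f]
        iv = information_value(X_dev[f], y_dev)
        psi = population_stability_index(X_dev[f], X_test[f])
        vif = variance_inflation_factor(exog.values, i + 1)  # +1 skips the constant column
        if coef != 0 and p_value < 0.05 and iv > 0.5 and psi < 0.05 and vif < 5:
            selected.append(f)
    return selected
```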
In this implementation, the determination of the second screening factor may be implemented based on a parameter associated with the LR model.
In some optional implementations of this embodiment, the method for generating a model further includes: adjusting a hyper-parameter of the target model according to one of: grid search, random search and Bayesian optimization.
The hyper-parameters may be parameters set before the target model is trained based on the machine learning model, and the hyper-parameters are not parameters obtained in the process of training the target model based on the machine learning model.
In the implementation mode, the hyper-parameters are optimized through grid search, random search or Bayesian optimization, and a group of optimal hyper-parameters is selected to improve the iteration efficiency of the target model.
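As a brief illustration (not the patent's prescribed implementation), grid search and random search over a hypothetical hyper-parameter space can be written with scikit-learn as follows; Bayesian optimization would typically rely on an additional library such as scikit-optimize or Optuna:

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Hypothetical hyper-parameter space; these values are set before training, not learned during it.
param_grid = {"n_estimators": [100, 200], "max_depth": [3, 5], "learning_rate": [0.05, 0.1]}

# Grid search: exhaustively evaluates every combination in the grid.
grid_search = GridSearchCV(GradientBoostingClassifier(), param_grid, cv=5, scoring="roc_auc")

# Random search: samples a fixed number of combinations from the same space.
random_search = RandomizedSearchCV(GradientBoostingClassifier(), param_grid, n_iter=4,
                                   cv=5, scoring="roc_auc", random_state=0)

# After grid_search.fit(features, labels), grid_search.best_params_ holds the selected hyper-parameters.
```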
In some optional implementations of this embodiment, the sample resource characteristic includes one of: sample image features, sample text features, sample speech features.
In this implementation, the sample resource features may include any one of sample image features, sample text features, or sample speech features. The sample image features may be sample resource features presented in the form of images. The sample text features may be sample resource features presented in text form. The sample speech features may be sample resource features presented in speech form.
In the implementation mode, the corresponding sample resource characteristics can be obtained from the angles of images, texts, voices and the like, so that the obtained target model can accurately predict the sample image characteristics, the sample text characteristics and the sample voice characteristics.
Referring to FIG. 3, a flow 300 of another embodiment of a method of generating a model according to the present application is shown. The method for generating the model comprises the following steps:
step 301, obtaining sample resource characteristics and sample labels corresponding to the sample resource characteristics.
In this embodiment, the specific operation of step 301 has been described in detail in step 201 in the embodiment shown in fig. 2, and is not described herein again.
Step 302, training the decision tree XGboost model based on the sample resource characteristics and the sample labels corresponding to the sample resource characteristics to obtain the XGboost model.
In this embodiment, after obtaining the sample resource features and the sample labels corresponding to them, an executing entity (e.g., the server 105 in fig. 1) of the method for generating the model may train the XGBoost model using these features and labels. During training, the executing entity may use the sample resource features as inputs of the XGBoost model and the sample labels corresponding to those features as expected outputs to obtain the XGBoost model.
Step 303, determining a first screening factor based on a parameter associated with the XGBoost model; and determining a first resource characteristic from the sample resource characteristics according to the first screening factor.
In this embodiment, the specific operation of step 303 has been described in detail in step 202 in the embodiment shown in fig. 2, and is not described herein again.
Step 304, training the logistic regression LR model based on the first resource feature and the sample label corresponding to the first resource feature to obtain an LR model.
In this embodiment, after obtaining the first resource feature and the sample label corresponding to the first resource feature, the executing entity may train the LR model by using the first resource feature and the sample label corresponding to the first resource feature, so as to obtain the LR model. During training, the executive agent may take the first resource feature as an input of the LR model and a sample label corresponding to the first resource feature corresponding to the input as an expected output, resulting in the LR model.
Step 305, determining a second screening factor based on the parameter associated with the LR model; and determining a second resource characteristic from the first resource characteristics based on the second screening factor, and obtaining the characteristics of the target model based on the second resource characteristic.
In this embodiment, the specific operation of step 305 has been described in detail in step 203 in the embodiment shown in fig. 2, and is not described herein again.
Step 306, taking the characteristics of the target model as the input of the target model, taking the sample label corresponding to the characteristics of the target model as the output of the target model, and training the machine learning model to obtain the target model.
In this embodiment, the specific operation of step 306 is described in detail in step 204 in the embodiment shown in fig. 2, and is not described herein again.
As can be seen from fig. 3, compared to the embodiment corresponding to fig. 2, the flow 300 of the method for generating a model in the present embodiment highlights the steps of determining the first screening factor and the second screening factor. Therefore, the scheme described in the embodiment improves the accuracy of the features of the target model, so that the accuracy of the target model is improved.
In some optional implementations of the present embodiment, the parameters associated with the pre-trained XGBoost model include: coverage and correlation coefficient; and determining a first screening factor based on parameters associated with the pre-trained XGBoost model, comprising: a first screening factor is determined based on the coverage and/or correlation coefficient.
In this implementation manner, the execution subject may determine a first filtering factor according to the coverage (coverage) and the correlation coefficient; or, the executing body can determine a first screening factor according to the coverage; alternatively, the execution subject may determine the first filtering factor based on the correlation coefficient. The coverage is (number of samples-number of samples with missing features)/number of samples, the number of samples may be the number of all samples involved in the process of training the XGBoost model, and the number of samples with missing features may be the number of samples with missing features in all samples. The correlation may be a correlation coefficient of the sample resource feature and the corresponding sample label.
It should be noted that, when determining the first filtering factor based on the coverage and the correlation coefficient, the user may also set weights corresponding to the coverage and the correlation coefficient according to the feature filtering requirement, and then perform weighted summation to obtain the first filtering factor.
In this implementation, the determination of the first screening factor is achieved by means of a coverage (coverage) and/or a correlation coefficient (cor).
In some optional implementations of this embodiment, the parameters associated with the LR model include at least one of: a variable coefficient (coef), a P value, an Information Value (IV), a Population Stability Index (PSI, an index for evaluating model stability), and a Variance Inflation Factor (VIF), wherein the P value is a parameter for judging the test result of the pre-trained LR model.
In this implementation, the executing entity may determine the second screening factor according to parameters associated with the LR model including at least one of: a variable coefficient (coef), a P value, an Information Value (IV), a Population Stability Index (PSI) and a Variance Inflation Factor (VIF), wherein the P value is a parameter for judging the test result of the pre-trained LR model. The PSI can be used to measure the difference between the score distributions of the test samples and the model development samples. The VIF can be used to measure the severity of multicollinearity in a multivariate linear LR model. The IV can be used to measure the predictive power of an independent variable; the screening range of the IV can be based on experience or set by the user as needed.
In one specific example, the second screening factor is determined according to coef ≠ 0, P value < 0.05, IV > 0.5, PSI < 0.05, and VIF < 5. The value ranges of coef, P value, IV, PSI and VIF may be set according to the required recognition precision of the target model to be trained.
In this implementation, the determination of the second screening factor may be implemented with a parameter associated with the LR model.
Referring to FIG. 4, a flow 400 of yet another embodiment of a method of generating a model according to the present application is shown. The method for generating the model comprises the following steps:
step 401, obtaining a sample resource feature and a sample label corresponding to the sample resource feature.
Step 402, training the decision tree XGboost model based on the sample resource characteristics and the sample labels corresponding to the sample resource characteristics to obtain the XGboost model.
Step 403, determining a first screening factor based on the parameters associated with the XGBoost model; and determining a first resource characteristic from the sample resource characteristics according to the first screening factor.
Step 404, training the logistic regression LR model based on the first resource feature and the sample label corresponding to the first resource feature to obtain an LR model.
Step 405, determining a second screening factor based on the parameter associated with the LR model; and determining a second resource characteristic from the first resource characteristics based on the second filtering factor.
In the present embodiment, the specific operations of steps 401 to 405 have been described in detail in steps 301 to 305 in the embodiment shown in fig. 3, and are not described herein again.
Step 406, performing binning on the second resource features to obtain binned resource features, and determining evidence weights corresponding to the binned resource features.
In this embodiment, an executing entity (for example, the server 105 shown in fig. 1) of the method for generating a model may bin the second resource features to obtain the binned resource features; and then, calculating the evidence weight corresponding to each binned resource feature. The binning may include one of: equal-frequency sub-boxes, equidistant sub-boxes and chi-square sub-boxes.
The above-mentioned Weight of Evidence (WOE) measures the difference between the distributions of normal (Good) samples and bad (Bad) samples, and may be determined by the following formula:
WOE = ln(Distr Good / Distr Bad)
where Distr Good is the proportion of all Good (normal) samples that fall into the bin, and Distr Bad is the proportion of all Bad samples that fall into the bin.
It should be noted that, in the process of building the model, the continuous variable (i.e. the second resource feature) needs to be discretized; model training is then carried out on the discretized features, which makes the performance of the trained model more stable and reduces its risk of overfitting.
Step 407, in response to that the change rule of the evidence weights corresponding to all the binned resource features conforms to a preset rule, taking the binned resource features as the features of the target model.
In this embodiment, when the change rule of the evidence weights corresponding to all the binned resource features conforms to the preset rule, the execution subject may use the binned resource features as the features of the target model. The change rule may be, for example, that the evidence weights corresponding to all the bins are monotonically increasing, monotonically decreasing, first increasing and then decreasing, first decreasing and then increasing, and the like.
Step 408, taking the characteristics of the target model as the input of the target model, taking the sample label corresponding to the characteristics of the target model as the output of the target model, and training the machine learning model to obtain the target model.
In this embodiment, the specific operation of step 408 has been described in detail in step 306 in the embodiment shown in fig. 3, and is not described herein again.
As can be seen from fig. 4, compared to the embodiment corresponding to fig. 3, the flow 400 of the method for generating a model in the present embodiment highlights the step of binning the second resource features. Therefore, the monotonicity of the second resource characteristic is improved by the scheme described in the embodiment, so that the performance of the target model is more stable, and the overfitting risk of the target model is reduced.
In some optional implementations of this embodiment, the method for generating a model further includes: in response to the change rule of the evidence weights corresponding to all the binned resource features not conforming to the preset rule, merging the binned resource features and calculating the evidence weights of the merged resource features. The above step of taking the binned resource features as the features of the target model in response to the change rule of the corresponding evidence weights conforming to the preset rule then includes: in response to the change rule of the evidence weights of the merged resource features conforming to the preset rule, taking the merged resource features as the features of the target model.
In this implementation, when the change rule of the evidence weights corresponding to all the binned resource features does not conform to the preset rule, the binned resource features may be merged; it is then judged whether the evidence weights corresponding to the merged resource features conform to the preset rule, and if not, the merged resource features are merged and judged again, until the change rule is met; the merged resource features are then taken as the features of the target model.
In one specific example, binning the second resource features may include the following steps: (1) equal-frequency binning, e.g. into 5-8 bins; (2) calculating the WOE of each bin; (3) if the WOE values conform to a preset rule, such as monotonically increasing, the feature encoding is finished and the resource features of the bins are taken as the features of the target model; (4) if the WOE values do not conform to the preset rule, adjacent bins are merged, for example the age groups 15-24 and 24-30 are combined into 15-30, and steps (2) and (3) are repeated.
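A simplified sketch of steps (1)-(4), assuming a numeric feature and binary 0/1 labels; the merging strategy here (dropping one cut point and re-checking) is just one possible way to merge adjacent bins, and the preset rule is taken to be monotonicity:

```python
import numpy as np
import pandas as pd

def woe_per_bin(bin_ids: pd.Series, label: pd.Series) -> pd.Series:
    """WOE = ln(Distr Good / Distr Bad) for every bin (small constant avoids log(0))."""
    tab = pd.crosstab(bin_ids, label)
    distr_good = tab[0] / tab[0].sum() + 1e-6
    distr_bad = tab[1] / tab[1].sum() + 1e-6
    return np.log(distr_good / distr_bad)

def is_monotonic(values: pd.Series) -> bool:
    return values.is_monotonic_increasing or values.is_monotonic_decreasing

def bin_with_monotonic_woe(feature: pd.Series, label: pd.Series, bins: int = 6):
    # (1) Equal-frequency binning, e.g. into 5-8 bins.
    edges = np.unique(np.quantile(feature.dropna(), np.linspace(0, 1, bins + 1)))
    while len(edges) >= 3:
        bin_ids = pd.cut(feature, edges, include_lowest=True)
        # (2) Compute the WOE of each bin.
        woe = woe_per_bin(bin_ids, label)
        # (3) If the WOE values conform to the preset rule (monotonic here), encoding is finished.
        if is_monotonic(woe):
            return bin_ids, woe
        # (4) Otherwise merge two adjacent bins by dropping a cut point and repeat (2)-(3).
        edges = np.delete(edges, len(edges) // 2)
    return pd.cut(feature, edges, include_lowest=True), None
```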
In this implementation manner, the feature of the target model may be selected by performing binning on the second resource feature and calculating an evidence weight corresponding to the binned resource feature, so that monotonicity of the feature of the target model may be improved.
In some optional implementations of this embodiment, the preset rule includes one of the following: the evidence weight is increased, the evidence weight is decreased, the evidence weight is increased and then decreased, and the evidence weight is decreased and then increased.
In this implementation, the preset rule includes any one of: the evidence weights increase, the evidence weights decrease, the evidence weights first increase and then decrease, or the evidence weights first decrease and then increase. Evidence weight increase means that the evidence weights corresponding to all the binned resource features increase sequentially.
In the implementation mode, the characteristics of the target model can be selected according to the preset rules, and the uniqueness of the characteristics of the target model is improved.
For ease of understanding, the following provides an application scenario in which the method for generating a model according to the embodiment of the present application may be implemented. As shown in fig. 5, the server 502 obtains the sample resource feature and the sample label corresponding to the sample resource feature from the terminal device 501 (step 503); then, the server 502 determines a first filtering factor according to the sample resource characteristics and the sample label, and determines a first resource characteristic from the sample resource characteristics according to the first filtering factor (step 504); then, the server 502 determines a second screening factor according to the parameters associated with the pre-trained logistic regression LR model, determines a second resource feature from the first resource features based on the second screening factor, and obtains the features of the target model based on the second resource feature (step 505); finally, the server 502 trains the machine learning model using the features of the target model as input of the target model and the sample labels corresponding to the features of the target model as output of the target model to obtain the target model (step 506).
With further reference to fig. 6, as an implementation of the method shown in the above figures, the present application provides an embodiment of an apparatus for generating a model, which corresponds to the embodiment of the method shown in fig. 2, and which can be applied in various electronic devices.
As shown in fig. 6, the apparatus 600 for generating a model according to the present embodiment may include: a sample acquisition module 601, a first determination module 602, a second determination module 603, and a model training module 604. The sample acquiring module 601 is configured to acquire a sample resource feature and a sample label corresponding to the sample resource feature; a first determining module 602 configured to determine a first filtering factor according to the sample resource characteristics and the sample label, and determine a first resource characteristic from the sample resource characteristics according to the first filtering factor; a second determining module 603 configured to determine a second filtering factor according to a parameter associated with the pre-trained logistic regression LR model, and determine a second resource feature from the first resource features based on the second filtering factor, and obtain a feature of the target model based on the second resource feature; and a model training module 604 configured to train the machine learning model to obtain the target model by using the features of the target model as the input of the target model and using the sample labels corresponding to the features of the target model as the output of the target model.
In the present embodiment, in the apparatus 600 for generating a model: the specific processing and the technical effects of the sample obtaining module 601, the first determining module 602, the second determining module 603, and the model training module 604 can refer to the related descriptions of steps 201 to 204 in the corresponding embodiment of fig. 2, and are not described herein again.
In some optional implementations of this embodiment, the first determining module 602 includes: a model obtaining unit (not shown in the figure) configured to train the decision tree XGBoost model based on the sample resource characteristics and the sample labels corresponding to the sample resource characteristics, so as to obtain the XGBoost model; a factor determination unit (not shown in the figures) configured to determine a first filtering factor based on a parameter associated with the XGBoost model.
In some optional implementations of the present embodiment, the parameters associated with the pre-trained XGBoost model include: coverage and correlation coefficient; and a factor determination unit further configured to: a first screening factor is determined based on the coverage and/or correlation coefficient.
In some optional implementations of this embodiment, the apparatus for generating a model further includes: a feature sorting module (not shown in the figure) configured to sort the sample resource features, resulting in sorted sample resource features; a first determination module 602, further configured to: and determining the first resource characteristics from the preset number of the sample resource characteristics in the sorted sample resource characteristics according to the first screening factor.
In some optional implementations of this embodiment, the feature ordering module is further configured to: and sequencing the sample resource features according to the feature importance of the sample resource features to obtain the sequenced sample resource features.
In some optional implementations of this embodiment, before determining the second filtering factor according to a parameter associated with the pre-trained logistic regression LR model, the means for generating a model further includes: a model derivation module (not shown in the figures) configured to train the logistic regression LR model based on the first resource feature and the sample label corresponding to the first resource feature, resulting in the LR model.
In some optional implementations of this embodiment, the parameters associated with the LR model include at least one of: a variable coefficient, a P value, an information value, a population stability index and a variance inflation factor, wherein the P value is a parameter for judging the test result of the pre-trained LR model.
In some optional implementations of this embodiment, the apparatus for generating a model further includes: a first processing module (not shown in the figure), configured to bin the second resource features to obtain binned resource features, and determine the evidence weights corresponding to the binned resource features; and a first taking module (not shown in the figure), configured to take the binned resource features as the features of the target model in response to the change rule of the evidence weights corresponding to all the binned resource features conforming to a preset rule.
In some optional implementations of this embodiment, the apparatus for generating a model further includes: a second processing module (not shown in the figure), configured to merge the binned resource features and calculate the evidence weights of the merged resource features in response to the change rule of the evidence weights corresponding to all the binned resource features not conforming to the preset rule; and the first taking module is further configured to: take the merged resource features as the features of the target model in response to the change rule of the evidence weights of the merged resource features conforming to the preset rule.
In some optional implementations of this embodiment, the preset rule includes one of the following: the evidence weight is increased, the evidence weight is decreased, the evidence weight is increased and then decreased, and the evidence weight is decreased and then increased.
In some optional implementations of this embodiment, the apparatus for generating a model further includes: a parameter adjustment module (not shown in the figures) configured to adjust the hyper-parameters of the target model according to one of: grid search, random search and Bayesian optimization.
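By way of example of the grid search option (random search or a Bayesian optimizer such as optuna could be substituted), a scikit-learn sketch follows; the parameter grid and scoring metric are assumptions, and an XGBoost classifier merely stands in for the target model.

    from sklearn.model_selection import GridSearchCV
    import xgboost as xgb

    def tune_target_model(X, y):
        # Adjust the hyper-parameters of the target model by grid search with cross-validation.
        param_grid = {
            "max_depth": [3, 4, 5],
            "n_estimators": [100, 200],
            "learning_rate": [0.05, 0.1],
        }
        search = GridSearchCV(xgb.XGBClassifier(), param_grid, cv=5, scoring="roc_auc")
        search.fit(X, y)
        return search.best_estimator_, search.best_params_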
In some optional implementations of this embodiment, the sample resource characteristic includes one of: sample image features, sample text features, sample speech features.
There is also provided, in accordance with an embodiment of the present application, an electronic device, a readable storage medium, and a computer program product.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701, which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
Various components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be any of various general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 executes the respective methods and processes described above, such as the method of generating a model. For example, in some embodiments, the method of generating a model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded onto and/or installed onto the device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the method of generating a model described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured by any other suitable means (e.g., by means of firmware) to perform the method of generating a model.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present application may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
Artificial intelligence is the discipline that studies how to make computers simulate certain human thought processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it spans both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing; artificial intelligence software technologies mainly include computer vision technology, speech recognition technology, natural language processing technology, machine learning/deep learning, big data processing technology, knowledge graph technology, and the like.
According to the technical solution of the present application, sample resource features and sample labels corresponding to the sample resource features are first obtained; a first screening factor is then determined according to the sample resource features and the sample labels, and first resource features are determined from the sample resource features according to the first screening factor; next, a second screening factor is determined according to parameters associated with a pre-trained logistic regression LR model, second resource features are determined from the first resource features based on the second screening factor, and the features of the target model are obtained based on the second resource features; finally, a machine learning model is trained with the features of the target model as the input of the target model and the sample labels corresponding to the features of the target model as the output of the target model, to obtain the target model. In this way, when the in-model features of the target model are determined according to the LR model, there is no need to rely on large amounts of feature engineering, feature screening, and model interpretability analysis, which reduces the time and labor consumed.
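Purely for illustration, the hypothetical helper functions sketched earlier in this description (first_screening, select_first_features, second_screening, bin_feature, follows_preset_rule, tune_target_model) can be chained into a single pipeline; every name below is an assumption, since the scheme does not prescribe a concrete API.

    def build_target_model(X, y):
        # 1. First screening: XGBoost-based screening factor over the sample resource features.
        xgb_model, factor = first_screening(X, y)
        first_features = select_first_features(xgb_model, factor, list(X.columns))

        # 2. Second screening: LR-based parameters (coefficient, P value, VIF) over the first features.
        _, second_features = second_screening(X[first_features], y)

        # 3. Binning / WOE check to obtain the features of the target model.
        model_features = [col for col in second_features
                          if follows_preset_rule(bin_feature(X[col], y)[1])]

        # 4. Train the target model on the selected features and tune its hyper-parameters.
        target_model, best_params = tune_target_model(X[model_features], y)
        return target_model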
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (27)

1. A method of generating a model, comprising:
obtaining sample resource characteristics and sample labels corresponding to the sample resource characteristics;
determining a first screening factor according to the sample resource characteristics and the sample label, and determining first resource characteristics from the sample resource characteristics according to the first screening factor;
determining a second screening factor according to parameters associated with a pre-trained logistic regression LR model, determining a second resource feature from the first resource features based on the second screening factor, and obtaining the features of the target model based on the second resource feature;
and taking the characteristics of the target model as the input of the target model, taking the sample label corresponding to the characteristics of the target model as the output of the target model, and training a machine learning model to obtain the target model.
2. The method of claim 1, wherein said determining a first screening factor from said sample resource characteristics and said sample tags comprises:
training a decision tree XGBoost model based on the sample resource characteristics and sample labels corresponding to the sample resource characteristics to obtain the XGBoost model;
determining the first screening factor based on a parameter associated with the XGBoost model.
3. The method of claim 2, wherein the parameters associated with the pre-trained XGBoost model comprise: coverage and correlation coefficient; and
determining the first screening factor based on parameters associated with the pre-trained XGBoost model, including:
determining the first screening factor according to the coverage and/or the correlation coefficient.
4. The method of any of claims 1-3, further comprising:
sorting the sample resource features to obtain sorted sample resource features;
determining a first resource feature from the sample resource features, comprising:
determining the first resource feature from a preset number of sample resource features among the sorted sample resource features according to the first screening factor.
5. The method of claim 4, wherein the sorting the sample resource features to obtain sorted sample resource features comprises:
sorting the sample resource features according to the feature importance of the sample resource features to obtain the sorted sample resource features.
6. The method of claim 1, wherein before the determining a second screening factor according to parameters associated with a pre-trained logistic regression LR model, the method further comprises:
training the logistic regression LR model based on the first resource characteristics and the sample labels corresponding to the first resource characteristics to obtain the LR model.
7. The method of claim 6, wherein the parameters associated with the LR model include at least one of:
the method comprises the following steps of variable coefficient, P value, information value, model component stability evaluation index and variance expansion coefficient, wherein the P value is a parameter for judging the testing result of the pre-trained LR model.
8. The method of claim 1, further comprising:
performing binning on the second resource features to obtain binned resource features, and determining evidence weights corresponding to the binned resource features;
and in response to the variation patterns of the evidence weights corresponding to all the binned resource features conforming to a preset rule, taking the binned resource features as the features of the target model.
9. The method of claim 8, further comprising:
in response to the variation patterns of the evidence weights corresponding to all the binned resource features not conforming to the preset rule, merging the binned resource features and calculating the evidence weights of the merged resource features;
wherein the taking, in response to the variation patterns of the evidence weights corresponding to all the binned resource features conforming to the preset rule, the binned resource features as the features of the target model comprises:
in response to the variation pattern of the evidence weights of the merged resource features conforming to the preset rule, taking the merged resource features as the features of the target model.
10. The method of claim 8 or 9, wherein the preset rule comprises one of:
the evidence weights increase, the evidence weights decrease, the evidence weights first increase and then decrease, and the evidence weights first decrease and then increase.
11. The method of claim 1, further comprising:
adjusting a hyper-parameter of the target model according to one of: grid search, random search and Bayesian optimization.
12. The method of claim 1, wherein the sample resource characteristics comprise one of: sample image features, sample text features, sample speech features.
13. An apparatus for generating a model, comprising:
a sample acquisition module configured to acquire sample resource features and sample labels corresponding to the sample resource features;
a first determination module configured to determine a first screening factor according to the sample resource features and the sample labels, and determine first resource features from the sample resource features according to the first screening factor;
a second determination module configured to determine a second screening factor according to parameters associated with a pre-trained Logistic Regression (LR) model, determine second resource features from the first resource features based on the second screening factor, and obtain features of the target model based on the second resource features;
and a model training module configured to train a machine learning model with the features of the target model as an input of the target model and the sample labels corresponding to the features of the target model as an output of the target model, to obtain the target model.
14. The apparatus of claim 13, wherein the first determination module comprises:
a model obtaining unit configured to train a decision tree XGBoost model based on the sample resource features and the sample labels corresponding to the sample resource features to obtain the XGBoost model;
a factor determination unit configured to determine the first screening factor based on a parameter associated with the XGBoost model.
15. The apparatus of claim 14, wherein the parameters associated with the pre-trained XGBoost model comprise: coverage and correlation coefficient; and
the factor determination unit is further configured to:
determine the first screening factor according to the coverage and/or the correlation coefficient.
16. The apparatus of any of claims 13-15, further comprising:
a feature sorting module configured to sort the sample resource features to obtain sorted sample resource features;
wherein the first determination module is further configured to: determine the first resource features from a preset number of the sorted sample resource features according to the first screening factor.
17. The apparatus of claim 16, wherein the feature sorting module is further configured to:
sort the sample resource features according to the feature importance of the sample resource features to obtain the sorted sample resource features.
18. The apparatus of claim 13, wherein before the determining of the second screening factor according to the parameters associated with the pre-trained logistic regression LR model, the apparatus further comprises:
a model obtaining module configured to train the Logistic Regression (LR) model based on the first resource features and the sample labels corresponding to the first resource features to obtain the LR model.
19. The apparatus of claim 18, wherein the parameters associated with the LR model include at least one of:
a variable coefficient, a P value, an information value, a model component stability evaluation index, and a variance inflation coefficient, wherein the P value is a parameter for evaluating the test result of the pre-trained LR model.
20. The apparatus of claim 13, further comprising:
a first processing module configured to bin the second resource features to obtain binned resource features, and determine evidence weights corresponding to the binned resource features;
and a first serving module configured to take the binned resource features as the features of the target model in response to the variation patterns of the evidence weights corresponding to all the binned resource features conforming to a preset rule.
21. The apparatus of claim 20, further comprising:
a second processing module configured to merge the binned resource features and calculate the evidence weights of the merged resource features in response to the variation patterns of the evidence weights corresponding to all the binned resource features not conforming to the preset rule;
wherein the first serving module is further configured to:
take the merged resource features as the features of the target model in response to the variation pattern of the evidence weights of the merged resource features conforming to the preset rule.
22. The apparatus of claim 20 or 21, wherein the preset rule comprises one of:
the evidence weights increase, the evidence weights decrease, the evidence weights first increase and then decrease, and the evidence weights first decrease and then increase.
23. The apparatus of claim 13, the apparatus further comprising:
a parameter adjustment module configured to adjust a hyper-parameter of the target model according to one of: grid search, random search and Bayesian optimization.
24. The apparatus of claim 13, wherein the sample resource characteristics comprise one of: sample image features, sample text features, sample speech features.
25. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-12.
26. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-12.
27. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-12.
CN202011530270.5A 2020-12-22 2020-12-22 Method, device, equipment and storage medium for generating model Pending CN112561082A (en)

Priority Applications (4)

Application Number Priority Date Filing Date Title
CN202011530270.5A CN112561082A (en) 2020-12-22 2020-12-22 Method, device, equipment and storage medium for generating model
EP21180612.0A EP3893169A3 (en) 2020-12-22 2021-06-21 Method, apparatus and device for generating model and storage medium
US17/304,686 US20210319366A1 (en) 2020-12-22 2021-06-24 Method, apparatus and device for generating model and storage medium
JP2021105345A JP7304384B2 (en) 2020-12-22 2021-06-25 Methods, apparatus, electronics, storage media, and computer program products for generating models

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011530270.5A CN112561082A (en) 2020-12-22 2020-12-22 Method, device, equipment and storage medium for generating model

Publications (1)

Publication Number Publication Date
CN112561082A true CN112561082A (en) 2021-03-26

Family

ID=75031481

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011530270.5A Pending CN112561082A (en) 2020-12-22 2020-12-22 Method, device, equipment and storage medium for generating model

Country Status (4)

Country Link
US (1) US20210319366A1 (en)
EP (1) EP3893169A3 (en)
JP (1) JP7304384B2 (en)
CN (1) CN112561082A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113111928A (en) * 2021-04-01 2021-07-13 中国地质大学(北京) Semi-supervised learning mineral resource quantitative prediction method based on geoscience database
CN113672783A (en) * 2021-08-11 2021-11-19 北京达佳互联信息技术有限公司 Feature processing method, model training method and media resource processing method
CN113780582A (en) * 2021-09-15 2021-12-10 杭银消费金融股份有限公司 Wind control feature screening method and system based on machine learning model
CN114511022A (en) * 2022-01-24 2022-05-17 百度在线网络技术(北京)有限公司 Feature screening, behavior recognition model training and abnormal behavior recognition method and device

Families Citing this family (3)

Publication number Priority date Publication date Assignee Title
CN114491416A (en) * 2022-02-23 2022-05-13 北京百度网讯科技有限公司 Characteristic information processing method and device, electronic equipment and storage medium
CN114936205A (en) * 2022-06-02 2022-08-23 江苏品生医疗科技集团有限公司 Feature screening method and device, storage medium and electronic equipment
CN117112445B (en) * 2023-10-07 2024-01-16 太平金融科技服务(上海)有限公司 Machine learning model stability detection method, device, equipment and medium

Citations (7)

Publication number Priority date Publication date Assignee Title
CN104050556A (en) * 2014-05-27 2014-09-17 哈尔滨理工大学 Feature selection method and detection method of junk mails
CN109033833A (en) * 2018-07-13 2018-12-18 北京理工大学 A kind of malicious code classification method based on multiple features and feature selecting
CN110175644A (en) * 2019-05-27 2019-08-27 恒安嘉新(北京)科技股份公司 Feature selection approach, device, electronic equipment and storage medium
CA3050951A1 (en) * 2019-06-21 2019-10-11 Inspectorio Inc. Factory risk estimation using historical inspection data
CN110543946A (en) * 2018-05-29 2019-12-06 百度在线网络技术(北京)有限公司 method and apparatus for training a model
CN110782277A (en) * 2019-10-12 2020-02-11 上海陆家嘴国际金融资产交易市场股份有限公司 Resource processing method, resource processing device, computer equipment and storage medium
CN111783843A (en) * 2020-06-10 2020-10-16 苏宁金融科技(南京)有限公司 Feature selection method and device and computer system

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
CN105786860B (en) 2014-12-23 2020-07-07 华为技术有限公司 Data processing method and device in data modeling
JP7243084B2 (en) 2018-08-31 2023-03-22 株式会社リコー LEARNING METHOD, PROGRAM AND LEARNING DEVICE
JP6892424B2 (en) 2018-10-09 2021-06-23 株式会社Preferred Networks Hyperparameter tuning methods, devices and programs

Patent Citations (7)

Publication number Priority date Publication date Assignee Title
CN104050556A (en) * 2014-05-27 2014-09-17 哈尔滨理工大学 Feature selection method and detection method of junk mails
CN110543946A (en) * 2018-05-29 2019-12-06 百度在线网络技术(北京)有限公司 method and apparatus for training a model
CN109033833A (en) * 2018-07-13 2018-12-18 北京理工大学 A kind of malicious code classification method based on multiple features and feature selecting
CN110175644A (en) * 2019-05-27 2019-08-27 恒安嘉新(北京)科技股份公司 Feature selection approach, device, electronic equipment and storage medium
CA3050951A1 (en) * 2019-06-21 2019-10-11 Inspectorio Inc. Factory risk estimation using historical inspection data
CN110782277A (en) * 2019-10-12 2020-02-11 上海陆家嘴国际金融资产交易市场股份有限公司 Resource processing method, resource processing device, computer equipment and storage medium
CN111783843A (en) * 2020-06-10 2020-10-16 苏宁金融科技(南京)有限公司 Feature selection method and device and computer system

Non-Patent Citations (3)

Title
He Long: "Understanding XGBoost in Depth: Efficient Machine Learning Algorithms and Advanced Topics" (深入理解XGBoost 高效机器学习算法与进阶), China Machine Press (机械工业出版社), 31 May 2020, pages 138-140 *
Shang Tao: "SAS Data Mining and Analysis Project Practice" (SAS数据挖掘与分析项目实践), China Railway Publishing House (中国铁道出版社), 31 August 2020, pages 247-249 *
Xie Yunming: "Diagnostic Value of Low-Dose CT-Based Radiomics for Preoperative Lymph Node Metastasis in Lung Cancer" (基于低剂量CT的影像组学对肺癌术前淋巴结转移的诊断价值), China Master's Theses Full-text Database, Medicine and Health Sciences, vol. 2020, no. 08, 15 August 2020 (2020-08-15), page 10 *

Cited By (6)

Publication number Priority date Publication date Assignee Title
CN113111928A (en) * 2021-04-01 2021-07-13 中国地质大学(北京) Semi-supervised learning mineral resource quantitative prediction method based on geoscience database
CN113111928B (en) * 2021-04-01 2023-12-29 中国地质大学(北京) Semi-supervised learning mineral resource quantitative prediction method based on geometrics database
CN113672783A (en) * 2021-08-11 2021-11-19 北京达佳互联信息技术有限公司 Feature processing method, model training method and media resource processing method
CN113672783B (en) * 2021-08-11 2023-07-11 北京达佳互联信息技术有限公司 Feature processing method, model training method and media resource processing method
CN113780582A (en) * 2021-09-15 2021-12-10 杭银消费金融股份有限公司 Wind control feature screening method and system based on machine learning model
CN114511022A (en) * 2022-01-24 2022-05-17 百度在线网络技术(北京)有限公司 Feature screening, behavior recognition model training and abnormal behavior recognition method and device

Also Published As

Publication number Publication date
EP3893169A2 (en) 2021-10-13
JP7304384B2 (en) 2023-07-06
JP2022033695A (en) 2022-03-02
US20210319366A1 (en) 2021-10-14
EP3893169A3 (en) 2021-12-29

Similar Documents

Publication Publication Date Title
CN112561082A (en) Method, device, equipment and storage medium for generating model
CN112561077A (en) Training method and device of multi-task model and electronic equipment
CN115082920B (en) Deep learning model training method, image processing method and device
CN113379059B (en) Model training method for quantum data classification and quantum data classification method
CN113743971A (en) Data processing method and device
CN112785005A (en) Multi-target task assistant decision-making method and device, computer equipment and medium
CN113392920B (en) Method, apparatus, device, medium, and program product for generating cheating prediction model
CN116578925B (en) Behavior prediction method, device and storage medium based on feature images
CN112231299A (en) Method and device for dynamically adjusting feature library
CN115630708A (en) Model updating method and device, electronic equipment, storage medium and product
CN113961765B (en) Searching method, searching device, searching equipment and searching medium based on neural network model
CN115759283A (en) Model interpretation method and device, electronic equipment and storage medium
CN113051911B (en) Method, apparatus, device, medium and program product for extracting sensitive words
CN114943563A (en) Rights and interests pushing method and device, computer equipment and storage medium
CN114610953A (en) Data classification method, device, equipment and storage medium
CN113449778A (en) Model training method for quantum data classification and quantum data classification method
CN113326885A (en) Method and device for training classification model and data classification
CN113807391A (en) Task model training method and device, electronic equipment and storage medium
CN112070530A (en) Online evaluation method and related device of advertisement prediction model
CN112906723A (en) Feature selection method and device
CN116823407B (en) Product information pushing method, device, electronic equipment and computer readable medium
US20230140148A1 (en) Methods for community search, electronic device and storage medium
CN114037058B (en) Pre-training model generation method and device, electronic equipment and storage medium
CN114037057B (en) Pre-training model generation method and device, electronic equipment and storage medium
CN116188063A (en) Guest group creation method, apparatus, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination