CN113077016A - Redundant feature detection method, detection device, electronic apparatus, and medium - Google Patents

Info

Publication number
CN113077016A
CN113077016A
Authority
CN
China
Prior art keywords
model
feature
features
redundant
feature set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110492602.3A
Other languages
Chinese (zh)
Inventor
李策
孔繁爽
曹帅毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202110492602.3A
Publication of CN113077016A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/243: Classification techniques relating to the number of classes
    • G06F18/24323: Tree-organised classifiers
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00: Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/06: Asset management; Financial planning or analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Finance (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Development Economics (AREA)
  • Accounting & Taxation (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Human Resources & Organizations (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Game Theory and Decision Science (AREA)
  • Computing Systems (AREA)
  • Operations Research (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The disclosure provides a redundant feature detection method for a model, relating to the fields of artificial intelligence and finance. The redundant feature detection method comprises the following steps: obtaining a first feature set and a first index of a model to be evaluated, wherein the model to be evaluated comprises a model with first parameter information obtained after an initial model is trained on a training set using the first feature set and a first machine learning algorithm; removing M features from the first feature set to obtain a second feature set; obtaining a reconstructed model, wherein the reconstructed model comprises a model with second parameter information obtained after the initial model is trained on the training set using the second feature set and the first machine learning algorithm; and, when the ratio of a second index of the reconstructed model to the first index is greater than or equal to a first preset threshold, determining that the M features are redundant features. The present disclosure also provides a redundant feature detection apparatus, an electronic device, and a storage medium.

Description

Redundant feature detection method, detection device, electronic apparatus, and medium
Technical Field
The present disclosure relates to the field of artificial intelligence and the field of finance, and more particularly, to a method and apparatus for detecting redundant features of a model, an electronic device, and a storage medium.
Background
Due to the unique advantages of machine learning models in fields such as risk prevention and control and intelligent marketing, more and more enterprises adopt them in daily production. For example, in banking, machine learning models can be applied to credit evaluation, risk review, intelligent marketing, and other businesses. When a machine learning model is built, a series of features related to the prediction target need to be selected, such as financial features, credit features, and behavioral features. The machine learning model is then trained so that it learns the decision rules embedded in the features, enabling it to predict the target once training is complete. At present, trained models are mainly assessed with various evaluation indexes, and a model is considered effective if its indexes are good.
In the course of implementing the disclosed concept, the inventors found that there are at least the following problems in the prior art:
because the relevant evaluation indexes cannot reflect whether the model to be evaluated has introduced redundant features, and there is currently no effective method for detecting redundant features in a machine learning model after training is finished, the model carries an overfitting risk and may waste computation on processing the redundant features.
Disclosure of Invention
In view of the above, the embodiments of the present disclosure provide a method, an apparatus, an electronic device, and a storage medium capable of performing redundant feature detection on a machine learning model.
One aspect of the disclosed embodiments provides a method for redundant feature detection of a model. The method comprises the following steps: obtaining a first feature set and a first index of a model to be evaluated, wherein the model to be evaluated comprises the model with first parameter information obtained after an initial model is trained on a training set using the first feature set and a first machine learning algorithm, and the first index characterizes the prediction performance of the model to be evaluated on a test set; removing M features from the first feature set to obtain a second feature set, wherein M is an integer greater than or equal to 1; obtaining a reconstructed model, wherein the reconstructed model comprises a model with second parameter information obtained after the initial model is trained on the training set using the second feature set and the first machine learning algorithm; and, when the ratio of a second index of the reconstructed model to the first index is greater than or equal to a first preset threshold, determining that the M features are redundant features, wherein the second index characterizes the prediction performance of the reconstructed model on the test set.
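The steps above (obtain the first index, cull M features, retrain, compare the index ratio) can be sketched as follows. This is a minimal illustration, not the claimed implementation: `train_and_score` is a hypothetical helper assumed to retrain the initial model on the training set with the given feature subset (using the first machine learning algorithm) and return its test-set index, such as an AUC value.

```python
def is_redundant(train_and_score, first_set, removed, threshold=0.99):
    """Check whether the features in `removed` are redundant.

    train_and_score(features) is a placeholder assumed to retrain the
    initial model with the given feature subset and return its
    test-set index (e.g. AUC).
    """
    first_index = train_and_score(first_set)                 # model to be evaluated
    second_set = [f for f in first_set if f not in removed]  # cull the M features
    second_index = train_and_score(second_set)               # reconstructed model
    # the M features are redundant when the index ratio stays at or
    # above the first preset threshold
    return second_index / first_index >= threshold
```

The threshold of 0.99 mirrors the example given later in the description, where a performance drop of less than 1% is tolerated.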
According to an embodiment of the present disclosure, the method includes repeating the operations of removing, obtaining a reconstructed model, and determining redundant features until all features in the first feature set have been examined. In this case, removing M features from the first feature set to obtain a second feature set comprises removing features from the first feature set one by one. When a feature is determined to be non-redundant, it is put back before the next detection round, and the operations of removing, obtaining a reconstructed model, and determining redundant features are repeated.
According to an embodiment of the present disclosure, culling features one by one from the first feature set comprises: assigning i an initial value in [1, N] and performing the following loop N times, where N is the number of features in the first feature set and i is an integer greater than or equal to 1: remove the ith feature from the first feature set to obtain the corresponding second feature set; when the ith feature is determined to be redundant, leave it removed and set i = i + 1; when the ith feature is determined to be non-redundant, put it back and set i = i + 1.
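The one-by-one culling loop described above might be sketched like this. It is a hedged illustration, not the patented implementation: `train_and_score` is again a hypothetical retraining helper, and the put-back behavior follows the description (a redundant feature stays removed; a non-redundant feature is put back before the next iteration).

```python
def cull_one_by_one(train_and_score, first_set, threshold=0.99):
    """Examine each of the N features once; keep a feature removed only
    if removing it leaves the index ratio at or above the threshold."""
    first_index = train_and_score(first_set)   # index of the model to be evaluated
    kept, redundant = list(first_set), []
    for feature in list(first_set):            # i = 1 .. N
        trial = [f for f in kept if f != feature]  # remove the ith feature
        if train_and_score(trial) / first_index >= threshold:
            kept = trial                       # redundant: leave it removed
            redundant.append(feature)
        # otherwise the feature is put back (kept is left unchanged)
    return kept, redundant
```

Note that each iteration retrains the model once, so the loop costs N retraining runs in total.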
According to an embodiment of the present disclosure, after an ith feature is removed from the first feature set and the corresponding second feature set is obtained, the obtaining the reconstruction model includes: configuring a training process based on first hyper-parameter information of the model to be evaluated, and training the initial model based on the training set by using the second feature set and the first machine learning algorithm to obtain the reconstructed model.
According to an embodiment of the present disclosure, the removing M features from the first feature set to obtain a second feature set includes: obtaining a proxy model, wherein the proxy model comprises a model with third parameter information obtained after training the initial model based on the training set using the first feature set and a second machine learning algorithm. When the ratio of a third index of the proxy model to the first index is greater than or equal to the first preset threshold, obtaining the importance of each feature in the first feature set based on the second machine learning algorithm, wherein the third index is used for representing the predicted performance of the proxy model on the test set. Based on the importance of each feature, removing M features of which the importance is less than or equal to a second preset threshold value from the first feature set to obtain the second feature set.
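The proxy-model variant above might be sketched as follows. The index values and importances here are illustrative assumptions; in practice the third index would come from retraining with the second machine learning algorithm (for example a tree ensemble, whose per-feature importance scores would populate `importances`).

```python
def cull_by_importance(first_index, third_index, importances,
                       ratio_threshold=0.99, importance_threshold=0.01):
    """Importance-based culling guided by a proxy model.

    importances maps each feature in the first feature set to the
    importance reported by the proxy (second-algorithm) model. The
    proxy is only trusted when its index is close enough to the first.
    """
    if third_index / first_index < ratio_threshold:
        return None  # proxy model is not faithful enough to guide culling
    second_set = [f for f, imp in importances.items() if imp > importance_threshold]
    removed = [f for f, imp in importances.items() if imp <= importance_threshold]
    return second_set, removed
```

Compared with one-by-one culling, this variant can remove all M low-importance features with a single retraining run afterwards.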
According to an embodiment of the present disclosure, the obtaining a reconstructed model includes: and obtaining second hyper-parameter information, wherein the second hyper-parameter information is different from the first hyper-parameter information of the model to be evaluated. Configuring a training process based on the second hyper-parameter information, and training the initial model based on the training set by using the second feature set and the first machine learning algorithm to obtain the reconstructed model.
According to an embodiment of the present disclosure, the first feature set includes N features, where N is greater than M. After determining that the M features are redundant features, the method further includes: when removing any single feature from the second feature set yields a reconstructed model whose ratio of the second index to the first index is smaller than the first preset threshold, determining that the number of redundant features in the first feature set is M; and performing redundancy evaluation on the model to be evaluated based on the ratio of M to N.
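The final redundancy evaluation based on the ratio of M to N might look like the following sketch. The grading bands are illustrative assumptions, not taken from the disclosure, which only specifies that the evaluation is based on M/N.

```python
def redundancy_evaluation(n_redundant, n_features):
    """Evaluate the model by the share M/N of its first feature set
    found to be redundant. Grading bands are illustrative only."""
    ratio = n_redundant / n_features
    if ratio == 0.0:
        grade = "no redundancy detected"
    elif ratio < 0.2:
        grade = "mild redundancy"
    else:
        grade = "heavy redundancy"
    return ratio, grade
```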
Another aspect of the embodiments of the present disclosure provides a redundant feature detection apparatus for a model. The redundant feature detection device comprises an obtaining module, a removing module, a reconstructing module and a determining module. The obtaining module is used for obtaining a first feature set and a first index of a model to be evaluated, wherein the model to be evaluated comprises the model with first parameter information obtained after an initial model is trained on the basis of a training set by using the first feature set and a first machine learning algorithm, and the first index is used for representing the prediction performance of the model to be evaluated on a test set. The removing module is used for removing M features from the first feature set to obtain a second feature set, wherein M is an integer greater than or equal to 1. The reconstruction module is configured to obtain a reconstruction model, where the reconstruction model includes a model with second parameter information obtained after the initial model is trained based on the training set using the second feature set and the first machine learning algorithm. The determining module is configured to determine that the M features are redundant features when a ratio of a second index of the reconstructed model to the first index is greater than or equal to a first preset threshold, where the second index is used to characterize the predicted performance of the reconstructed model on the test set.
Another aspect of the disclosed embodiments provides an electronic device. The electronic device includes one or more memories, and one or more processors. The memory stores executable instructions. The processor executes the executable instructions to implement the method as described above.
Another aspect of the embodiments of the present disclosure provides a computer-readable storage medium storing computer-executable instructions for implementing the method as described above when executed.
Yet another aspect of an embodiment of the present disclosure provides a computer program product comprising computer programs/instructions which, when executed by a processor, implement the method as described above.
One or more of the above-described embodiments may provide the following advantage: the problem that redundant features of a trained machine learning model cannot be effectively detected can be at least partially solved. The initial model is trained on the training set using the second feature set and the first machine learning algorithm to obtain a reconstructed model with second parameter information; the prediction performance of the model to be evaluated is then compared with that of the reconstructed model to determine whether the features removed from the first feature set are redundant. The redundant features of the model to be evaluated can thus be detected effectively.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically illustrates an exemplary system architecture to which a redundant feature detection method may be applied, according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow diagram of a method of redundant feature detection of a model according to an embodiment of the present disclosure;
FIG. 3 schematically illustrates a flow diagram of a method of redundant feature detection of a model according to another embodiment of the present disclosure;
FIG. 4 schematically illustrates a flow diagram of culling features from a first feature set one by one, according to an embodiment of the disclosure;
FIG. 5 schematically shows a flow chart for obtaining a second feature set according to an embodiment of the disclosure;
FIG. 6 schematically shows a flow chart for obtaining a reconstructed model according to an embodiment of the disclosure;
FIG. 7 schematically illustrates a flow diagram for redundancy evaluation of a model to be evaluated according to an embodiment of the present disclosure;
FIG. 8 schematically shows a block diagram of a redundant feature detection apparatus of a model according to an embodiment of the present disclosure;
FIG. 9 schematically illustrates an operational flow diagram for redundant feature detection using a redundant feature detection apparatus according to an embodiment of the present disclosure; and
FIG. 10 schematically illustrates a block diagram of a computer system suitable for implementing the redundant feature detection method and apparatus according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The embodiments of the present disclosure provide a redundant feature detection method for a model. The method comprises the following steps: obtaining a first feature set and a first index of a model to be evaluated, wherein the model to be evaluated comprises a model with first parameter information obtained after an initial model is trained on a training set using the first feature set and a first machine learning algorithm, and the first index characterizes the prediction performance of the model to be evaluated on a test set; removing M features from the first feature set to obtain a second feature set, wherein M is an integer greater than or equal to 1; obtaining a reconstructed model, wherein the reconstructed model comprises a model with second parameter information obtained after the initial model is trained on the training set using the second feature set and the first machine learning algorithm; and, when the ratio of a second index of the reconstructed model to the first index is greater than or equal to a first preset threshold, determining that the M features are redundant features, wherein the second index characterizes the prediction performance of the reconstructed model on the test set.
According to an embodiment of the present disclosure, machine learning model redundancy feature detection refers to, for example: the features used by the machine learning model are detected to check whether all of the features used by the machine learning model are necessary. If the machine learning model relies on a large number of unnecessary features (i.e., redundant features), the performance of the machine learning model may be disturbed, which may affect the use effect after actual delivery.
Specifically, for simple tasks, only a few features need to be selected to build a simple model, while complex tasks require selecting a large number of features across dimensions to build a complex model. However, while complex models fit the training data better, they also carry a risk of overfitting. Overfitting means the model over-learns the training data, picking up rules specific to the training data that do not hold across the full data. In other words, when features are redundant, the model is at risk of overfitting; this risk may not show up on the test set at first, but it is more likely to materialize when the model faces more diverse data in the future, degrading its prediction performance.
By using the redundant feature detection method of the embodiment of the disclosure, the initial model is trained based on the training set by using the second feature set and the first machine learning algorithm to obtain the reconstructed model with the second parameter information, and then the prediction performance of the model to be evaluated is compared with that of the reconstructed model to determine whether the features removed from the first feature set are redundant features, so that the redundant features of the model to be evaluated can be effectively detected. After the redundant features are detected, the redundant features can be removed to optimize the model to be evaluated, for example, a reconstruction model can be used for replacing the model to be evaluated, so that the waste of computing power caused by processing the redundant features is avoided.
It should be noted that the method and the apparatus for detecting redundant features of a model according to the embodiments of the present disclosure may be used in the financial field, and may also be used in any field other than the financial field.
Fig. 1 schematically illustrates an exemplary system architecture 100 to which a redundant feature detection method may be applied, according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, 103, a network 104 and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have installed thereon various communication client applications, such as shopping-like applications, web browser applications, search-like applications, instant messaging tools, mailbox clients, social platform software, etc. (by way of example only).
According to the embodiment of the disclosure, user data may be collected by the terminal devices 101, 102, 103 and stored in the server 105 as a training set or a test set, where the collected user data may be analyzed, aggregated, and processed along various dimensions to obtain a set comprising a plurality of features. The server 105 may train the initial model on the training set using the first feature set and the first machine learning algorithm to obtain the model to be evaluated with the first parameter information, and may likewise train the initial model on the training set using the second feature set and the first machine learning algorithm to obtain the reconstructed model with the second parameter information. After training is completed, the model to be evaluated can be used to predict the user data in the test set to obtain an index reflecting its performance.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (for example only) providing support for websites browsed by users using the terminal devices 101, 102, 103. The background management server may analyze and perform other processing on the received data such as the user request, and feed back a processing result (for example, a webpage, information, or data obtained or generated according to the user request) to the terminal device, for example, may feed back an evaluation index (a first index or a second index) of the test set predicted by the model to be evaluated or the reconstructed model to the terminal device.
It should be noted that the redundant feature detection method of the model provided by the embodiments of the present disclosure may generally be performed by the server 105. Accordingly, the redundant feature detection apparatus of the model provided by the embodiments of the present disclosure may generally be disposed in the server 105. The redundant feature detection method may also be performed by a server or server cluster that is different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105. Accordingly, the redundant feature detection apparatus may also be disposed in a server or server cluster different from the server 105 and capable of communicating with the terminal devices 101, 102, 103 and/or the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
FIG. 2 schematically shows a flow chart of a method of redundant feature detection of a model according to an embodiment of the present disclosure.
As shown in fig. 2, the redundant feature detection method of a model according to an embodiment of the present disclosure may include operations S210 to S240.
In operation S210, a first feature set and a first index of a model to be evaluated are obtained, where the model to be evaluated includes a model with first parameter information obtained after an initial model is trained based on a training set by using the first feature set and a first machine learning algorithm, and the first index is used to characterize the prediction performance of the model to be evaluated on a test set.
According to an embodiment of the disclosure, for example, taking a banking credit business risk management model as an example, the first feature set may include feature data obtained by processing financial data, industry and commerce data, credit investigation data, fund transaction data, public opinion data and the like, and the plurality of feature data may reflect personal financial attributes of a credit user.
When the risk management model is a binary classification model, for example, it outputs "0" if the user is a risk user and "1" if the user is a normal user. The first index can then be the AUC (Area Under Curve) value of the model.
Where the risk management model is a regression model, the model may output a probability of default for each customer: for example, a probability of default greater than or equal to 60% represents a defaulting customer, less than 50% a normal customer, and between 50% and 60% a customer to be reviewed (by way of example only). Its first index may then be the mean squared error (MSE) of the model.
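The two indexes mentioned here can be computed without any library; the sketch below uses the rank-based (Mann-Whitney) identity for AUC and the usual mean squared error. This is an illustration of the standard metrics, not the disclosure's own implementation.

```python
def auc(y_true, y_score):
    """AUC via the Mann-Whitney identity: the probability that a random
    positive example is scored above a random negative example."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def mse(y_true, y_pred):
    """Mean squared error, the first index for a regression model."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
```

Either metric can serve as the first or second index; what matters for the method is only the ratio between the reconstructed model's index and the original model's index.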
According to an embodiment of the disclosure, the first machine learning algorithm may be a linear regression algorithm, a Logistic regression algorithm, a naive bayes algorithm, a KNN algorithm, a random forest algorithm, or the like.
According to the embodiment of the disclosure, besides the first feature set and the first index of the model to be evaluated, the structure information and the hyper-parameter information of the model to be evaluated can be obtained. The hyper-parameter information may include hyper-parameters set during model training, and the hyper-parameters related to different algorithms are different, for example, the hyper-parameters include a learning rate, a random number seed, a regularization term, and the like.
In operation S220, M features are removed from the first feature set to obtain a second feature set, where M is an integer greater than or equal to 1.
In operation S230, a reconstructed model is obtained, wherein the reconstructed model includes a model with second parameter information obtained after training the initial model based on the training set using the second feature set and the first machine learning algorithm.
According to the embodiment of the present disclosure, the first parameter information and the second parameter information are, for example, the parameters of the mathematical functions inside a machine learning model; training the machine learning model is precisely the process of determining this parameter information. When the model makes predictions, these parameters transform the input data into the output result.
According to the embodiment of the present disclosure, when the model to be evaluated is a classification model, for example, the reconstruction model is a classification model accordingly. And when the model to be evaluated is the regression model, the reconstruction model is correspondingly the regression model. The difference between the reconstruction model and the model to be evaluated is as follows: because the features used in the training process are different, the parameter information obtained after the final training is completed may be different. The predicted performance of the two can be compared on this basis.
In operation S240, when a ratio of a second index of the reconstructed model to the first index is greater than or equal to a first preset threshold, the M features are determined to be redundant features, where the second index is used to characterize the prediction performance of the reconstructed model on the test set.
According to the embodiment of the present disclosure, for example, when the model to be evaluated is a binary classification model, the requirement may be satisfied when the ratio of AUC(reconstructed) to AUC(to-be-evaluated) is greater than or equal to 0.99 (i.e., the first preset threshold), that is, when the prediction performance of the reconstructed model declines by less than 1% relative to the model to be evaluated (for illustration only).
According to the embodiment of the disclosure, the initial model is trained on the training set using the second feature set and the first machine learning algorithm to obtain the reconstructed model with the second parameter information. The prediction performance of the model to be evaluated is then compared with that of the reconstructed model to determine whether the features removed from the first feature set are redundant. This makes it possible to check whether unnecessary features have been introduced into the model and whether the model carries a latent overfitting risk, and thus to avoid the performance degradation caused by overfitting after the model is deployed online. Since processing redundant features consumes additional computing power and lowers its effective utilization, redundant feature detection also serves as a check on whether the model's computing power is used effectively.
FIG. 3 schematically shows a flow chart of a redundant feature detection method of a model according to another embodiment of the present disclosure.
As shown in fig. 3, the method for detecting redundant features of a model according to an embodiment of the present disclosure includes repeatedly performing the operations of culling (e.g., operation S220), obtaining a reconstructed model (e.g., operation S230), and determining redundant features (e.g., operation S240) until all features in the first feature set have been detected. In addition to operations S210, S230, and S240, the method may further include operations S310 to S330, where operation S310 is one embodiment of operation S220.
In operation S210, a first feature set and a first index of a model to be evaluated are obtained, where the model to be evaluated includes a model with first parameter information obtained after an initial model is trained based on a training set by using the first feature set and a first machine learning algorithm, and the first index is used to characterize the prediction performance of the model to be evaluated on a test set.
In operation S310, features are culled one by one from the first feature set.
According to the embodiment of the disclosure, the second feature set can be obtained by randomly removing one feature from the first feature set, or by removing features from the first feature set one at a time in sequence.
In operation S230, a reconstructed model is obtained, wherein the reconstructed model includes a model with second parameter information obtained after training the initial model based on the training set using the second feature set and the first machine learning algorithm.
In operation S320, it is determined whether the ratio of the second index of the reconstructed model to the first index is greater than or equal to a first preset threshold. When the ratio is greater than or equal to the first preset threshold, operation S240 is performed; when it is less than the first preset threshold, operation S330 is performed.
In operation S240, when a ratio of a second index of the reconstructed model to the first index is greater than or equal to a first preset threshold, the M features are determined to be redundant features, where the second index is used to characterize the prediction performance of the reconstructed model on the test set.
In operation S330, when a feature is determined to be a non-redundant feature, the feature is put back, and the culling (e.g., operation S310), obtaining a reconstructed model (e.g., operation S230), and determining redundant features (e.g., operation S240) are repeated in the next detection round.
According to an embodiment of the present disclosure, the first feature set of a banking credit business risk management model may be, for example, [financial data, business data, credit data, fund transaction data, public opinion data]. First, the financial data is removed, giving the second feature set [business data, credit data, fund transaction data, public opinion data]. The initial model is then trained with the second feature set to obtain a reconstructed model; if the performance of the reconstructed model drops sharply compared with the model to be evaluated (for example, most risk users are identified as normal users), the financial data is a non-redundant feature. The financial data is then put back, and the remaining undetected features of [financial data, business data, credit data, fund transaction data, public opinion data] are removed in turn.
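Under the assumptions of this example, the one-by-one detection loop of operations S310, S230, and S240 might be sketched as follows. Here `train_and_score` is a stand-in for retraining the initial model and measuring its test-set AUC, and the AUC table is invented purely for illustration:

```python
# Baseline first index of the model to be evaluated and the first
# preset threshold (both illustrative).
BASELINE_AUC = 0.89
THRESHOLD = 0.99

# Hypothetical AUC obtained when the named feature alone is removed:
AUC_WITHOUT = {
    "financial data": 0.80,         # sharp slide-down -> non-redundant
    "business data": 0.889,         # no real slide-down -> redundant
    "credit data": 0.82,
    "fund transaction data": 0.84,
    "public opinion data": 0.8885,
}

def train_and_score(feature_set):
    # Stand-in for retraining: pretend performance is limited by the
    # most informative feature that has been removed.
    removed = set(AUC_WITHOUT) - set(feature_set)
    return min([AUC_WITHOUT[f] for f in removed], default=BASELINE_AUC)

def detect_redundant(features):
    redundant = []
    for f in list(features):
        trial = [x for x in features if x != f and x not in redundant]
        auc = train_and_score(trial)
        if auc / BASELINE_AUC >= THRESHOLD:
            redundant.append(f)   # keep the feature culled
        # otherwise the feature is put back for the next round
    return redundant

print(detect_redundant(list(AUC_WITHOUT)))  # culls business and public opinion data
```

A real implementation would retrain the model inside `train_and_score` with the same algorithm and hyper-parameters as the model to be evaluated.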
Fig. 4 schematically shows a flowchart of individually culling features from the first feature set in operation S310 according to an embodiment of the present disclosure.
As shown in fig. 4, the initial value of i may be assigned any value within [1, N], and operations S410 to S440 are then performed in a loop N times, where N is the number of features in the first feature set and i is an integer greater than or equal to 1.
In operation S410, the ith feature is removed from the first feature set, and a corresponding second feature set is obtained.
According to embodiments of the present disclosure, financial data is removed from [financial data, business data, credit data, fund transaction data, public opinion data], for example. In some embodiments of the present disclosure, the initial value of i may also be assigned a random value in [1, N], for example i = 3, in which case the credit data is removed first.
In operation S420, it is determined whether the ith feature is a redundant feature, and when the ith feature is not a redundant feature, operation S430 is performed, and when the ith feature is a redundant feature, operation S440 is performed.
In operation S430, when it is determined that the ith feature is a non-redundant feature, the ith feature is put back, and then i = i + 1 is set.
According to the embodiment of the disclosure, when the financial data is a non-redundant feature, the financial data is put back, and the business data is removed next from [financial data, business data, credit data, fund transaction data, public opinion data].
In operation S440, when the ith feature is determined to be a redundant feature, the ith feature remains culled, and then i = i + 1 is set.
According to the embodiment of the disclosure, when the financial data is a redundant feature, the business data is next removed from [business data, credit data, fund transaction data, public opinion data].
Further, when the business data is also a redundant feature, the credit data can next be removed from [credit data, fund transaction data, public opinion data].
In some embodiments of the present disclosure, i = i + s may be set, where s is a random number; that is, any of the remaining features may be randomly culled.
By using the redundant feature detection method of the embodiment of the disclosure, the features can be detected one by one, and the contribution of each feature to the model to be evaluated can be determined. When removing a feature does not cause the prediction performance of the resulting reconstructed model to slide down, that feature is a redundant feature that does not contribute to the prediction performance of the model to be evaluated, and subsequent processing of the model to be evaluated is not wasted on it.
According to the embodiment of the disclosure, after the ith feature is removed from the first feature set and the corresponding second feature set is obtained, obtaining the reconstructed model includes: configuring the training process based on the first hyper-parameter information of the model to be evaluated, and training the initial model on the training set using the second feature set and the first machine learning algorithm to obtain the reconstructed model.
According to the embodiment of the disclosure, the reconstructed model and the model to be evaluated are trained with the same hyper-parameter information, so that the removed feature is the only difference between the two. The role of each feature can then be determined by comparing their prediction performance, which can also provide a reference for the subsequent training of models with similar functions.
Fig. 5 schematically shows a flowchart of obtaining the second feature set in operation S220 according to an embodiment of the present disclosure.
As shown in fig. 5, the operation S220 of removing M features from the first feature set to obtain the second feature set may include operations S510 to S530.
In operation S510, a proxy model is obtained, wherein the proxy model includes a model having third parameter information obtained after training an initial model based on a training set using a first feature set and a second machine learning algorithm.
According to the embodiment of the disclosure, the second machine learning algorithm may be a strong machine learning algorithm from which feature importance can be obtained, such as a random forest algorithm or a gradient boosting tree algorithm. In some embodiments of the present disclosure, the first machine learning algorithm may itself be such an algorithm, in which case the second machine learning algorithm may be the same as the first machine learning algorithm.
In operation S520, when a ratio of a third index of the proxy model to the first index is greater than or equal to a first preset threshold, obtaining an importance of each feature in the first feature set based on a second machine learning algorithm, where the third index is used to characterize a predicted performance of the proxy model for the test set.
According to an embodiment of the present disclosure, the proxy model is a proxy for the model to be evaluated: it is trained from the same training set and has the same inputs and outputs. Therefore, when the proxy model is trained on the same training set, its third index should be close to the first index of the model to be evaluated (i.e., the ratio of the third index to the first index is greater than or equal to the first preset threshold). Once a satisfactory proxy model is obtained, the importance of each feature is obtained from it.
In operation S530, based on the importance of each feature, M features having an importance less than or equal to a second preset threshold are removed from the first feature set to obtain a second feature set.
According to the embodiment of the present disclosure, the importance of each feature in the proxy model is examined; for example, the second preset threshold may be set to 0 (as an example only), meaning that features with an importance of 0 are removed from the first feature set to obtain the second feature set.
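The importance-based culling of operations S520 and S530 can be sketched as follows. The importance values below are illustrative; in practice they might come from, e.g., a trained random forest's feature importances:

```python
def cull_by_importance(importances, second_threshold=0.0):
    # Split the first feature set into the second feature set (kept)
    # and the M culled candidate redundant features, using the second
    # preset threshold on per-feature importance.
    kept = [f for f, imp in importances.items() if imp > second_threshold]
    culled = [f for f, imp in importances.items() if imp <= second_threshold]
    return kept, culled

# Hypothetical importances from an accepted proxy model:
importances = {
    "financial data": 0.41,
    "business data": 0.0,
    "credit data": 0.33,
    "fund transaction data": 0.26,
    "public opinion data": 0.0,
}
second_set, culled = cull_by_importance(importances)
print(second_set)  # features kept for retraining the reconstructed model
print(culled)      # M = 2 candidate redundant features
```

Unlike the one-by-one feature reduction loop, this removes all M low-importance features in a single pass before retraining once.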
By using the redundant feature detection method of the embodiment of the disclosure, because the prediction performance of the proxy model is similar to that of the model to be evaluated, the importance of each feature in the proxy model reflects its contribution to the model to be evaluated. Removing at least one feature of low importance in a single pass can therefore improve the efficiency of feature elimination.
Fig. 6 schematically shows a flowchart of obtaining a reconstructed model in operation S230 according to an embodiment of the present disclosure.
As shown in fig. 6, obtaining the reconstructed model in operation S230 may include operations S610 to S620.
In operation S610, second hyper-parameter information is obtained, where the second hyper-parameter information is different from the first hyper-parameter information of the model to be evaluated.
In operation S620, a training process is configured based on the second hyper-parameter information, and the initial model is trained based on the training set using the second feature set and the first machine learning algorithm, so as to obtain a reconstructed model.
According to the embodiment of the disclosure, after a satisfactory proxy model is obtained, a plurality of features of low importance may be removed from the first feature set at once. When the reconstructed model is then obtained using the second feature set, training the initial model with the first hyper-parameter information may reduce training efficiency, and the resulting reconstructed model may even fail to meet requirements, producing an erroneous redundant feature detection result. Therefore, the learning ability of the reconstructed model is adjusted by obtaining the second hyper-parameter information, which can improve training efficiency.
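A minimal sketch of the hyper-parameter adjustment in operations S610 and S620, assuming tree-model hyper-parameters. The specific keys and the adjustment rule are illustrative assumptions: the disclosure only requires that the second hyper-parameter information differ from the first so that the reconstructed model's learning ability suits the reduced feature set:

```python
first_hyperparams = {"n_estimators": 500, "max_depth": 8, "learning_rate": 0.05}

def derive_second_hyperparams(first, n_features_removed, n_features_total):
    # Derive second hyper-parameter information from the first without
    # mutating it. With a large share of features removed, a shallower,
    # smaller configuration may train faster and overfit less.
    second = dict(first)
    if n_features_removed / n_features_total > 0.2:
        second["max_depth"] = max(3, first["max_depth"] - 2)
        second["n_estimators"] = first["n_estimators"] // 2
    return second

# 2 of 5 features culled -> 40% removed, so the rule fires:
second_hyperparams = derive_second_hyperparams(first_hyperparams, 2, 5)
print(second_hyperparams)
```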
FIG. 7 schematically shows a flow diagram for redundancy evaluation of a model to be evaluated according to an embodiment of the present disclosure.
As shown in fig. 7, performing redundancy evaluation on the model to be evaluated may include operations S710 to S720.
In operation S710, when any feature is removed from the second feature set, and a ratio of a second index to a first index of the obtained reconstructed model is smaller than a first preset threshold, it is determined that the number of redundant features of the first feature set is M.
According to the embodiment of the present disclosure, for example, from [financial data, business data, credit data, fund transaction data, public opinion data], it is determined that the business data and the public opinion data are redundant features. After any feature is removed from [financial data, credit data, fund transaction data], the prediction performance of the resulting reconstructed model slides down sharply compared with the model to be evaluated, so the number of redundant features can be determined to be 2.
In operation S720, redundancy evaluation is performed on the model to be evaluated based on the ratio of M to N.
According to an embodiment of the present disclosure, the redundancy evaluation of the model to be evaluated may be calculated using the following formula:
Redundancy = M / N
wherein Redundancy represents the redundancy of the model, M is the number of redundant features, and N is the number of features in the first feature set.
When the number of redundant features is 2 and the number of features in the first feature set is 5, the redundancy of the risk management model is 2/5 = 0.4.
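The redundancy calculation is then a one-line function:

```python
def redundancy(m_redundant: int, n_features: int) -> float:
    # Redundancy = M / N, per the formula above.
    return m_redundant / n_features

print(redundancy(2, 5))  # 0.4
```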
According to another embodiment of the disclosure, taking the banking credit business risk management model as an example, the model predicts whether a target customer has default risk and may be a binary classification model. For example, when the model is trained on the training set, the data features used include 45 features (just an example), such as financial data, business data, credit data, fund transaction data, and public opinion data. The model then predicts on the test set, and an AUC value is calculated from the prediction results; for example, the first-index AUC value is 0.89.
First, the feature reduction method is used, i.e., features are culled one by one from the first feature set. After each feature is eliminated, the change in the second index of the reconstructed model on the test set is calculated, and the feature whose removal changes the index least is treated as a redundant feature. The above process is repeated until, after any remaining feature is removed, the second-index AUC value slides down to 0.8811, a slide-down of close to 1% (for example only). In this process, 4 features, such as the current asset turnover rate and the total number of debts, are eliminated in turn, and the redundancy of the model is calculated to be about 0.09.
Then, the proxy model method is used, i.e., a proxy model is obtained with a second machine learning algorithm, and the importance of each feature is obtained. Based on these 45 features, modeling is performed, for example, with a gradient boosting tree algorithm, and the hyper-parameters are tuned until the model achieves an AUC value of 0.89. The feature importances of the model are then examined, and 2 features, the current asset turnover rate and the total number of debts, are found to have an importance of 0. These two features are rejected, and the model is retrained with the remaining features (i.e., a reconstructed model is obtained). If the resulting AUC value reaches 0.8811 or above, the gap between the reconstructed model after feature rejection and the model to be evaluated is guaranteed not to exceed 1%; the two features are then considered redundant features, and the redundancy of the model is calculated to be 0.04.
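The arithmetic of this worked example can be checked directly (45 features, with the acceptable slide-down bounded by the 0.99 ratio threshold):

```python
N = 45
print(round(4 / N, 2))  # 0.09, redundancy via the feature reduction method
print(round(2 / N, 2))  # 0.04, redundancy via the proxy model method

# 0.8811 is exactly a 1% slide from 0.89 (0.89 * 0.99 = 0.8811), so any
# reconstructed AUC at or above it keeps the ratio within the threshold.
print(round(0.89 * 0.99, 4))  # 0.8811
```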
Finally, as can be seen from the above, 4 features can be removed by the feature reduction method, while only 2 can be removed by the proxy model method. Therefore, the model to be evaluated may eventually be replaced by the reconstructed model corresponding to the feature reduction method.
In some embodiments of the present disclosure, after the redundant features are determined using the proxy model method, the feature reduction method may further be used to detect the remaining features one by one.
By using the redundant feature detection method of the embodiment of the disclosure, the model can be evaluated in terms of its redundancy, which provides a decision basis for the operation management of the model and avoids the model decay risk and performance waste caused by feature redundancy.
Fig. 8 schematically shows a block diagram of a redundant feature detection apparatus 800 of a model according to an embodiment of the present disclosure.
As shown in fig. 8, the redundant feature detection apparatus 800 may include an obtaining module 810, a culling module 820, a reconstructing module 830, and a determining module 840.
The obtaining module 810 may perform operation S210, for example, to obtain a first feature set and a first index of a model to be evaluated, where the model to be evaluated includes a model with first parameter information obtained after an initial model is trained based on a training set by using the first feature set and a first machine learning algorithm, and the first index is used to characterize the prediction performance of the model to be evaluated on a test set.
The culling module 820 may perform operation S220, for example, to cull M features from the first feature set to obtain a second feature set, where M is an integer greater than or equal to 1.
The reconstruction module 830 may perform operation S230, for example, for obtaining a reconstructed model, wherein the reconstructed model includes a model with second parameter information obtained after training the initial model based on the training set using the second feature set and the first machine learning algorithm.
The determining module 840 may perform operation S240, for example, to determine that the M features are redundant features when a ratio of a second index of the reconstructed model to the first index is greater than or equal to a first preset threshold, where the second index is used to characterize the predicted performance of the reconstructed model on the test set.
According to the embodiment of the present disclosure, the culling module 820 may further perform operations S310 and S320, for example, to cull features from the first feature set one by one, and when a feature is determined to be a non-redundant feature, repeatedly perform the operations of culling, obtaining a reconstruction model, and determining a redundant feature after the feature is put back in a next detection process.
According to the embodiment of the disclosure, the culling module 820 may further assign the initial value of i to any value within [1, N] and loop through operations S410 to S440 N times, where N is the number of features in the first feature set and i is an integer greater than or equal to 1. Specifically, the ith feature is removed from the first feature set to obtain the corresponding second feature set. When the ith feature is determined to be a redundant feature, the ith feature remains culled and i = i + 1 is set. When the ith feature is determined to be a non-redundant feature, the ith feature is put back and then i = i + 1 is set.
According to an embodiment of the disclosure, the culling module 820 may further perform operations S510 to S530, for example, to obtain a proxy model, where the proxy model includes a model with third parameter information obtained after training the initial model based on the training set using the first feature set and the second machine learning algorithm. When the ratio of a third index of the proxy model to the first index is greater than or equal to the first preset threshold, the importance of each feature in the first feature set is obtained based on the second machine learning algorithm, where the third index is used to characterize the prediction performance of the proxy model on the test set. Based on the importance of each feature, M features with importance less than or equal to a second preset threshold are removed from the first feature set to obtain the second feature set.
According to an embodiment of the present disclosure, the reconstruction module 830 may be further configured to, after the ith feature is removed from the first feature set and the corresponding second feature set is obtained, configure the training process based on the first hyper-parameter information of the model to be evaluated and train the initial model on the training set using the second feature set and the first machine learning algorithm to obtain the reconstructed model.
According to an embodiment of the present disclosure, the reconstruction module 830 may further perform operations S610 to S620, for example, to obtain second hyper-parameter information, where the second hyper-parameter information is different from the first hyper-parameter information of the model to be evaluated. And configuring a training process based on the second hyper-parameter information, and training the initial model based on the training set by using a second feature set and a first machine learning algorithm to obtain a reconstructed model.
According to an embodiment of the present disclosure, the redundant feature detection apparatus 800 may further include a redundancy evaluation module. The redundancy evaluation module may perform operations S710 to S720, for example, to determine that the number of redundant features in the first feature set is M when any feature is removed from the second feature set and a ratio of a second index to a first index of the obtained reconstructed model is smaller than a first preset threshold. And performing redundancy evaluation on the model to be evaluated based on the ratio of M to N.
The detailed flow of the redundant feature detection using the redundant feature detection apparatus 800 is described in detail below with reference to fig. 9.
Fig. 9 schematically illustrates an operational flow diagram for redundant feature detection using a redundant feature detection apparatus 800 according to an embodiment of the present disclosure.
As shown in fig. 9, the redundant feature detection using the redundant feature detection apparatus 800 may include operations S910 to S980.
In operation S910, a feature set (e.g., a first feature set) of a model to be evaluated, a training set of the model to be evaluated, an index (e.g., a first index) of the model to be evaluated, and a test set of the model to be evaluated may be obtained through the redundant feature detection apparatus 800. Then, the redundant feature detection may be performed by selectively using a feature reduction method and a proxy model method, respectively, where the selecting a feature reduction rule performs operation S930 and the selecting a proxy model rule performs operation S920.
In operation S920, it may be determined whether the first machine learning algorithm has a feature importance level, and if the first machine learning algorithm has the feature importance level, operation S922 is performed, and if the first machine learning algorithm has no feature importance level, operation S921 is performed.
In operation S921, if the first machine learning algorithm cannot obtain the feature importance, a proxy model may be obtained using algorithms such as random forest or gradient boosting tree; refer specifically to operation S510.
In operation S922, if the first machine learning algorithm can obtain the feature importance, the importance of each feature in the first feature set is obtained from it directly; otherwise, each feature importance is obtained from the proxy model. Refer specifically to operation S520.
In operation S923, the M features whose importance is less than or equal to the preset threshold (for example, features whose importance is 0) may be obtained; refer specifically to operation S530.
In operation S930, features are removed from the first feature set one by using a feature reduction method, which may specifically refer to operation S310.
In operation S940, a second feature set is obtained.
In operation S950, a reconstructed model is obtained based on the second feature set, which may specifically refer to operation S230.
In operation S960, it is determined whether the prediction performance of the reconstructed model with respect to the model to be evaluated is degraded, for example, whether a ratio of a second index to a first index of the reconstructed model is greater than or equal to a first preset threshold may be compared.
For the reconstructed model obtained by the feature reduction method, when the prediction performance is significantly degraded, the removed feature is determined to be a non-redundant feature, and operations S930, S940, S950, and S960 are repeated after the feature is put back in the next detection round. When the prediction performance is not significantly degraded, operation S970 is performed.
For the reconstructed model obtained by the proxy model method, operation S970 is performed when the prediction performance is not significantly degraded.
In operation S970, if a redundant feature is determined according to the feature reduction method, operations S930 (removing an undetected feature), S940, S950, and S960 may be repeated in the next detection round until all features in the first feature set have been detected. If M redundant features are determined according to the proxy model method, the number of redundant features can be determined directly.
In operation S980, the number of redundant features is determined based on the result of operation S970, and the redundancy evaluation of the model to be evaluated is performed, which may specifically refer to operations S710 to S720.
Any number of modules, sub-modules, units, sub-units, or at least part of the functionality of any number thereof according to embodiments of the present disclosure may be implemented in one module. Any one or more of the modules, sub-modules, units, and sub-units according to the embodiments of the present disclosure may be implemented by being split into a plurality of modules. Any one or more of the modules, sub-modules, units, sub-units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in any other reasonable manner of hardware or firmware by integrating or packaging a circuit, or in any one of or a suitable combination of software, hardware, and firmware implementations. Alternatively, one or more of the modules, sub-modules, units, sub-units according to embodiments of the disclosure may be at least partially implemented as a computer program module, which when executed may perform the corresponding functions.
For example, any of the obtaining module 810, the culling module 820, the reconstructing module 830, and the determining module 840 may be combined into one module to be implemented, or any one of the modules may be split into multiple modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of the other modules and implemented in one module. According to an embodiment of the disclosure, at least one of the obtaining module 810, the culling module 820, the reconstructing module 830, and the determining module 840 may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented in hardware or firmware in any other reasonable manner of integrating or packaging a circuit, or may be implemented in any one of three implementations of software, hardware, and firmware, or in a suitable combination of any of them. Alternatively, at least one of the obtaining module 810, the culling module 820, the reconstructing module 830 and the determining module 840 may be at least partially implemented as a computer program module, which when executed may perform a corresponding function.
FIG. 10 schematically illustrates a block diagram of a computer system suitable for implementing the redundant feature detection method and apparatus according to an embodiment of the present disclosure. The computer system illustrated in FIG. 10 is only one example and should not impose any limitations on the scope of use or functionality of embodiments of the disclosure.
As shown in fig. 10, a computer system 1000 according to an embodiment of the present disclosure includes a processor 1001 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 1002 or a program loaded from a storage section 1008 into a Random Access Memory (RAM) 1003. Processor 1001 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 1001 may also include onboard memory for caching purposes. The processor 1001 may include a single processing unit or multiple processing units for performing different actions of a method flow according to embodiments of the present disclosure.
In the RAM 1003, various programs and data necessary for the operation of the system 1000 are stored. The processor 1001, ROM 1002, and RAM 1003 are connected to each other by a bus 1004. The processor 1001 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 1002 and/or the RAM 1003. Note that the program may also be stored in one or more memories other than the ROM 1002 and the RAM 1003. The processor 1001 may also perform various operations of method flows according to embodiments of the present disclosure by executing programs stored in one or more memories.
System 1000 may also include an input/output (I/O) interface 1005, the input/output (I/O) interface 1005 also being connected to bus 1004, according to an embodiment of the present disclosure. The system 1000 may also include one or more of the following components connected to the I/O interface 1005: an input section 1006 including a keyboard, a mouse, and the like; an output section 1007 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker; a storage portion 1008 including a hard disk and the like; and a communication section 1009 including a network interface card such as a LAN card, a modem, or the like. The communication section 1009 performs communication processing via a network such as the internet. The driver 1010 is also connected to the I/O interface 1005 as necessary. A removable medium 1011 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is mounted on the drive 1010 as necessary, so that a computer program read out therefrom is mounted into the storage section 1008 as necessary.
According to embodiments of the present disclosure, method flows according to embodiments of the present disclosure may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable storage medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication part 1009 and/or installed from the removable medium 1011. The computer program performs the above-described functions defined in the system of the embodiment of the present disclosure when executed by the processor 1001. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
Embodiments of the present disclosure also include a computer program product comprising a computer program containing program code for performing the method provided by the embodiments of the present disclosure; when the computer program product is run on an electronic device, the program code causes the electronic device to carry out the redundant feature detection method for a model provided by the embodiments of the present disclosure.
In one embodiment, the computer program may be hosted on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed as a signal over a network medium, and downloaded and installed through the communication section 1009 and/or installed from the removable medium 1011. The program code contained in the computer program may be transmitted using any suitable network medium, including but not limited to wireless, wired, or any suitable combination of the foregoing.
According to embodiments of the present disclosure, program code for carrying out the computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, these computer programs may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. Such languages include, but are not limited to, Java, C++, Python, the "C" language, and the like. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In the latter cases, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The embodiments of the present disclosure have been described above. However, these embodiments are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to fall within the scope of the present disclosure.

Claims (10)

1. A method of redundant feature detection for a model, comprising:
obtaining a first feature set and a first index of a model to be evaluated, wherein the model to be evaluated is a model with first parameter information obtained by training an initial model on a training set using the first feature set and a first machine learning algorithm, and the first index characterizes the prediction performance of the model to be evaluated on a test set;
removing M features from the first feature set to obtain a second feature set, wherein M is an integer greater than or equal to 1;
obtaining a reconstructed model, wherein the reconstructed model is a model with second parameter information obtained by training the initial model on the training set using the second feature set and the first machine learning algorithm; and
determining that the M features are redundant features when a ratio of a second index of the reconstructed model to the first index is greater than or equal to a first preset threshold, wherein the second index characterizes the prediction performance of the reconstructed model on the test set.
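For illustration only (this code is not part of the claims), claim 1's decision rule can be sketched in a few lines of Python. The names are hypothetical stand-ins: `first_index` and `second_index` stand for any test-set metric, such as AUC, measured before and after removing the M features.

```python
def is_redundant(first_index, second_index, threshold):
    """Claim 1's decision rule: the removed features are redundant when the
    ratio of the reconstructed model's metric to the original model's metric
    stays at or above the first preset threshold."""
    return second_index / first_index >= threshold

# Toy example: the metric drops from 0.90 to 0.89 after removing M features,
# a ratio of ~0.989, so the removal is acceptable at a 0.98 threshold.
assert is_redundant(0.90, 0.89, threshold=0.98) is True
assert is_redundant(0.90, 0.85, threshold=0.98) is False
```

The threshold encodes how much prediction performance one is willing to trade for a smaller feature set; a value close to 1 tolerates almost no degradation.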
2. The redundant feature detection method of claim 1, wherein the removing, reconstructed-model-obtaining and redundant-feature-determining operations are repeated until every feature in the first feature set has been detected;
the removing M features from the first feature set to obtain a second feature set then comprises: removing features from the first feature set one by one; and
when a feature is determined to be a non-redundant feature, the feature is put back before the next detection round, and the removing, reconstructed-model-obtaining and redundant-feature-determining operations are repeated.
3. The redundant feature detection method of claim 2, wherein removing features one by one from the first feature set comprises:
assigning i an initial value equal to any value in [1, N], and performing the following operations N times in a loop, where N is the number of features in the first feature set and i is an integer greater than or equal to 1:
removing the i-th feature from the first feature set to obtain a corresponding second feature set;
when the i-th feature is determined to be a redundant feature, discarding the i-th feature and then setting i = i + 1; and
when the i-th feature is determined to be a non-redundant feature, putting the i-th feature back and then setting i = i + 1.
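As a non-authoritative sketch of the loop in claims 2 and 3, the following Python walks the feature set once, removing each feature in turn, keeping the removal when the metric ratio stays at or above the threshold, and putting the feature back otherwise. The `evaluate` callback (standing in for retraining on the reduced set and scoring on the test set) and all other names are hypothetical:

```python
def detect_redundant_features(features, first_index, evaluate, threshold):
    """One pass over the first feature set (claims 2-3): remove the i-th
    feature, retrain/score via `evaluate`, keep the removal if the metric
    ratio stays at or above the threshold, otherwise put the feature back."""
    kept = list(features)
    redundant = []
    for f in list(features):                  # i = 1 .. N
        trial = [x for x in kept if x != f]   # candidate second feature set
        second_index = evaluate(trial)        # retrain and score on test set
        if second_index / first_index >= threshold:
            kept = trial                      # redundant: leave it removed
            redundant.append(f)
        # else: non-redundant, `kept` is unchanged (feature is put back)
    return kept, redundant

# Toy stand-in for "retrain and score": the metric is a sum of per-feature
# contributions, so feature "c" adds nothing and is detected as redundant.
weights = {"a": 0.5, "b": 0.4, "c": 0.0}
score = lambda feats: sum(weights[f] for f in feats)
kept, redundant = detect_redundant_features(["a", "b", "c"], 0.9, score, 0.98)
assert kept == ["a", "b"] and redundant == ["c"]
```

In a real setting `evaluate` would be expensive (a full retraining per feature), which is what motivates the importance-guided shortcut of claim 5.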
4. The redundant feature detection method of claim 3, wherein obtaining the reconstructed model, after the i-th feature has been removed from the first feature set to obtain the corresponding second feature set, comprises:
configuring the training process based on first hyper-parameter information of the model to be evaluated, and training the initial model on the training set using the second feature set and the first machine learning algorithm to obtain the reconstructed model.
5. The redundant feature detection method of claim 1, wherein the removing M features from the first feature set to obtain a second feature set comprises:
obtaining a proxy model, wherein the proxy model is a model with third parameter information obtained by training the initial model on the training set using the first feature set and a second machine learning algorithm;
when the ratio of a third index of the proxy model to the first index is greater than or equal to the first preset threshold, obtaining the importance of each feature in the first feature set from the second machine learning algorithm, wherein the third index characterizes the prediction performance of the proxy model on the test set; and
based on the importance of each feature, removing from the first feature set the M features whose importance is less than or equal to a second preset threshold, to obtain the second feature set.
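A minimal sketch of claim 5's final step, assuming the per-feature importances have already been obtained from the proxy model (for a tree-based second algorithm these would typically be its feature-importance scores); all names below are hypothetical:

```python
def split_by_importance(features, importances, second_threshold):
    """Claim 5, last step: the M features whose proxy-model importance is
    less than or equal to the second preset threshold are removed; the
    remaining features form the second feature set."""
    second_set = [f for f in features if importances[f] > second_threshold]
    removed = [f for f in features if importances[f] <= second_threshold]
    return second_set, removed

# Toy importances as might come from a tree-based proxy model.
imps = {"age": 0.45, "income": 0.40, "zodiac": 0.01}
second_set, removed = split_by_importance(["age", "income", "zodiac"], imps, 0.05)
assert second_set == ["age", "income"] and removed == ["zodiac"]
```

The proxy-model check (third index vs. first index) guards this shortcut: importances are only trusted when the proxy model predicts roughly as well as the model under evaluation.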
6. The redundant feature detection method of claim 5, wherein the obtaining a reconstructed model comprises:
obtaining second hyper-parameter information, wherein the second hyper-parameter information differs from the first hyper-parameter information of the model to be evaluated; and
configuring the training process based on the second hyper-parameter information, and training the initial model on the training set using the second feature set and the first machine learning algorithm to obtain the reconstructed model.
7. The redundant feature detection method of claim 1, wherein the first feature set comprises N features, N being greater than M, and the method further comprises, after the determining that the M features are redundant features:
when removing any one feature from the second feature set yields a reconstructed model whose second-index-to-first-index ratio is smaller than the first preset threshold, determining that the number of redundant features in the first feature set is M; and
performing a redundancy evaluation of the model to be evaluated based on the ratio of M to N.
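Once no further single-feature removal passes the threshold test, claim 7's redundancy evaluation reduces to a single ratio; a trivial hedged sketch with hypothetical names:

```python
def redundancy_ratio(m, n):
    """Claim 7: with exactly M of the N original features found redundant,
    the model's feature redundancy can be evaluated as M / N."""
    if not 0 < m < n:
        raise ValueError("expected 0 < M < N")  # M >= 1 and N > M per claims 1 and 7
    return m / n

# A model whose first feature set had 10 features, 3 of them redundant.
assert redundancy_ratio(3, 10) == 0.3
```

A higher ratio suggests the model was trained with many features that contribute nothing to its test-set performance.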
8. An apparatus for redundant feature detection of a model, comprising:
an obtaining module, configured to obtain a first feature set and a first index of a model to be evaluated, wherein the model to be evaluated is a model with first parameter information obtained by training an initial model on a training set using the first feature set and a first machine learning algorithm, and the first index characterizes the prediction performance of the model to be evaluated on a test set;
a removing module, configured to remove M features from the first feature set to obtain a second feature set, where M is an integer greater than or equal to 1;
a reconstruction module, configured to obtain a reconstruction model, where the reconstruction model includes a model with second parameter information obtained after the initial model is trained based on the training set using the second feature set and the first machine learning algorithm;
and a determining module, configured to determine that the M features are redundant features when a ratio of a second index of the reconstructed model to the first index is greater than or equal to a first preset threshold, where the second index is used to characterize the prediction performance of the reconstructed model on the test set.
9. An electronic device, comprising:
one or more memories storing executable instructions; and
one or more processors executing the executable instructions to implement the method of any one of claims 1-7.
10. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 7.
CN202110492602.3A 2021-05-06 2021-05-06 Redundant feature detection method, detection device, electronic apparatus, and medium Pending CN113077016A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110492602.3A CN113077016A (en) 2021-05-06 2021-05-06 Redundant feature detection method, detection device, electronic apparatus, and medium

Publications (1)

Publication Number Publication Date
CN113077016A true CN113077016A (en) 2021-07-06

Family

ID=76616239

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110492602.3A Pending CN113077016A (en) 2021-05-06 2021-05-06 Redundant feature detection method, detection device, electronic apparatus, and medium

Country Status (1)

Country Link
CN (1) CN113077016A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107798563A (en) * 2017-11-09 2018-03-13 山东师范大学 Internet advertising effect assessment method and system based on multi-modal feature
CN108009643A (en) * 2017-12-15 2018-05-08 清华大学 A kind of machine learning algorithm automatic selecting method and system
CN109711467A (en) * 2018-12-26 2019-05-03 中国科学技术大学 Data processing equipment and method, computer system
CN109858613A (en) * 2019-01-22 2019-06-07 鹏城实验室 A kind of compression method of deep neural network, system and terminal device
CN110059112A (en) * 2018-09-12 2019-07-26 中国平安人寿保险股份有限公司 Usage mining method and device based on machine learning, electronic equipment, medium
CN110675243A (en) * 2019-08-30 2020-01-10 北京银联金卡科技有限公司 Machine learning-fused credit prediction overdue method and system
US20200242400A1 (en) * 2019-01-25 2020-07-30 Oath Inc. Systems and methods for hyper parameter optimization for improved machine learning ensembles


Similar Documents

Publication Publication Date Title
CN110929799B (en) Method, electronic device, and computer-readable medium for detecting abnormal user
CN114358147B (en) Training method, recognition method, device and equipment for abnormal account recognition model
CN111145009A (en) Method and device for evaluating risk after user loan and electronic equipment
CN110704772A (en) Page abnormity monitoring method, system, device, electronic equipment and computer readable medium
US20220198256A1 (en) Artificial intelligence based operations in enterprise application
CN111179051A (en) Financial target customer determination method and device and electronic equipment
CN114462532A (en) Model training method, device, equipment and medium for predicting transaction risk
US11790278B2 (en) Determining rationale for a prediction of a machine learning based model
CA3135466A1 (en) User loan willingness prediction method and device and computer system
CN111191677B (en) User characteristic data generation method and device and electronic equipment
US20210241179A1 (en) Real-time predictions based on machine learning models
CN110245684B (en) Data processing method, electronic device, and medium
CN114463138A (en) Risk monitoring method, device, equipment and storage medium
CN113610625A (en) Overdue risk warning method and device and electronic equipment
CN113379124A (en) Personnel stability prediction method and device based on prediction model
CN113112352A (en) Risk service detection model training method, risk service detection method and device
US10832393B2 (en) Automated trend detection by self-learning models through image generation and recognition
CN116304910A (en) Anomaly detection method, device, equipment and storage medium for operation and maintenance data
CN113077016A (en) Redundant feature detection method, detection device, electronic apparatus, and medium
CN114912541A (en) Classification method, classification device, electronic equipment and storage medium
CN114328123A (en) Abnormality determination method, training method, device, electronic device, and storage medium
CN113052509A (en) Model evaluation method, model evaluation apparatus, electronic device, and storage medium
CN112434083A (en) Event processing method and device based on big data
CN113159877A (en) Data processing method, device, system and computer readable storage medium
CN114996119B (en) Fault diagnosis method, fault diagnosis device, electronic device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination