CN113902563A

CN113902563A - Method, device, computer equipment and medium for updating tree model by equivalent interval

Info

Publication number: CN113902563A
Application number: CN202111067120.XA
Authority: CN
Inventors: 孙鹏
Original assignee: Nanjing Xingyun Digital Technology Co Ltd
Current assignee: Nanjing Xingyun Digital Technology Co Ltd
Priority date: 2021-09-13
Filing date: 2021-09-13
Publication date: 2022-01-07

Abstract

The application relates to a method, a device, computer equipment and a medium for updating a tree model by equivalent interval. The method comprises: obtaining a test set, wherein the test set comprises customer characteristic data of each customer in a current customer group; obtaining an equivalent interval table corresponding to a target tree model; calculating, according to the equivalent interval table, the value of a customer group displacement index corresponding to the test set; and verifying whether the customer group displacement index is related to the prediction accuracy of the target tree model and, if so, monitoring the value of the customer group displacement index and updating the target tree model when the monitored value exceeds a preset customer group displacement threshold. With this method, changes in a plurality of customer characteristics can be converted into a single, concise measurement value for judging whether the customer group characteristics have undergone a displacement related to the model prediction accuracy, so that the tree model can be updated in time and its prediction accuracy is ensured.

Description

Method, device, computer equipment and medium for updating tree model by equivalent interval

Technical Field

The present application relates to the field of model updating technologies, and in particular, to a method, an apparatus, a computer device, and a medium for updating a tree model in an equivalent interval.

Background

The field of financial anti-fraud often employs decision-tree-based machine learning models (i.e., tree models) to predict the fraud risk of customers. Over time, however, the prediction accuracy of any tree model on a new customer population inevitably degrades, for two main reasons: first, new customer groups that the tree model has never seen appear in the feature distribution of the customer population; second, the feature distribution of the customer population is stable, but the correspondence between customer features and labels changes. Degradation from the second cause can only be detected by observing customer labels, which usually become known only 1-3 months after the tree model gives its prediction score (for example, a customer's overdue-payment label in the field of financial anti-fraud). Degradation from the first cause, by contrast, can be identified in advance by detecting changes in the customer population, so that model retraining can be triggered early to improve prediction accuracy without waiting for labels.

The following methods are currently used to measure changes in customer features and prediction scores. The first method calculates the population stability index (PSI) of each feature. The second method calculates the PSI of important features, where important features are the features that contribute most to model accuracy. The third method directly measures the change in the prediction score, for example by calculating the PSI of the prediction score or by using a resampling method to test whether the prediction score has risen or fallen significantly. The first method can only measure the change of a single feature at a time, cannot measure the displacement of the whole customer group simply and intuitively, and ignores the influence of features on the prediction score (a change in a feature does not necessarily change the prediction score). The second method is related to the prediction score, but the overall customer group displacement it detects may still be unrelated to the prediction score; moreover, a model usually has more than one important feature, so the calculation is complex. The third method can detect changes in the prediction score, but such a change does not necessarily mean that a new customer group has appeared; it may simply result from an already identified customer group (such as a high-risk group) appearing in large numbers during a certain period. A change in the prediction score is therefore neither a sufficient nor a necessary indication of a change in model accuracy or of the appearance of a new customer group.

Disclosure of Invention

Therefore, it is necessary to provide a method, an apparatus, a computer device, and a medium for updating a tree model in an equivalent interval, which can convert a large amount of client feature changes into a single and concise measurement value to determine whether a client group feature has a displacement related to model prediction accuracy, so as to update the tree model in time and ensure the prediction accuracy of the tree model.

A first aspect of the present application provides a method for updating a tree model by an equivalent interval, where the method includes:

obtaining a test set, wherein the test set comprises customer characteristic data of each customer in a current customer group, and the customer characteristic data of each customer comprises values corresponding to a plurality of customer characteristics of the customer;

acquiring an equivalent interval table corresponding to a target tree model, wherein the equivalent interval table comprises all equivalent intervals determined based on a training set of the target tree model and a risk prediction score uniquely corresponding to each equivalent interval, and each equivalent interval consists of a value range corresponding to each client feature;

calculating, according to the equivalent interval table, a value of a customer group displacement index corresponding to the test set, wherein the customer group displacement index measures the displacement of the current customer group, relative to the customer group in the training set, that occurs on the customer characteristics and is related to the risk prediction score;

verifying whether the customer group displacement index is related to the prediction accuracy of the target tree model, and if so, monitoring the value of the customer group displacement index and updating the target tree model when the monitored value exceeds a preset customer group displacement threshold.
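The final monitoring step can be sketched as a simple threshold check. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function name `monitor_and_update`, the `retrain` callback, the sample index values and the threshold 0.2 are all hypothetical.

```python
from typing import Callable, Iterable


def monitor_and_update(
    index_values: Iterable[float],
    threshold: float,
    retrain: Callable[[], None],
) -> bool:
    """Call `retrain` the first time a monitored index value exceeds `threshold`.

    Returns True if an update was triggered, False otherwise.
    """
    for value in index_values:
        if value > threshold:
            retrain()  # update (retrain) the target tree model
            return True
    return False


# Hypothetical index values from successive test sets, for illustration only.
triggered = []
updated = monitor_and_update(
    [0.02, 0.05, 0.31],
    threshold=0.2,
    retrain=lambda: triggered.append("retrain"),
)
```

In this sketch the third value exceeds the threshold, so the callback runs once and monitoring stops.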

In some embodiments, before obtaining the equivalent interval table corresponding to the target tree model, the method includes:

acquiring a segmentation value of each client feature appearing in a training set of a target tree model;

determining all equivalent intervals corresponding to the target tree model according to the segmentation value of each client feature;

and establishing an equivalent interval table corresponding to the target tree model according to all equivalent intervals and the risk prediction scores uniquely corresponding to each equivalent interval.

In some embodiments, determining all equivalent intervals corresponding to the target tree model according to the segmentation value of each client feature includes:

determining a value range corresponding to each client characteristic according to the segmentation value of each client characteristic, wherein the number of the value ranges corresponding to each client characteristic is multiple;

and determining all equivalent intervals corresponding to the target tree model according to the value range corresponding to each client feature, wherein the number of all equivalent intervals is equal to the product of the number of the value ranges corresponding to all client features, and the value ranges included in any two equivalent intervals in all equivalent intervals are not identical.

In some embodiments, determining a value range corresponding to each client feature according to the segmentation value of each client feature includes:

deduplicating the segmentation values of each customer feature to obtain the target segmentation values of each customer feature, wherein the number of target segmentation values of the k-th customer feature among all the customer features is Q_k;

when the target tree model is a tree model that cannot process null values, dividing (-∞, +∞) into a plurality of value ranges according to the target segmentation values of each customer feature, and taking the divided value ranges as the value ranges corresponding to each customer feature, wherein the number of value ranges corresponding to the k-th customer feature among all the customer features is Q_k + 1;

when the target tree model is a tree model capable of processing null values, dividing (-∞, +∞) into a plurality of value ranges according to the target segmentation values of each customer feature, setting a value range for representing the null value, and taking the divided value ranges together with the value range for representing the null value as the value ranges corresponding to each customer feature, wherein the number of value ranges corresponding to the k-th customer feature among all the customer features is Q_k + 2.

In some embodiments, calculating the value of the customer group displacement index corresponding to the test set according to the equivalent interval table includes:

determining the total number N of the clients of the current client group;

determining the number M of new customers in the current customer group, wherein the value of part or all of the customer characteristics of any new customer does not belong to any equivalent interval in the equivalent interval table;

and calculating the value of the customer group displacement index corresponding to the test set according to the number M of new customers and the total number N of customers, wherein the value of the customer group displacement index is M/N.

In some embodiments, verifying whether the customer group displacement index is related to the prediction accuracy of the target tree model comprises:

performing risk prediction according to the client characteristic data of each client in the current client group included in the test set through the target tree model to obtain a risk prediction score of each client;

acquiring an actual risk label of each client, and acquiring the prediction accuracy of the target tree model for the test set according to the actual risk label and the risk prediction score of each client;

and verifying whether the customer group displacement index is negatively correlated with the prediction accuracy; if so, judging that the customer group displacement index is related to the prediction accuracy of the target tree model, and if not, judging that the customer group displacement index is not related to the prediction accuracy of the target tree model.
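One way such a negative-correlation check could be carried out is sketched below. The text does not name a statistic, so the use of the Pearson correlation coefficient over several test sets is an assumption, as are the function names, the cutoff `max_corr` and the sample numbers.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def is_negatively_correlated(index_values, accuracies, max_corr=-0.5):
    """Assumed criterion: correlation at or below a negative cutoff."""
    return pearson(index_values, accuracies) <= max_corr


# Hypothetical values: one (index, accuracy) pair per labelled test set.
index_values = [0.05, 0.10, 0.20, 0.40]
accuracies = [0.80, 0.78, 0.74, 0.65]
related = is_negatively_correlated(index_values, accuracies)
```

A stricter check (for example a significance test) could replace the fixed cutoff; the cutoff is used here only to keep the sketch short.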

In some of these embodiments, the method further comprises: when configuring the target tree model, setting the number of trees in the target tree model not to exceed a preset number threshold; and/or setting the depth of every tree in the target tree model not to exceed a preset depth threshold; and/or setting the numerical precision of the segmentation values of each customer feature not to exceed a preset number of decimal places.

A second aspect of the present application provides an apparatus for updating a tree model in an equivalent interval, the apparatus comprising:

the test set acquisition module is used for acquiring a test set, wherein the test set comprises client characteristic data of each client in a current client group, and the client characteristic data of each client comprises values corresponding to a plurality of client characteristics of the client;

the equivalent interval table acquisition module is used for acquiring an equivalent interval table corresponding to the target tree model, the equivalent interval table comprises all equivalent intervals determined based on a training set of the target tree model and a risk prediction score uniquely corresponding to each equivalent interval, and each equivalent interval consists of a value range corresponding to each client feature;

the customer group displacement index calculation module is used for calculating, according to the equivalent interval table, a value of a customer group displacement index corresponding to the test set, wherein the customer group displacement index measures the displacement of the current customer group, relative to the customer group in the training set, that occurs on the customer characteristics and is related to the risk prediction score;

and the tree model updating module is used for verifying whether the customer group displacement index is related to the prediction accuracy of the target tree model, monitoring the value of the customer group displacement index if it is, and updating the target tree model when the monitored value exceeds a preset customer group displacement threshold.

A third aspect of the application provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the embodiments of the method when executing the computer program.

A fourth aspect of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of an embodiment of any of the methods described above.

With the above method, device, computer equipment and medium for updating a tree model by equivalent interval, a test set is obtained, the test set comprising customer characteristic data of each customer in a current customer group; an equivalent interval table corresponding to the target tree model is obtained, the equivalent interval table comprising all equivalent intervals determined based on a training set of the target tree model and a risk prediction score uniquely corresponding to each equivalent interval; the value of the customer group displacement index corresponding to the test set is calculated according to the equivalent interval table; and whether the customer group displacement index is related to the prediction accuracy of the target tree model is verified and, if so, the value of the index is monitored and the target tree model is updated when the monitored value exceeds a preset customer group displacement threshold.
Based on the concept of equivalent intervals, which is specific to tree models, the method converts a large number of customer feature changes into a single, concise measurement value for judging whether the customer group features have undergone a displacement related to the model prediction accuracy. For example, in the field of financial anti-fraud, a label (such as an overdue label) usually becomes available only some time (for example, 1-3 months or even longer) after the tree model gives its prediction score. Conventionally, one must wait for the label, check against it whether the prediction accuracy of the tree model has declined, and retrain the tree model if it has. The method provided by the embodiments of the invention can instead detect in advance whether the tree model shows a decay signal (that is, a decline in prediction accuracy); if it does, model retraining can be triggered without waiting for labels, so that the tree model is updated in time and its prediction accuracy is ensured.

Drawings

FIG. 1 is a diagram of an exemplary application environment for a method for updating a tree model for equivalent intervals;

FIG. 2 is a flowchart illustrating a method for updating a tree model for equivalent intervals according to an embodiment;

FIG. 3 is a flowchart illustrating the steps of creating an equivalence interval table in one embodiment;

FIG. 4 is a flowchart illustrating the steps of determining an equivalence interval in one embodiment;

FIG. 5 is a first tree of an example of a decision tree model in one embodiment;

FIG. 6 is a second tree of an example of a decision tree model in one embodiment;

FIG. 7 is a flowchart illustrating the steps of calculating a customer group displacement index in one embodiment;

FIG. 8 is a graph illustrating an example of the relationship between the customer group displacement index and prediction accuracy in one embodiment;

FIG. 9 is a graph illustrating another example of the relationship between the customer group displacement index and prediction accuracy in one embodiment;

FIG. 10 is a block diagram showing an example of an apparatus for updating a tree model of an equivalent interval in one embodiment;

FIG. 11 is a diagram illustrating an internal structure of a computer device in one embodiment.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

Example one

The method for updating a tree model by equivalent interval can be applied to the application environment shown in FIG. 1. The risk prediction system 102 communicates with a plurality of terminals 104 through a network, and a target tree model for performing risk prediction on customers can run in the risk prediction system 102. The risk prediction system 102 can obtain customer feature data of each customer in a current customer group from the plurality of terminals 104, use the customer feature data as a test set, and obtain an equivalent interval table corresponding to the target tree model. The equivalent interval table may be a hash mapping table that is pre-calculated and stored in a memory of the risk prediction system 102 or on a remote storage server; it includes all equivalent intervals determined based on a training set of the target tree model and a risk prediction score uniquely corresponding to each equivalent interval, and each equivalent interval is composed of one value range for each customer feature. The risk prediction system 102 calculates, according to the equivalent interval table, a value of a customer group displacement index corresponding to the test set, where the customer group displacement index measures the customer group displacement related to the risk prediction score. The system then verifies whether the customer group displacement index is related to the prediction accuracy of the target tree model and, if so, monitors the value of the index and updates the target tree model when the monitored value exceeds a preset customer group displacement threshold.
The risk prediction system 102 may be implemented by an independent server or a server cluster composed of a plurality of servers, and the terminal 104 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.

In the present embodiment, as shown in fig. 2, a method for updating a tree model of an equivalent interval is provided, which is described by taking the method as an example applied to the risk prediction system 102 in fig. 1, and includes the following steps:

Step 100, obtaining a test set, where the test set includes customer feature data of each customer in a current customer group, and the customer feature data of each customer includes values corresponding to a plurality of customer features of the customer.

The test set may be regarded as the customer characteristic data set of the current customer group, and the current customer group may be all customers who conducted business transactions within a selected time period. The selected time period may run from a certain past time point to the present, such as the last several months, weeks or days, or may be a period in the past, such as March 2021. In this step, customer characteristics generally refer to customer attribute information that influences the risk prediction result for the customer, such as age, income level, years of employment or occupation type. Each customer generally has a plurality of customer characteristics and corresponds to one piece of customer characteristic data, so each piece of customer characteristic data includes a value for each customer characteristic of that customer.

Specifically, risk prediction system 102 obtains customer characteristic data for each customer in the current customer population as a test set. In order to ensure that the test set has sufficient data volume when the method is implemented, customer characteristic data within at least one month is generally selected as the test set.

Step 200, acquiring an equivalent interval table corresponding to the target tree model, wherein the equivalent interval table comprises all equivalent intervals determined based on a training set of the target tree model and a risk prediction score uniquely corresponding to each equivalent interval, and each equivalent interval is composed of a value range corresponding to each client feature.

Wherein the target tree model is a decision tree model being used by the risk prediction system 102, and the decision tree model is used for risk prediction of the client; under different application scenarios, the risk prediction system 102 may use different decision tree models according to different requirements, and the equivalent interval tables corresponding to the different decision tree models generally have different contents; the training set is client characteristic data used in training the target tree model and comprises client characteristic data of a large number of clients; all the equivalent intervals included in the equivalent interval table are determined based on the training set, each equivalent interval uniquely corresponds to one risk prediction score, and the risk prediction scores corresponding to any two equivalent intervals in all the equivalent intervals are different.

In some embodiments, as shown in fig. 3, before obtaining the equivalent interval table corresponding to the target tree model, the method further includes a step of creating the equivalent interval table, specifically including the following steps:

Step 210, obtaining a segmentation value of each client feature appearing in the training set of the target tree model.

The training set of the target tree model contains a plurality of client features. By the nature of tree models, the target tree model has a plurality of segmentation nodes, each of which uses one client feature appearing in the training set together with a segmentation value of that feature.

Specifically, the risk prediction system 102 may collect, for each segmentation node in the target tree model, the client feature used and its segmentation value. Assume the client features of the target tree model are F₁, F₂, ..., F_M, where M represents the number of client features. The segmentation values of each client feature appearing across all segmentation nodes are deduplicated to obtain the distinct segmentation values of that feature, hereinafter referred to as the target segmentation values of the client feature. For the k-th client feature they are written as

s_{k,1}, s_{k,2}, ..., s_{k,Q_k}

where Q_k represents the number of distinct segmentation values that occur for the k-th client feature.
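The collection and deduplication of segmentation values can be sketched as follows. The nested-dictionary tree representation, the helper names `leaf`, `split` and `collect_split_values`, and the two toy trees are hypothetical; they are not the trees of FIG. 5 and FIG. 6 and do not reflect any particular library's model format.

```python
from collections import defaultdict


def leaf(score):
    """Hypothetical leaf node carrying a leaf score."""
    return {"leaf": score}


def split(feature, value, left, right):
    """Hypothetical segmentation node: one feature, one segmentation value."""
    return {"feature": feature, "split": value, "left": left, "right": right}


def collect_split_values(trees):
    """Return {feature: sorted distinct segmentation values} over all nodes."""
    found = defaultdict(set)          # a set removes duplicate values
    stack = list(trees)
    while stack:
        node = stack.pop()
        if "leaf" in node:
            continue
        found[node["feature"]].add(node["split"])
        stack.append(node["left"])
        stack.append(node["right"])
    return {feature: sorted(values) for feature, values in found.items()}


# Two toy trees; F1 is split at 3 twice, so deduplication matters.
tree_a = split("F1", 3, split("F2", 1.5, leaf(0.051), leaf(0.2)), leaf(0.4))
tree_b = split("F1", 1, leaf(0.0),
               split("F1", 3, split("F3", 2, leaf(0.1), leaf(0.3)), leaf(0.5)))

target_splits = collect_split_values([tree_a, tree_b])
q = {feature: len(values) for feature, values in target_splits.items()}
```

Here `q[feature]` plays the role of Q_k, the number of distinct segmentation values of a feature.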

Step 220, determining all equivalent intervals corresponding to the target tree model according to the segmentation value of each client feature, wherein each equivalent interval uniquely corresponds to one risk prediction score.

In some embodiments, as shown in fig. 4, step 220 comprises the steps of:

Step 2201, determining a value range corresponding to each client feature according to the segmentation value of each client feature, wherein the number of the value ranges corresponding to each client feature is multiple.

Specifically, step 2201 includes the steps of:

deduplicating the segmentation values of each client feature to obtain the target segmentation values of each client feature; the number of target segmentation values of the k-th client feature among all the client features is Q_k.

When the target tree model is a tree model that cannot process null values, (-∞, +∞) is divided into a plurality of value ranges according to the target segmentation values of each client feature, and the divided value ranges are taken as the value ranges corresponding to each client feature; the number of value ranges corresponding to the k-th client feature among all the client features is Q_k + 1.

Continuing the above example, assume that the target segmentation values of the k-th client feature, sorted from small to large, are

s_{k,1} < s_{k,2} < ... < s_{k,Q_k}

The value ranges corresponding to the k-th client feature can then be constructed from these Q_k target segmentation values:

(-∞, s_{k,1}), (s_{k,1}, s_{k,2}), ..., (s_{k,Q_k-1}, s_{k,Q_k}), (s_{k,Q_k}, +∞)

Because the target tree model adopted here cannot process null values, no value range representing null values is included, and the number of value ranges corresponding to the k-th client feature is Q_k + 1.

In practical applications, many tree model algorithms can process null values automatically, so the value of a client feature may be either a specific numerical value or a null value. Because a null value may be routed in different directions at different segmentation nodes, a separate value range must additionally be set for null values; that is, the null value forms a value range of its own. In some embodiments, when the target tree model is a tree model capable of processing null values, (-∞, +∞) is divided into a plurality of value ranges according to the target segmentation values of each client feature, a value range representing the null value is set, and the divided value ranges together with the value range representing the null value are taken as the value ranges corresponding to each client feature; the number of value ranges corresponding to the k-th client feature among all the client features is Q_k + 2.
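The construction of value ranges for a single feature, with and without a null-value range, can be sketched as below. The function name `value_ranges` and the `NULL_RANGE` marker are hypothetical; the open intervals rely on the assumption that no feature value coincides with a segmentation value.

```python
import math

NULL_RANGE = "null"  # hypothetical marker for the separate null-value range


def value_ranges(split_values, handles_null):
    """Return the value ranges of one feature built from its segmentation values."""
    cuts = sorted(set(split_values))             # Q_k target segmentation values
    bounds = [-math.inf, *cuts, math.inf]
    ranges = list(zip(bounds[:-1], bounds[1:]))  # Q_k + 1 open intervals
    if handles_null:
        ranges.append(NULL_RANGE)                # Q_k + 2 ranges in total
    return ranges


# Segmentation values 1, 3 and 3.1 (with 3 repeated to show deduplication).
f1_no_null = value_ranges([1, 3, 3.1, 3], handles_null=False)
f1_with_null = value_ranges([1, 3, 3.1, 3], handles_null=True)
```

With three distinct segmentation values the first call yields four ranges and the second yields five, matching Q_k + 1 and Q_k + 2.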

Step 2202, determining all equivalent intervals corresponding to the target tree model according to the value range corresponding to each client feature, wherein the number of all equivalent intervals is equal to the product of the numbers of the value ranges corresponding to all client features, and the value ranges included in any two equivalent intervals in all equivalent intervals are not completely the same.

The determination of the equivalent intervals is described in detail with the following example. In the example decision tree model shown in FIG. 5 and FIG. 6, the model uses two trees with three customer features F1, F2 and F3 in total, that is, M = 3. The two trees contain 10 segmentation nodes in total. After the segmentation values of each customer feature are deduplicated and sorted, the distinct segmentation values of F1 are 1, 3 and 3.1; those of F2 are 0.5, 1.5 and 2.5; and those of F3 are 0, 2 and 4.

It is assumed that the above segmentation values do not occur as feature values in any client feature data in any possible data set (training set or test set). For example, if the values of a client feature are integers, choosing fractional segmentation values such as 0.5, 1.5 and 2.5 ensures that the client feature data in the data set never coincides with a segmentation value. If the values of a client feature are fractional, giving the segmentation values sufficiently many decimal places achieves the same. In actual implementation, many tree models, such as an xgboost model, can automatically ensure that the segmentation values do not appear in the data set. Under this assumption, for any piece of client feature data, once the value range of each client feature is determined, the risk prediction score output by the target tree model is fixed regardless of the specific value of each client feature.

Taking the decision tree model shown in FIG. 5 and FIG. 6 as an example, when the values of F1, F2 and F3 of a piece of client feature data fall in the value range (1,3) of F1, the value range (0.5,1.5) of F2 and the value range (0,2) of F3 respectively, the risk prediction score of the client is uniquely determined regardless of the specific values of F1, F2 and F3. Once the value range of each client feature is determined, no further segmentation value lies inside these ranges that could send the prediction down different branches at any segmentation node; therefore, when the value ranges of all client features of a client are determined, the prediction reaches a uniquely determined leaf node in each tree. For example, if a piece of client feature data necessarily reaches the leaf node with score 0.051 in the first tree and the leaf node with score 0 in the second tree, then, applying the logit transform used under xgboost, the average prediction probability of the two trees is e^{(0.051+0)/2} / (1 + e^{(0.051+0)/2}) = 0.5067496 (i.e., the risk prediction score).

In summary, a tree model divides the feature space into a finite number of parts, each of which corresponds to a unique risk prediction score. In the example shown in FIG. 5 and FIG. 6, the two trees divide the feature space into 4 × 4 × 4 = 64 equivalent intervals determined by the value ranges of the three features, and each equivalent interval corresponds to a unique risk prediction score.
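The count of 64 can be reproduced by enumerating the Cartesian product of the per-feature value ranges. This is a minimal sketch that uses the segmentation values quoted for F1, F2 and F3 and assumes a model that does not process null values (so each feature has Q_k + 1 ranges); the names `SPLITS` and `open_ranges` are hypothetical.

```python
import itertools
import math

# Distinct segmentation values of the three features in the example.
SPLITS = {
    "F1": [1, 3, 3.1],
    "F2": [0.5, 1.5, 2.5],
    "F3": [0, 2, 4],
}


def open_ranges(cuts):
    """Split (-inf, +inf) at the sorted segmentation values."""
    bounds = [-math.inf, *sorted(cuts), math.inf]
    return list(zip(bounds[:-1], bounds[1:]))


per_feature = [open_ranges(SPLITS[name]) for name in SPLITS]

# Each equivalent interval picks exactly one value range per feature.
equivalent_intervals = list(itertools.product(*per_feature))
count = len(equivalent_intervals)  # product of the per-feature range counts
```

Each element of `equivalent_intervals` is a tuple of three value ranges, one per feature, for example `((1, 3), (0.5, 1.5), (0, 2))`.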

Step 230, creating an equivalent interval table corresponding to the target tree model according to all the equivalent intervals and the risk prediction scores uniquely corresponding to each equivalent interval.

The equivalent interval table can adopt a hash mapping table form, so that reading and calculation are facilitated, and the overall data processing efficiency can be improved.

Specifically, the segmentation nodes, segmentation rules and leaf node scores of the tree model are all determined by the training set. Even if the trained tree model places every piece of client feature data in the training set in a different equivalent interval, the number of equivalent intervals obtained based on the training set cannot exceed Nt, where Nt is the number of samples in the training set. Let R denote the number of equivalent intervals obtained; R is usually less than Nt. A hash mapping table with R keys can therefore be created, where the R keys represent the R equivalent intervals and each equivalent interval corresponds to a unique, fixed risk prediction score. For example, for the model shown in FIG. 5 and FIG. 6, when the values of F1, F2 and F3 of a piece of client feature data fall in the value range (1,3) of F1, the value range (0.5,1.5) of F2 and the value range (0,2) of F3 respectively, the key in the hash mapping table is "(1, 3) × (0.5,1.5) × (0, 2)" and the corresponding value is the risk prediction score 0.5067496.
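A hash mapping table of this kind can be sketched with a Python dictionary keyed by tuples of value ranges. The function names, the sample records and the placeholder score 0.5 are hypothetical; only the segmentation values and the score 0.5067496 for the interval (1,3) × (0.5,1.5) × (0,2) are taken from the example. The sketch assumes that training records arrive already scored by the model and that no feature value equals a segmentation value.

```python
import bisect
import math

SPLITS = {"F1": [1, 3, 3.1], "F2": [0.5, 1.5, 2.5], "F3": [0, 2, 4]}


def interval_key(record, splits=SPLITS):
    """Map a record {feature: value} to the tuple of value ranges it falls in."""
    key = []
    for feature, cuts in splits.items():       # cuts must be sorted
        bounds = [-math.inf, *cuts, math.inf]
        i = bisect.bisect_left(cuts, record[feature])
        key.append((bounds[i], bounds[i + 1]))
    return tuple(key)


def build_interval_table(scored_training_set, splits=SPLITS):
    """Build {equivalent interval: risk prediction score} from scored records."""
    table = {}
    for record, score in scored_training_set:
        table.setdefault(interval_key(record, splits), score)
    return table


scored_training_set = [
    ({"F1": 2, "F2": 1, "F3": 1}, 0.5067496),        # interval from the example
    ({"F1": 2.5, "F2": 1.2, "F3": 0.5}, 0.5067496),  # same interval, same score
    ({"F1": 5, "F2": 3, "F3": 1}, 0.5),              # placeholder score
]
table = build_interval_table(scored_training_set)
```

Three training records yield R = 2 keys here, illustrating that R does not exceed the number of training samples.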

Step 300, calculating the value of the customer group displacement index corresponding to the test set according to the equivalent interval table.

The customer group displacement index measures the displacement, related to the risk prediction score, of the current customer group on each customer characteristic relative to the customer group in the training set. The value of the customer group displacement index corresponding to any test set is thus a single, concise measure of the change in customer group characteristics that is related to the prediction score.

Specifically, risk prediction system 102 may calculate, based on the hash mapping table, what proportion of the customer characteristic data in the test set has a key in the hash mapping table and what proportion falls in new equivalent intervals.

In some embodiments, as shown in fig. 7, step 300 comprises the steps of:

step 310, determining the total number N of the clients of the current client group;

step 320, determining the number M of new clients in the current client group, wherein the value of part or all of the client characteristics of any new client does not belong to any equivalent interval in the equivalent interval table;

and step 330, calculating the value of the customer group displacement index corresponding to the test set according to the number M of the new customers and the total number N of the customers, wherein the value of the customer group displacement index is M/N.

Specifically, the customer group displacement index is set as P_new = M/N, where N is the total number of customers of the current customer group, that is, the number of samples in the test set, and M is the number of customers among the N customers whose customer feature data does not belong to any equivalent interval that has appeared in the training set. Here, a customer whose customer feature data does not belong to any equivalent interval in the equivalent interval table is regarded as a new customer. If P_new is 0, every customer in the current customer group belongs to an equivalent interval that has already appeared in the training set; if P_new is 1, no customer in the current customer group belongs to any equivalent interval covered by the training set.
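
The calculation of M/N can be sketched as follows. The sketch assumes that each customer's equivalent interval has already been encoded as a hashable key (here, a tuple of hypothetical range indices) and that the equivalent interval table is a Python dict keyed the same way; the function name and the sample data are illustrative only.

```python
def customer_group_displacement(test_keys, interval_table):
    """Return M / N for the equivalent-interval keys of a test set."""
    n = len(test_keys)  # N: total number of customers in the test set
    if n == 0:
        raise ValueError("the test set must contain at least one customer")
    # M: customers whose equivalent interval is absent from the table
    m = sum(1 for key in test_keys if key not in interval_table)
    return m / n

# Hypothetical table (interval key -> risk prediction score) and test keys.
interval_table = {(1, 1, 1): 0.2, (2, 1, 1): 0.7}
test_keys = [(1, 1, 1), (2, 1, 1), (0, 1, 1), (1, 2, 1)]
p_new = customer_group_displacement(test_keys, interval_table)
print(p_new)  # 2 new customers out of 4 -> 0.5
```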

Step 400, verifying whether the customer group displacement index is related to the prediction accuracy of the target tree model, if so, monitoring the value of the customer group displacement index, and updating the target tree model when the monitored value of the customer group displacement index exceeds a preset customer group displacement threshold.

The customer group displacement threshold may be preset according to actual needs. When the value of the customer group displacement index exceeds the preset customer group displacement threshold, the prediction accuracy of the target tree model can be considered to be lower than expected; the target tree model is then retrained, thereby updating the target tree model.
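
As an illustration of the monitoring described above, the sketch below scans a series of monitored index values and reports the first period in which the preset threshold is exceeded. The threshold of 0.2, the monthly values, and the function name are assumptions made for this example; the application does not prescribe specific values.

```python
def first_period_exceeding(values_by_period, threshold):
    """Return the first period whose index value exceeds the threshold, or None."""
    for period, value in values_by_period:
        if value > threshold:
            return period
    return None

# Hypothetical monitored values of the customer group displacement index.
monitored = [("month 13", 0.08), ("month 14", 0.11), ("month 15", 0.26)]
trigger = first_period_exceeding(monitored, threshold=0.2)
if trigger is not None:
    print("retrain the target tree model, triggered in", trigger)
```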

In some embodiments, step 400 includes a step of verifying whether the customer group displacement index is related to the prediction accuracy of the target tree model, which specifically includes the following steps:

and performing risk prediction through the target tree model according to the client characteristic data of each client in the current client group included in the test set to obtain a risk prediction score of each client.

And acquiring the actual risk label of each client, and acquiring the prediction accuracy of the target tree model for the test set according to the actual risk label and the risk prediction score of each client.

The following describes in detail the process of verifying the correlation between the customer group displacement index and the prediction accuracy, taking two groups of data as examples.

In the first group, the training set consists of customer feature data from six months (about 300,000 records), and the test set consists of customer feature data from another six months (about 300,000 records); the starting month of the test set and the ending month of the training set are separated by 1 year, and the model is built with xgboost (a gradient boosting model) on 25 customer features. As shown in the table of fig. 8, the ROC (receiver operating characteristic) value is significantly low in the fourth month and the sixth month of the test set (i.e., the 16th month and the 18th month after the ending month of the training set), and P_new indeed reflects that the proportion of customer feature data not seen in the training set is high in these two months. Since P_new can be obtained before the actual risk labels, P_new serves as an early warning of model attenuation in this case.

In the second group, the training set consists of customer feature data from six months (about 3,000,000 records), and the test set consists of customer feature data from another six months (about 3,000,000 records); the starting month of the test set and the ending month of the training set are separated by 1 year, and the model is built with xgboost on 45 customer features. As shown in the table of fig. 9, the model shows more severe attenuation from the 16th month after the ending month of the training set, and P_new also detects the increase in the proportion of new customers (except in the 17th month). This indicates that, although P_new cannot predict model attenuation with complete certainty, an increase in P_new still serves as an early warning.

In summary, if P_new is related to the prediction accuracy of the target tree model, P_new can be used as an attenuation early-warning signal of the target tree model, that is, a signal indicating that the prediction accuracy of the target tree model is decreasing, so that attenuation of the target tree model can be detected in advance, where "in advance" means before the labels are obtained.
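
One possible way to check such a relationship is sketched below. The application does not specify a statistical criterion here, so the use of the Pearson correlation coefficient, the cutoff of -0.5, and the function names are all assumptions, and the monthly values are made-up illustration data rather than the data of fig. 8 or fig. 9.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (norm_x * norm_y)

def is_negatively_related(index_values, accuracy_values, cutoff=-0.5):
    """Treat the index as related to accuracy if the correlation is at most `cutoff`."""
    return pearson(index_values, accuracy_values) <= cutoff

# Made-up monthly values: displacement index and prediction accuracy.
p_new_by_month = [0.10, 0.12, 0.11, 0.30, 0.14, 0.35]
accuracy_by_month = [0.72, 0.71, 0.72, 0.60, 0.70, 0.58]
print(is_negatively_related(p_new_by_month, accuracy_by_month))
```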

It should be noted that when applying the customer group displacement index P_new, attention should be paid to controlling the complexity of the target tree model; otherwise P_new may be close to 1 and lose its indicative meaning. Therefore, in some embodiments, the method further includes a step of setting the target tree model, specifically including:

when the target tree model is set, setting the number of trees in the target tree model to not exceed a preset number threshold (in this way, the number of equivalent intervals can be reduced by controlling the number of trees, which preserves the indicating effect of the customer group displacement index P_new and also improves the calculation efficiency of the target tree model);

and/or setting the depth of all trees in the target tree model to not exceed a preset depth threshold (provided that underfitting of the model is not serious, each tree is set to as small a depth as possible);

and/or setting the digit precision of the segmentation value corresponding to each client feature in all the client features not to exceed a preset number of digits after the decimal point (for example, keeping at most two digits after the decimal point in each segmentation value, so as to reduce the number of equivalent intervals).

In the actual implementation process, the parameters of the target tree model may be manually constrained by setting a preset number threshold, a preset depth threshold, and the like (for example, the depth of every tree does not exceed 8 and the number of trees does not exceed 50), and the parameters are then tuned automatically so that the optimal parameters are selected without exceeding these upper limits.
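
The constraints above can be illustrated with the following sketch. The dictionary keys `n_estimators` and `max_depth` follow names commonly used for gradient-boosting hyperparameters and are an assumption, as are the helper names; rounding the segmentation values after the fact is only one hypothetical way to limit their precision, and the application does not state at which stage the precision is limited.

```python
def cap_parameters(candidate, max_trees=50, max_depth=8):
    """Clamp a candidate parameter set to the preset upper limits."""
    return {
        "n_estimators": min(candidate["n_estimators"], max_trees),
        "max_depth": min(candidate["max_depth"], max_depth),
    }

def round_split_values(split_values, decimals=2):
    """Limit precision and deduplicate, which can merge nearby segmentation values."""
    return sorted({round(value, decimals) for value in split_values})

# A hypothetical candidate proposed by automatic tuning.
print(cap_parameters({"n_estimators": 200, "max_depth": 12}))

# Four hypothetical segmentation values collapse to two after rounding,
# so the feature has three value ranges instead of five.
print(round_split_values([0.5012, 0.5049, 1.4999, 1.5001]))
```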

In the method for updating a tree model by equivalent interval provided in this embodiment, a risk prediction system obtains a test set, where the test set includes client feature data of each client in a current client group; obtains an equivalent interval table corresponding to the target tree model, where the equivalent interval table includes all equivalent intervals determined based on a training set of the target tree model and a risk prediction score uniquely corresponding to each equivalent interval; calculates the value of the customer group displacement index corresponding to the test set according to the equivalent interval table; and verifies whether the customer group displacement index is related to the prediction accuracy of the target tree model and, if so, monitors the value of the customer group displacement index and updates the target tree model when the monitored value exceeds a preset customer group displacement threshold.
Based on the equivalent interval concept specific to tree models, the method converts changes in a large number of client features into a single, concise measurement value for judging whether the customer group features have undergone a displacement related to the prediction accuracy of the model. For example, in the field of financial anti-fraud, a financial anti-fraud label (for example, an overdue label) usually becomes available only some time (for example, 1 to 3 months or even longer) after the tree model gives a prediction score; that is, whether the prediction accuracy of the tree model has decreased can be checked only after the label is obtained, and the tree model is retrained if the accuracy has decreased. The method provided by the embodiment of the invention can detect in advance whether the tree model produces an attenuation signal (that is, a decrease in the prediction accuracy of the tree model); if it does, model retraining can be triggered without waiting for the label, so that the tree model is updated in time and its prediction accuracy is ensured.

It should be understood that although the steps in the flow charts of fig. 2-5 are shown in sequence as indicated by the arrows, the steps are not necessarily performed in that sequence. Unless explicitly stated otherwise, the performance of the steps is not strictly limited to the order shown, and the steps may be performed in other orders. Moreover, at least some of the steps in fig. 2-5 may include multiple sub-steps or stages that are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

Example two

In one embodiment, as shown in fig. 10, there is provided an apparatus for updating a tree model of an equivalent interval, including:

the test set obtaining module 110 is configured to obtain a test set, where the test set includes customer feature data of each customer in a current customer group, and the customer feature data of each customer includes values corresponding to a plurality of customer features of the customer.

The equivalent interval table obtaining module 120 is configured to obtain an equivalent interval table corresponding to the target tree model, where the equivalent interval table includes all equivalent intervals determined based on a training set of the target tree model and a risk prediction score uniquely corresponding to each equivalent interval, and each equivalent interval is composed of a value range corresponding to each client feature.

The customer group displacement index calculation module 130 is configured to calculate a value of a customer group displacement index corresponding to the test set according to the equivalent interval table, where the customer group displacement index is used to measure the displacement, related to the risk prediction score, of the current customer group on each customer feature relative to the customer group in the training set.

The tree model updating module 140 is configured to verify whether the customer group displacement index is related to the prediction accuracy of the target tree model, monitor the value of the customer group displacement index if so, and update the target tree model when the monitored value of the customer group displacement index exceeds a preset customer group displacement threshold.

In some embodiments, the apparatus further includes an equivalent interval table creating module, configured to obtain a segmentation value of each client feature appearing in a training set of the target tree model; determining all equivalent intervals corresponding to the target tree model according to the segmentation value of each client characteristic, wherein each equivalent interval only corresponds to one risk prediction score; and establishing an equivalent interval table corresponding to the target tree model according to all equivalent intervals and the risk prediction scores uniquely corresponding to each equivalent interval.

In some embodiments, the equivalent interval table creating module is specifically configured to determine, according to the segmentation value of each client feature, a value range corresponding to each client feature, where the number of the value ranges corresponding to each client feature is multiple; and determining all equivalent intervals corresponding to the target tree model according to the value range corresponding to each client feature, wherein the number of all equivalent intervals is equal to the product of the number of the value ranges corresponding to all client features, and the value ranges included in any two equivalent intervals in all equivalent intervals are not identical.

In some embodiments, the equivalent interval table creating module is further specifically configured to deduplicate the segmentation values of each client feature to obtain target segmentation values of each client feature, where the number of target segmentation values of the kth client feature among all the client features is Q_k; when the target tree model is a tree model that cannot process null values, divide (-∞, +∞) into a plurality of value ranges according to the target segmentation values of each client feature, and take the divided value ranges as the value ranges corresponding to each client feature, where the number of value ranges corresponding to the kth client feature is Q_k + 1; and when the target tree model is a tree model capable of processing null values, divide (-∞, +∞) into a plurality of value ranges according to the target segmentation values of each client feature, set a value range for representing the null value, and take the divided value ranges together with the value range for representing the null value as the value ranges corresponding to each client feature, where the number of value ranges corresponding to the kth client feature is Q_k + 2.

In some embodiments, the customer group displacement index calculation module 130 is specifically configured to determine the total number N of customers of the current customer group; determine the number M of new customers in the current customer group, wherein the value of part or all of the customer characteristics of any new customer does not belong to any equivalent interval in the equivalent interval table; and calculate the value of the customer group displacement index corresponding to the test set according to the number M of the new customers and the total number N of the customers, wherein the value of the customer group displacement index is M/N.

In some embodiments, the tree model updating module 140 may be further configured to perform risk prediction through the target tree model according to the client feature data of each client in the current client group included in the test set, so as to obtain a risk prediction score of each client; acquire an actual risk label of each client, and obtain the prediction accuracy of the target tree model for the test set according to the actual risk label and the risk prediction score of each client; and verify whether the customer group displacement index is negatively correlated with the prediction accuracy, and if so, determine that the customer group displacement index is related to the prediction accuracy of the target tree model, and if not, determine that the customer group displacement index is not related to the prediction accuracy of the target tree model.

In some embodiments, the apparatus further comprises a tree model setting module, configured to, when setting the target tree model, set the number of trees in the target tree model to not exceed a preset number threshold; and/or set the depth of all trees in the target tree model to not exceed a preset depth threshold; and/or set the digit precision of the segmentation value corresponding to each client feature in all the client features not to exceed a preset number of digits after the decimal point.

For the specific definition of the apparatus for updating the tree model in the equivalent interval, refer to the above definition of the method for updating the tree model in the equivalent interval; details are not described herein again. The modules in the apparatus for updating the tree model of equivalent intervals may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call them and execute the operations corresponding to the modules.

Example three

In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 11. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. The processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program in the non-volatile storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The processor executes the computer program to implement a method for updating a tree model of equivalent intervals as described in the first embodiment. Those skilled in the art will appreciate that the architecture shown in fig. 11 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computer devices to which the disclosed aspects apply; a particular computer device may include more or fewer components than those shown, may combine certain components, or may have a different arrangement of components.

Example four

In this embodiment, a computer-readable storage medium is provided, on which a computer program is stored, and the computer program, when executed by a processor, implements a method for updating a tree model of an equivalent interval as described in the first embodiment.

It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program instructing the relevant hardware; the computer program can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described; however, as long as a combination of these technical features contains no contradiction, it should be considered to be within the scope of this specification.

The above-mentioned embodiments express only several embodiments of the present application, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims

1. A method for updating a tree model for an equivalent interval, the method comprising:

obtaining a test set, wherein the test set comprises customer characteristic data of each customer in a current customer group, and the customer characteristic data of each customer comprises values corresponding to a plurality of customer characteristics of the customer respectively;

obtaining an equivalent interval table corresponding to a target tree model, wherein the equivalent interval table comprises all equivalent intervals determined based on a training set of the target tree model and a risk prediction score uniquely corresponding to each equivalent interval, and each equivalent interval consists of a value range corresponding to each client feature;

calculating to obtain a value of a customer group displacement index corresponding to the test set according to the equivalent interval table, wherein the customer group displacement index is used for measuring the displacement, related to the risk prediction score, of the current customer group on each customer feature relative to the customer group in the training set;

verifying whether the customer group displacement index is related to the prediction accuracy of the target tree model, if so, monitoring the value of the customer group displacement index, and updating the target tree model when the monitored value of the customer group displacement index exceeds a preset customer group displacement threshold.

2. The method of claim 1, wherein before obtaining the equivalent interval table corresponding to the target tree model, the method comprises:

acquiring a segmentation value of each client feature appearing in the training set of the target tree model;

determining all equivalent intervals corresponding to the target tree model according to the segmentation value of each client feature, wherein each equivalent interval corresponds to only one risk prediction score;

and creating an equivalent interval table corresponding to the target tree model according to the all equivalent intervals and the risk prediction scores uniquely corresponding to each equivalent interval.

3. The method according to claim 2, wherein the determining all equivalent intervals corresponding to the target tree model according to the segmentation value of each client feature comprises:

determining a value range corresponding to each client feature according to the segmentation value of each client feature, wherein the number of the value ranges corresponding to each client feature is multiple;

and determining all equivalent intervals corresponding to the target tree model according to the value range corresponding to each client feature, wherein the number of all equivalent intervals is equal to the product of the numbers of the value ranges corresponding to all client features, and the value ranges included in any two equivalent intervals in all equivalent intervals are not identical.

4. The method according to claim 3, wherein the determining a value range corresponding to each client feature according to the segmentation value of each client feature comprises:

deduplicating the segmentation values of each client feature to obtain target segmentation values of each client feature, wherein the number of target segmentation values of the kth client feature in all the client features is Q_k;

when the target tree model is a tree model which cannot process null values, dividing (-∞, +∞) into a plurality of value ranges according to the target segmentation values of each client feature, and taking the divided value ranges as the value ranges corresponding to each client feature, wherein the number of the value ranges corresponding to the kth client feature in all the client features is Q_k + 1;

when the target tree model is a tree model capable of processing null values, dividing (-∞, +∞) into a plurality of value ranges according to the target segmentation values of each client feature, setting a value range for representing the null value, and taking the divided value ranges and the value range for representing the null value as the value ranges corresponding to each client feature, wherein the number of the value ranges corresponding to the kth client feature in all the client features is Q_k + 2.

5. The method according to any one of claims 1 to 4, wherein the calculating to obtain the value of the customer group displacement index corresponding to the test set according to the equivalent interval table includes:

determining the total number N of the clients of the current client group;

determining the number M of new customers in the current customer population, wherein the value of part or all of the customer characteristics of any new customer does not belong to any equivalent interval in the equivalent interval table;

and calculating to obtain the value of the customer group displacement index corresponding to the test set according to the number M of the new customers and the total number N of the customers, wherein the value of the customer group displacement index is M/N.

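6. The method of any one of claims 1 to 4, wherein the verifying whether the customer group displacement index is related to the prediction accuracy of the target tree model comprises:

performing risk prediction through the target tree model according to the customer characteristic data of each customer in the current customer group included in the test set, to obtain a risk prediction score of each customer;

acquiring an actual risk label of each customer, and obtaining the prediction accuracy of the target tree model for the test set according to the actual risk label and the risk prediction score of each customer;

and verifying whether the customer group displacement index is negatively correlated with the prediction accuracy, if so, determining that the customer group displacement index is related to the prediction accuracy of the target tree model, and if not, determining that the customer group displacement index is not related to the prediction accuracy of the target tree model.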

7. The method of claim 2, further comprising:

when the target tree model is set, setting that the number of trees in the target tree model does not exceed a preset number threshold;

and/or setting the depth of all trees in the target tree model not to exceed a preset depth threshold;

and/or setting the digit precision of the segmentation value corresponding to each client feature in all the client features not to exceed a preset number of digits after the decimal point.

8. An apparatus for updating a tree model for equivalent intervals, the apparatus comprising:

a test set obtaining module, configured to obtain a test set, wherein the test set comprises customer characteristic data of each customer in a current customer group, and the customer characteristic data of each customer comprises values respectively corresponding to a plurality of customer characteristics of the customer;

an equivalent interval table obtaining module, configured to obtain an equivalent interval table corresponding to a target tree model, where the equivalent interval table includes all equivalent intervals determined based on a training set of the target tree model and a risk prediction score uniquely corresponding to each equivalent interval, and each equivalent interval is composed of a value range corresponding to each client feature;

a customer group displacement index calculation module, configured to calculate a value of a customer group displacement index corresponding to the test set according to the equivalent interval table, wherein the customer group displacement index is used for measuring the displacement, related to the risk prediction score, of the current customer group on each customer characteristic relative to the customer group in the training set;

and a tree model updating module, configured to verify whether the customer group displacement index is related to the prediction accuracy of the target tree model, monitor the value of the customer group displacement index if so, and update the target tree model when the monitored value of the customer group displacement index exceeds a preset customer group displacement threshold.

9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method of any of claims 1 to 7 are implemented when the computer program is executed by the processor.

10. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 7.