CN114266655A - Wind control model construction method and device based on reinforcement learning - Google Patents

Wind control model construction method and device based on reinforcement learning

Info

Publication number
CN114266655A
CN114266655A
Authority
CN
China
Prior art keywords
return
wind control
reinforcement learning
action
prediction network
Prior art date
Legal status
Pending
Application number
CN202210178571.9A
Other languages
Chinese (zh)
Inventor
王静
董文涛
武靖
Current Assignee
Beijing Weijuzhihui Technology Co ltd
Beijing Weiju Future Technology Co ltd
Original Assignee
Beijing Weijuzhihui Technology Co ltd
Beijing Weiju Future Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Weijuzhihui Technology Co ltd, Beijing Weiju Future Technology Co ltd
Priority to CN202210178571.9A
Publication of CN114266655A

Abstract

The invention discloses a method and a device for constructing a wind control model based on reinforcement learning, belonging to the technical field of personal credit wind control. The method comprises the following steps: receiving a credit request from a user, acquiring the data required for a wind control decision, and verifying the data; processing the data to construct a state vector, and generating a return prediction network from the state vector and a predefined action space in combination with a reinforcement learning model; calculating the predicted return after each action in the action space is implemented, and selecting and implementing the action with the maximum expected return using a preset search strategy; and calculating the real return after the action is implemented according to the user's actual repayment result, and updating the parameters of the return prediction network with return maximization as the goal, based on the real return and the predicted return. The invention aligns the optimization target of the wind control model with the business target and can respond rapidly to industry or market changes.

Description

Wind control model construction method and device based on reinforcement learning
Technical Field
The invention relates to the technical field of personal credit wind control, in particular to a method and a device for constructing a wind control model based on reinforcement learning.
Background
At present, in the field of personal credit wind control, wind control systems have developed from manual review to AI-based automatic review built on big data, consisting mainly of classical machine learning models such as logistic regression and tree models, together with deep learning models. Model training in the prior art usually targets overdue and bad-debt performance: offline training is carried out on a large number of historical orders as samples, and the trained model is deployed online to predict a user's overdue and bad-debt probabilities; the wind control system then gives the user's credit audit result from the model prediction combined with other data and strategies. The mainstream wind control system mainly combines strategies and models to perform fraud identification and credit scoring so as to reduce credit risk. A credit business faces two main risks: fraud risk and credit risk. Based on the above objectives, risk strategies and models need to be constructed; the main process comprises: constructing a credit rating card model (A card) based on anti-fraud signals, credit characteristics, personal information and the like, and determining the user's credit limit, term and interest rate; constructing a behavior model (B card) based on the user's behavior data for in-loan wind control assessment; and setting strategies and thresholds according to risk preference and carrying out back-testing analysis.
However, existing wind control systems generally suffer from the following defects, which greatly limit the wind control effect:
1. Wind control is abstracted as a supervised classification task: the learning target is to minimize overdue and bad-debt risk, the effect evaluation indexes are limited to the overdue rate and bad-debt rate, and iterative optimization of credit limits, interest rates and terms is omitted, so the optimization target of the wind control model deviates from the business target. The most fundamental goal of the business is profit maximization, and because of this target deviation the existing wind control system is not the most direct and efficient route to it.
2. Machine learning typically requires offline training on hundreds of thousands or even tens of millions of samples, and accumulating samples of that magnitude in a real business takes at least several days, so updates to the wind control model are delayed; when the market trend shifts under policy or macro-environment influences, the wind control model cannot respond quickly.
Disclosure of Invention
To solve the problems in conventional credit wind control systems that the optimization target of the wind control model deviates from the business target and that updates to the wind control model are delayed, the invention provides a reinforcement learning-based wind control model construction method, comprising the following steps:
receiving a credit request from a user, acquiring the data required for a wind control decision, and performing admission policy verification;
processing the data required for the wind control decision to construct a state vector, and generating a return prediction network from the state vector and a predefined action space in combination with a reinforcement learning model;
calculating, by the return prediction network, the predicted return after each action in the action space is implemented, and selecting and implementing the action with the maximum expected return using a preset search strategy;
and calculating the real return after the action is implemented according to the user's actual repayment result, and updating the parameters of the return prediction network with return maximization as the goal, based on the real return and the predicted return.
The invention also provides a reinforcement learning-based wind control model construction device, which comprises:
the verification module, used for receiving a user's credit request, acquiring the data required for a wind control decision, and performing admission policy verification;
the generating module, used for processing the data required for the wind control decision, constructing a state vector, and generating a return prediction network from the state vector and a predefined action space in combination with a reinforcement learning model;
the execution module, used for calculating, with the return prediction network, the predicted return after each action in the action space is implemented, and for selecting and implementing the action with the maximum expected return using a preset search strategy;
and the updating module, used for calculating the real return after the action is implemented according to the user's actual repayment result, and for updating the parameters of the return prediction network with return maximization as the goal, based on the real return and the predicted return.
The invention also provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor; when executed by the processor, the computer program implements the following steps:
receiving a credit request from a user, acquiring the data required for a wind control decision, and performing admission policy verification;
processing the data required for the wind control decision to construct a state vector, and generating a return prediction network from the state vector and a predefined action space in combination with a reinforcement learning model;
calculating, by the return prediction network, the predicted return after each action in the action space is implemented, and selecting and implementing the action with the maximum expected return using a preset search strategy;
and calculating the real return after the action is implemented according to the user's actual repayment result, and updating the parameters of the return prediction network with return maximization as the goal, based on the real return and the predicted return.
The invention also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program performs the following steps:
receiving a credit request from a user, acquiring the data required for a wind control decision, and performing admission policy verification;
processing the data required for the wind control decision to construct a state vector, and generating a return prediction network from the state vector and a predefined action space in combination with a reinforcement learning model;
calculating, by the return prediction network, the predicted return after each action in the action space is implemented, and selecting and implementing the action with the maximum expected return using a preset search strategy;
and calculating the real return after the action is implemented according to the user's actual repayment result, and updating the parameters of the return prediction network with return maximization as the goal, based on the real return and the predicted return.
The reinforcement learning-based wind control model construction method and device provided by the invention apply a reinforcement learning wind control model with a profit-maximizing value function. The value function is designed to obtain return feedback suitable for reinforcement learning by calculating every item of income and cost in the credit business; it directly reflects the business goal of profit maximization and is applied to updating model parameters and steering the reinforcement learning process, so that the optimization target of the wind control model is consistent with the business target, forming a more direct and efficient wind control solution. In addition, the method and device can trigger on-demand automatic updating of the return prediction network parameters whenever a user's repayment data is updated, saving the time required for sample accumulation and allowing rapid response to industry or market changes.
Drawings
FIG. 1 is a flowchart of a method for building a reinforcement learning-based wind control model according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a building principle of a reinforcement learning-based wind control model according to an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a reinforcement learning-based wind control model building device according to an embodiment of the present invention;
FIG. 4 is a schematic physical structure diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.
Referring to fig. 1 and fig. 2, the method for building a reinforcement learning-based wind control model according to the embodiment of the present invention includes the following steps:
and step S101, receiving a credit request of a user, acquiring data required by a wind control decision, and verifying an admission strategy.
In specific applications, after a user's credit request is received, the data required for the wind control decision is obtained first, and admission policy verification is then performed on that data to ensure the legal compliance of the loan relationship. The data includes user information and industry environment data. The user information covers loan requirements, repayment capacity, credit assessment, credit history behavior, the current business link and other aspects. The industry environment data describes policy guidance, the economic situation, emergencies and the like, including the general environment and subdivided environments such as the industry and region where the user is located. Admission policy verification includes determining that the borrower has full capacity for civil conduct, that the loan will not be used for illegal or criminal activity, and so on.
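A minimal sketch of such an admission check follows; the field names are illustrative assumptions, not part of the disclosure:

```python
# Minimal admission-policy check; field names are assumptions.
def passes_admission_policy(user: dict) -> bool:
    # The borrower must have full capacity for civil conduct
    # (approximated here by an age check).
    if user.get("age", 0) < 18:
        return False
    # The loan must not fund illegal or criminal activity.
    if user.get("purpose_blacklisted", False):
        return False
    return True
```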
Step S102: processing the user information and industry environment data to construct a state vector.
In reinforcement learning terms, the agent lives in an environment E with state space X, where each state x ∈ X is the agent's perceived description of the environment. In combination with the wind control business scenario, the user information and industry environment data are processed to construct a state vector that serves as the input of the reinforcement learning model in the wind control system.
The method for processing the user information and industry environment data mainly comprises: cleaning the user information and industry environment data, and applying digital standardized coding to the cleaned data. The cleaning process mainly comprises missing-value processing and outlier processing. Missing-value processing is divided into two cases according to whether the missing rate exceeds a threshold: 1) information fields whose missing rate is greater than the threshold are deleted directly; 2) for information fields whose missing rate is less than or equal to the threshold, the missing values are filled with a fixed value, which may be the mean, median or mode of all non-missing values in that field. Outlier processing uses binning to eliminate the influence of outliers; the binning mode may be equal-frequency, equal-width or clustering-based. Digital standardized coding comprises a digitization operation, a standardization operation and an encoding operation. The digitization operation maps character strings or other non-numerical information fields to numerical types so that the reinforcement learning model can read and process them. The standardization operation converts the data into dimensionless index values so that information fields on different scales are comparable; for example, this embodiment standardizes with the formula x' = (x − mean) / σ, where x' is the value after standardization, x is the value before standardization, mean is the mean value, and σ is the standard deviation. The encoding operation comprises ordinal encoding and one-hot encoding: ordinal encoding handles information fields with a size relation, mapping values to serial numbers in ascending or descending order; one-hot encoding handles information fields without a size relation, converting one enumerated information field into multiple binary information fields.
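A minimal preprocessing sketch of the cleaning and encoding steps above, assuming pandas and illustrative column roles (outlier binning is omitted for brevity):

```python
import pandas as pd

def build_state_frame(df: pd.DataFrame, missing_threshold: float = 0.5) -> pd.DataFrame:
    df = df.copy()
    # 1) Delete fields whose missing rate exceeds the threshold.
    df = df.loc[:, df.isna().mean() <= missing_threshold]
    numeric = df.select_dtypes(include="number").columns
    enumerated = df.columns.difference(numeric)
    # 2) Fill remaining missing values: mean for numeric fields, mode otherwise.
    df[numeric] = df[numeric].fillna(df[numeric].mean())
    for col in enumerated:
        df[col] = df[col].fillna(df[col].mode().iloc[0])
    # 3) Standardize numeric fields: x' = (x - mean) / sigma.
    df[numeric] = (df[numeric] - df[numeric].mean()) / df[numeric].std()
    # 4) One-hot encode enumerated fields without a size relation.
    return pd.get_dummies(df, columns=list(enumerated))
```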
Step S103: generating a return prediction network from the state vector and a predefined action space, in combination with a reinforcement learning model.
The actions belong to an action space suited to the wind control service. In reinforcement learning terms, the actions the agent can take constitute an action space A; when some action a ∈ A acts on the current state x, the underlying transition function moves the environment from the current state to another state with some probability. In combination with the wind control business scenario, an action a must determine the user's credit limit, interest rate and term, together with the decision on whether the credit request passes; fixing the values of these variables determines a unique action a, so the action space A is defined as all possible value combinations of the credit limit, interest rate, term, pass decision and other such variables.
An example of the action-space definition for the wind control service according to the embodiment of the present invention: assume a personal credit business scenario with a credit limit range of 30000-50000 yuan, an annual interest rate of 8%-12%, and a selectable term of 6 or 12 months; a simple action space A is then defined as shown in Table 1 below.
[Table 1: example action space - each action is a combination of credit limit (30000-50000 yuan), annual interest rate (8%-12%), term (6 or 12 months) and the pass/reject decision]
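A sketch of how such a discrete action space could be enumerated; the value grids are assumptions consistent with the example above, since the original table is not reproduced here:

```python
from itertools import product

LIMITS = [30000, 40000, 50000]   # credit limit in yuan (assumed grid)
RATES = [0.08, 0.10, 0.12]       # annual interest rate (assumed grid)
TERMS = [6, 12]                  # term in months

# One "reject" action plus every approve combination.
ACTIONS = [("reject", None, None, None)] + [
    ("approve", limit, rate, term)
    for limit, rate, term in product(LIMITS, RATES, TERMS)
]
assert len(ACTIONS) == 1 + 3 * 3 * 2  # 19 actions in total
```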
In specific applications, the return prediction network may adopt a classical neural network model such as Q-Learning, DQN (Deep Q-Network) or Double DQN, or the network structure and parameters may be adjusted on the basis of such a model to meet the personalized requirements of a specific business scenario.
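For illustration, a DQN-style return prediction network might be sketched as follows in PyTorch; the framework and layer sizes are assumptions, not fixed by the disclosure:

```python
import torch
import torch.nn as nn

class ReturnPredictionNetwork(nn.Module):
    """Maps a state vector to one predicted return per discrete action, mirroring Q(s, a)."""

    def __init__(self, state_dim: int, num_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)  # shape: (batch, num_actions)
```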
Step S104: calculating, by the return prediction network, the predicted return after each action in the action space is implemented, selecting and implementing the action with the maximum expected return using a preset search strategy, and returning the credit limit, interest rate, term and pass decision corresponding to that action to the user.
When a user initiates a credit-granting or borrowing request, the reinforcement learning model takes the state vector as input, selects and implements the action with the maximum expected return using a preset search strategy (such as a greedy exploration strategy), and at the same time returns the credit limit, interest rate, term, pass/reject result and so on corresponding to that action to the user, completing the audit process. Note that in a borrowing request the user may already have selected a credit amount or term, making part of the action space unavailable; this does not affect the search over the action space.
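The disclosure names only a greedy exploration strategy; ε-greedy is one common concrete choice, sketched here under that assumption:

```python
import random
import torch

def select_action(q_net, state, valid_actions, epsilon=0.1):
    # valid_actions: indices still selectable (the user may have fixed
    # the amount or term, ruling some actions out).
    if random.random() < epsilon:
        return random.choice(valid_actions)               # explore
    with torch.no_grad():
        q = q_net(state.unsqueeze(0)).squeeze(0)          # predicted returns
    return max(valid_actions, key=lambda a: q[a].item())  # exploit
```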
In reinforcement learning terms, as the state transitions, the environment feeds a return back to the agent according to the underlying value function R(s, a), and the agent's learning goal is to maximize that return. The value function R(s, a) represents the return obtained from state s after action a is performed. In combination with the business scenario, the wind control system is part of the business behavior, whose fundamental goal is profit maximization; the return in reinforcement learning in this embodiment is therefore defined as the annualized profit margin, and the value function R is the annualized profit margin formula. The reinforcement learning model selects and implements the action whose predicted annualized profit margin, as output by the return prediction network, is maximal, generating the real return.
In this embodiment, taking the interest-first, principal-last repayment mode (interest paid monthly, principal repaid at maturity) as an example, the annualized profit margin of one order is calculated as follows: annualized profit margin = [profit margin / term (days)] x 365 days x 100%, where profit margin = profit / disbursed amount; profit = revenue - cost = interest actually received - bad debt - capital cost - data cost - other costs; bad debt = principal due + interest due - amount actually repaid; and capital cost = disbursed amount x term (days) x annualized capital interest rate / 365 days. Revenue comprises the interest actually received; cost comprises bad debt, capital cost, data cost and other costs (such as customer acquisition, server equipment and R&D). The annualized capital interest rate is a fixed value agreed with the contracted funding partner. The data cost is the sum of the costs of all data sources actually called for the order: for a data source billed per call, the cost is its unit price; for a data source billed annually, the per-call cost is estimated by evenly apportioning the annual fee over the annual call volume. When the user's credit request is rejected, the interest actually received, bad debt and capital cost of the order are all zero.
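A worked sketch of the return computation above; the variable names and the bad-debt expression follow the formulas as reconstructed here and should be read as assumptions:

```python
def annualized_profit_margin(disbursed, term_days, interest_received,
                             principal_due, interest_due, amount_repaid,
                             capital_rate, data_cost, other_cost):
    bad_debt = principal_due + interest_due - amount_repaid
    capital_cost = disbursed * term_days * capital_rate / 365
    profit = interest_received - bad_debt - capital_cost - data_cost - other_cost
    profit_margin = profit / disbursed
    return (profit_margin / term_days) * 365 * 100  # percent per year

# Example: 40000 yuan for 365 days at 10% interest, fully repaid on time,
# 6% annualized capital rate, 50 yuan data cost, 100 yuan other cost:
# profit = 4000 - 0 - 2400 - 50 - 100 = 1450, giving roughly 3.63%.
```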
Step S105: after the credit order matures, acquiring the user's actual repayment result, calculating the real return after the action was implemented, and updating the return prediction network parameters in the reinforcement learning model with return maximization as the goal, based on the real return and the predicted return.
The return prediction network updates its parameters with the goal of maximizing the annualized profit margin. The process comprises: assigning the parameters of the return prediction network by random initialization; updating the parameters according to the real and predicted returns of historical credit orders; and continuing online learning, updating the parameters again from the real and predicted returns of each credit order once that order matures. This update process repeats online until the reinforcement learning model is taken offline. An update to a user's repayment data can trigger on-demand automatic updating of the return prediction network parameters, saving the time otherwise required for sample accumulation and allowing rapid response to industry or market changes.
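A minimal online-update sketch, under the assumption of a single-step (bandit-style) regression of the predicted return toward the realized return; the disclosure does not spell out the loss function:

```python
import torch
import torch.nn.functional as F

def update_on_matured_order(q_net, optimizer, state, action_idx, real_return):
    # Regress the taken action's predicted return toward the realized return.
    optimizer.zero_grad()
    predicted = q_net(state.unsqueeze(0)).squeeze(0)[action_idx]
    loss = F.mse_loss(predicted, torch.tensor(float(real_return)))
    loss.backward()
    optimizer.step()
    return loss.item()
```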
The embodiment of the invention applies a reinforcement learning wind control model with a profit-maximizing value function: the value function obtains return feedback suitable for reinforcement learning by calculating all income and cost in the credit business, directly reflects the business goal of profit maximization, and is applied to updating model parameters and steering the reinforcement learning process, so that the optimization target of the wind control model is consistent with the business target and a more direct and efficient wind control solution is formed. The embodiment comprehensively considers the wind control decisions on credit limit, term, interest rate and whether to pass, provides an action-space definition suited to reinforcement learning and wind control business innovation, and introduces reinforcement learning to build a universal, complete and unified wind control solution. From the user's loan requirements, repayment capacity, credit assessment, credit history, current business link, and the industry's policy guidance, economic situation, emergencies and the like, a comprehensive, accurate and quantifiable state space is constructed from the two aspects of user information and industry environment, providing a data basis for the reinforcement learning wind control model. The predictions of a wind control model constructed by the method of this embodiment can be used for anti-fraud identification, credit limit pricing, composite wind control strategies and the like.
Referring to fig. 3, an embodiment of the present invention further provides a reinforcement learning-based wind control model building apparatus, where the apparatus includes:
the verification module 301, configured to receive a user's credit request, acquire the data required for a wind control decision, and perform admission policy verification;
the generating module 302, configured to process the data required for the wind control decision, construct a state vector, and generate a return prediction network from the state vector and a predefined action space in combination with a reinforcement learning model;
the executing module 303, configured to calculate, with the return prediction network, the predicted return after each action in the action space is implemented, and to select and implement the action with the maximum expected return using a preset search strategy;
and the updating module 304, configured to calculate the real return after the action is implemented according to the user's actual repayment result, and to update the parameters of the return prediction network with return maximization as the goal, based on the real return and the predicted return.
Wherein the generating module 302 further comprises:
the cleaning unit, used for cleaning the data required for the wind control decision;
the digitization unit, used for mapping character strings or other non-numerical information fields in the data cleaned by the cleaning unit to numerical types;
the standardization unit, used for converting the data processed by the digitization unit into dimensionless index values so that information fields on different scales are comparable;
the ordinal encoding unit, used for processing the information fields with a size relation after processing by the standardization unit, mapping values to serial numbers in size order;
the one-hot encoding unit, used for processing the information fields without a size relation after processing by the standardization unit, converting one enumerated information field into multiple binary information fields;
the construction unit, used for assembling the information fields processed by the digitization unit, the standardization unit, the ordinal encoding unit and the one-hot encoding unit into a state vector;
and the network generation unit, used for generating a return prediction network from the state vector and a predefined action space in combination with a reinforcement learning model.
The updating module 304 further comprises:
the initialization unit, used for assigning the parameters of the return prediction network by random initialization and for updating those parameters according to the real and predicted returns of historical credit orders;
and the calculation-and-update unit, used for calculating the real return after the action is implemented according to the user's actual repayment result, and for updating the parameters of the return prediction network again with return maximization as the goal, based on the real return and the predicted return.
It should be noted that other corresponding descriptions of the functional modules involved in the wind control model construction device based on reinforcement learning provided in the embodiment of the present invention may refer to the corresponding descriptions of the methods shown in fig. 1 and 2, and are not described herein again.
Based on the methods shown in fig. 1 and fig. 2, an embodiment of the present invention correspondingly provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the following steps: receiving a credit request from a user, acquiring the data required for a wind control decision, and performing admission policy verification; processing the data required for the wind control decision to construct a state vector, and generating a return prediction network from the state vector and a predefined action space in combination with a reinforcement learning model; calculating, by the return prediction network, the predicted return after each action in the action space is implemented, and selecting and implementing the action with the maximum expected return using a preset search strategy; and calculating the real return after the action is implemented according to the user's actual repayment result, and updating the parameters of the return prediction network with return maximization as the goal, based on the real return and the predicted return.
Based on the above embodiments of the method shown in fig. 1 and 2 and the apparatus shown in fig. 3, an embodiment of the present invention further provides a physical structure diagram of a computer device; as shown in fig. 4, the computer device comprises a memory 41, a processor 42 and a computer program stored on the memory 41 and executable on the processor, the memory 41 and the processor 42 both being arranged on a bus 43. When executing the program, the processor 42 implements the following steps: receiving a credit request from a user, acquiring the data required for a wind control decision, and performing admission policy verification; processing the data required for the wind control decision to construct a state vector, and generating a return prediction network from the state vector and a predefined action space in combination with a reinforcement learning model; calculating, by the return prediction network, the predicted return after each action in the action space is implemented, and selecting and implementing the action with the maximum expected return using a preset search strategy; and calculating the real return after the action is implemented according to the user's actual repayment result, and updating the parameters of the return prediction network with return maximization as the goal, based on the real return and the predicted return.
It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
In practical applications, each functional module and unit involved in the embodiments of the present invention may be implemented by a computer program running on computer hardware; the program may be stored in a computer-readable storage medium and, when executed, may include the flows of the method embodiments described above. The hardware refers to a server, desktop computer, notebook computer or the like containing one or more processors and storage media; the storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like; and the computer program may be written in a computer language not limited to C, C++, or the like.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention, and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A method for constructing a wind control model based on reinforcement learning is characterized by comprising the following steps:
receiving a credit request from a user, acquiring the data required for a wind control decision, and performing admission policy verification;
processing the data required for the wind control decision to construct a state vector, and generating a return prediction network from the state vector and a predefined action space in combination with a reinforcement learning model;
calculating, by the return prediction network, the predicted return after each action in the action space is implemented, and selecting and implementing the action with the maximum expected return using a preset search strategy;
and calculating the real return after the action is implemented according to the user's actual repayment result, and updating the parameters of the return prediction network with return maximization as the goal, based on the real return and the predicted return.
2. The reinforcement learning-based wind control model construction method according to claim 1, wherein the data required for the wind control decision comprises user information and industry environment data; and the admission policy verification comprises: determining that the borrower has full capacity for civil conduct and that the loan is not used for criminal activity.
3. The reinforcement learning-based wind control model construction method according to claim 2, wherein the user information includes loan requirements, repayment capacity, credit assessment, credit history behavior and the current business link; and the industry environment data describes policy guidance, the economic situation and emergencies, including the general environment and subdivided environments such as the industry and region where the user is located.
4. The reinforcement learning-based wind control model construction method according to claim 2, wherein processing the data required for the wind control decision comprises cleaning the user information and industry environment data and applying digital standardized coding to the cleaned data; the cleaning process comprises missing-value processing and outlier processing; missing-value processing is divided into two cases according to whether the missing rate exceeds a threshold: 1) information fields whose missing rate is greater than the threshold are deleted directly; 2) missing values in information fields whose missing rate is less than or equal to the threshold are filled with a fixed value, the fixed value being the mean, median or mode of all non-missing values in that field; outlier processing uses binning to eliminate the influence of outliers, the binning mode being equal-frequency, equal-width or clustering; digital standardized coding comprises a digitization operation, a standardization operation and an encoding operation; the digitization operation maps character strings or other non-numerical information fields to numerical types; the standardization operation converts the data into dimensionless index values so that information fields on different scales are comparable; and the encoding operation comprises ordinal encoding and one-hot encoding, the ordinal encoding processing information fields with a size relation and mapping values to serial numbers in size order, and the one-hot encoding processing information fields without a size relation and converting one enumerated information field into multiple binary information fields.
5. The reinforcement learning-based wind control model construction method according to claim 1, wherein the action space is the set of all possible value combinations of variables including the credit limit, interest rate, term and whether the request passes; the return prediction network is a classical neural network model, including Q-Learning, DQN and Double DQN; and the search strategy is a greedy exploration strategy.
6. The reinforcement learning-based wind control model construction method according to claim 1, wherein the return is defined as the annualized profit margin: annualized profit margin = [profit margin / term (days)] x 365 days x 100%; wherein profit margin = profit / disbursed amount; profit = revenue - cost = interest actually received - bad debt - capital cost - data cost - other costs; bad debt = principal due + interest due - amount actually repaid; capital cost = disbursed amount x term (days) x annualized capital interest rate / 365 days; and the annualized capital interest rate is a fixed value.
7. The reinforcement learning-based wind control model building method according to claim 1, wherein the step of updating the parameters of the reward prediction network comprises:
assigning the parameters of the return prediction network by random initialization;
updating the parameters of the return prediction network according to the real and predicted returns of historical credit orders;
and continuing online learning with the return prediction network, updating its parameters again according to the real return and predicted return of each credit order after that order matures.
8. A wind control model building device based on reinforcement learning is characterized by comprising the following components:
the verification module, used for receiving a user's credit request, acquiring the data required for a wind control decision, and performing admission policy verification;
the generating module, used for processing the data required for the wind control decision, constructing a state vector, and generating a return prediction network from the state vector and a predefined action space in combination with a reinforcement learning model;
the execution module, used for calculating, with the return prediction network, the predicted return after each action in the action space is implemented, and for selecting and implementing the action with the maximum expected return using a preset search strategy;
and the updating module, used for calculating the real return after the action is implemented according to the user's actual repayment result, and for updating the parameters of the return prediction network with return maximization as the goal, based on the real return and the predicted return.
9. The reinforcement learning-based wind control model building device according to claim 8, wherein the generating module comprises:
the cleaning unit, used for cleaning the data required for the wind control decision;
the digitization unit, used for mapping character strings or other non-numerical information fields in the data cleaned by the cleaning unit to numerical types;
the standardization unit, used for converting the data processed by the digitization unit into dimensionless index values so that information fields on different scales are comparable;
the ordinal encoding unit, used for processing the information fields with a size relation after processing by the standardization unit, mapping values to serial numbers in size order;
the one-hot encoding unit, used for processing the information fields without a size relation after processing by the standardization unit, converting one enumerated information field into multiple binary information fields;
the construction unit, used for assembling the information fields processed by the digitization unit, the standardization unit, the ordinal encoding unit and the one-hot encoding unit into a state vector;
and the network generation unit, used for generating a return prediction network from the state vector and a predefined action space in combination with a reinforcement learning model.
10. The reinforcement learning-based wind control model building device according to claim 9, wherein the updating module comprises:
the initialization unit, used for assigning the parameters of the return prediction network by random initialization and for updating those parameters according to the real and predicted returns of historical credit orders;
and the calculation-and-update unit, used for calculating the real return after the action is implemented according to the user's actual repayment result, and for updating the parameters of the return prediction network again with return maximization as the goal, based on the real return and the predicted return.
11. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the computer program, when executed by the processor, implements the steps of the method of any one of claims 1 to 7.
12. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, carries out the steps of the method of any one of claims 1 to 7.
CN202210178571.9A 2022-02-25 2022-02-25 Wind control model construction method and device based on reinforcement learning Pending CN114266655A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210178571.9A CN114266655A (en) 2022-02-25 2022-02-25 Wind control model construction method and device based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210178571.9A CN114266655A (en) 2022-02-25 2022-02-25 Wind control model construction method and device based on reinforcement learning

Publications (1)

Publication Number Publication Date
CN114266655A true 2022-04-01

Family

ID=80833625

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210178571.9A Pending CN114266655A (en) 2022-02-25 2022-02-25 Wind control model construction method and device based on reinforcement learning

Country Status (1)

Country Link
CN (1) CN114266655A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220401