CN115809917A - Risk rating method and device, electronic equipment and storage medium
- Publication number: CN115809917A (application CN202211597645.9A, China); legal status: pending
Abstract
The disclosure provides a risk rating method and apparatus, an electronic device, and a storage medium. The method comprises the following steps: responding to a current business operation instruction of a target user, and acquiring business operation information corresponding to the business operation instruction; determining, based on the business operation information, a risk characteristic factor for evaluating the magnitude of the business risk and a user feedback factor for evaluating the magnitude of the user feedback risk; inputting the initial risk level of the target user, the determined risk characteristic factor and the user feedback factor into a pre-trained reinforcement learning model, and determining a risk level change value output by the model; and updating the initial risk level of the target user based on the risk level change value to obtain an updated risk level. As the user state is updated, the pre-trained reinforcement learning model adaptively adjusts the user's risk level, so that the updated risk level better fits the latest user state and the rating result is more accurate.
Description
Technical Field
The present disclosure relates to the field of computer application technologies, and in particular, to a risk rating method and apparatus, an electronic device, and a storage medium.
Background
In recent years, with the rapid development of the information industry and the popularization of the mobile internet, more and more financial business is conducted online. To ensure the safety of these financial services, evaluating the risk level of each user is essential.
In the prior art, algorithms such as k-nearest neighbor, decision tree and neural network are commonly used to classify or cluster users according to risk characteristics: all users of a product are divided into several user groups according to the strength of their risk characteristics, strong wind control (risk control) is applied to the high-risk groups and weak wind control to the low-risk groups, thereby realizing user risk classification.
However, in practical applications the payment behavior and status of a user change dynamically and continuously, and none of these classification schemes copes well with such changes.
Disclosure of Invention
The embodiment of the disclosure at least provides a risk rating method and device, electronic equipment and a storage medium, so as to adaptively adjust a risk level and improve the accuracy of the risk rating.
In a first aspect, an embodiment of the present disclosure provides a risk rating method, including:
responding to a current business operation instruction of a target user, and acquiring business operation information corresponding to the business operation instruction;
determining a risk characteristic factor for evaluating the magnitude of the business risk and a user feedback factor for evaluating the magnitude of the user feedback risk based on the business operation information;
inputting the initial risk level of the target user, the determined risk characteristic factor and the user feedback factor into a pre-trained reinforcement learning model, and determining a risk level change value output by the model;
and updating the initial risk level of the target user based on the risk level change value to obtain an updated risk level.
In one possible embodiment, the risk characteristic factor includes one or both of the following features:
a behavior risk feature for evaluating the behavior risk of the user, and a transaction risk feature for evaluating the transaction risk;
the user feedback factor includes one or both of the following features:
a user explicit feedback feature for evaluating the risk of direct feedback from the user, and a user implicit feedback feature for evaluating the risk of indirect feedback from the user.
In one possible embodiment, the reinforcement learning model is trained as follows:
acquiring multiple initial risk levels corresponding to multiple users; each user belonging to one of said plurality of initial risk levels;
determining risk characteristic factors and user feedback factors of each user under each initial risk level; each risk characteristic factor is used for evaluating the business risk of the corresponding user, and each user feedback factor is used for evaluating the user feedback risk of the corresponding user;
and taking the risk characteristic factors and the user feedback factors of each user under each initial risk level as input data of the reinforcement learning model to be trained, and carrying out at least one round of training on the reinforcement learning model to obtain a pre-trained reinforcement learning model.
In a possible embodiment, the performing at least one round of training on the reinforcement learning model includes:
for each initial risk level, taking risk characteristic factors and user feedback factors of each user under the initial risk level as input data of a reinforcement learning model to be trained, and determining updated risk levels and corresponding strategy reward values obtained after an agent in the reinforcement learning model executes next action according to a current action strategy;
and circularly executing the following steps until a model convergence condition is reached:
updating the current action strategy to obtain an updated action strategy under the condition that the model convergence condition is not reached based on the corresponding strategy reward value;
and taking the risk characteristic factors and the user feedback factors of each user under the updated risk level as input data of the reinforcement learning model to be trained, and determining the updated risk level and the corresponding strategy reward value obtained after the intelligent agent in the reinforcement learning model executes the next action according to the updated action strategy.
In a possible implementation manner, the determining an updated risk level and a corresponding policy reward value obtained after an agent in the reinforcement learning model executes a next action according to a current action policy includes:
determining a risk level change value selected from a plurality of risk level change values after an agent in the reinforcement learning model executes the next action according to the current action strategy;
and updating the initial risk level based on the selected risk level change value to obtain an updated risk level.
In a possible implementation manner, the determining an updated risk level and a corresponding policy reward value obtained after an agent in the reinforcement learning model executes a next action according to a current action policy includes:
determining the state transition probability and action reward value of the agent in the reinforcement learning model for executing the next action to the updated risk level according to the current action strategy;
determining the policy award value based on a multiplication operation between the state transition probability and the action award value.
In one possible embodiment, the model convergence condition comprises one of the following conditions:
the number of loop iterations reaches the preset number;
the difference between the strategy reward sum values corresponding to two consecutive loops is smaller than a preset difference; the strategy reward sum value corresponding to each loop is determined by the sum of the strategy reward values corresponding to the various risk levels when the next action is executed according to the corresponding action strategy.
In one possible embodiment, after said obtaining the updated risk level, the method further comprises:
determining a target wind control intervention intensity value for the target user based on the updated risk level;
and determining a target wind control strategy corresponding to the target wind control intervention strength value based on the corresponding relation between each wind control intervention strength value and each wind control strategy.
In one possible embodiment, the method further comprises:
feeding back risk handling recommendation information in the target wind control strategy to a client of the target user.
In a second aspect, the present disclosure also provides a risk rating apparatus, including:
the acquisition module is used for responding to the current business operation instruction of the target user and acquiring the business operation information corresponding to the business operation instruction;
the determining module is used for determining a risk characteristic factor for evaluating the magnitude of the business risk and a user feedback factor for evaluating the magnitude of the user feedback risk based on the business operation information; inputting the initial risk level of the target user, the determined risk characteristic factor and the user feedback factor into a pre-trained reinforcement learning model, and determining a risk level change value output by the model;
and the updating module is used for updating the initial risk level of the target user based on the risk level change value to obtain an updated risk level.
In a third aspect, the present disclosure also provides an electronic device, including: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the risk rating method as described in the first aspect and any of its various embodiments.
In a fourth aspect, the present disclosure also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the risk rating method according to the first aspect and any of its various embodiments.
By adopting the risk rating method and apparatus, the electronic device and the storage medium described above, the business operation information corresponding to a business operation instruction is acquired in response to the current business operation instruction of a target user; a risk characteristic factor for evaluating the business risk and a user feedback factor for evaluating the user feedback risk are then determined based on the business operation information; the change of the target user's risk level is predicted from this information with the pre-trained reinforcement learning model; and finally the initial risk level of the target user is updated based on the predicted risk level change value to obtain an updated risk level. As the user state is updated, the pre-trained reinforcement learning model adaptively adjusts the user's risk level, so that the updated risk level better fits the latest user state and the rating result is more accurate.
Other advantages of the present disclosure will be explained in more detail in conjunction with the following description and the accompanying drawings.
It should be understood that the above description is only an overview of the technical solutions of the present disclosure, so that the technical solutions of the present disclosure can be more clearly understood and implemented according to the contents of the specification. In order to make the aforementioned and other objects, features and advantages of the present disclosure comprehensible, specific embodiments thereof are described below.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required by the embodiments are briefly described below. The drawings, which are incorporated in and form a part of the specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain its technical solutions. The following drawings depict only certain embodiments of the disclosure and are therefore not to be considered limiting of its scope; those skilled in the art can derive further related drawings from them without inventive effort. Like reference numerals refer to like elements throughout. In the drawings:
FIG. 1 illustrates a flow chart of a risk rating method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram illustrating a reinforcement learning model building in a risk rating method provided by an embodiment of the present disclosure;
FIG. 3 illustrates a schematic diagram of a risk rating apparatus provided by an embodiment of the present disclosure;
fig. 4 shows a schematic diagram of an electronic device provided by an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
In the description of the embodiments of the present disclosure, it is to be understood that terms such as "including" or "having" are intended to indicate the presence of the features, numbers, steps, actions, components, parts, or combinations thereof disclosed in the specification, and are not intended to preclude the presence or addition of one or more other features, numbers, steps, actions, components, parts, or combinations thereof.
A "/" indicates an OR meaning, for example, A/B may indicate A or B; "and/or" herein is merely an association relationship describing an associated object, and means that there may be three relationships, for example, a and/or B, and may mean: a exists alone, A and B exist simultaneously, and B exists alone.
The terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first," "second," etc. may explicitly or implicitly include one or more of such features. In the description of the embodiments of the present disclosure, "a plurality" means two or more unless otherwise specified.
Research shows that, in order to ensure the safety of various financial services, it is important to evaluate the risk level of the user. Taking the cloud flash payment application (APP) as an example, this APP centers on mobile payment functions and is responsible for identity verification in links such as user registration, card binding, login, opening of service functions and payment; it directly bears the risk losses and compensation liability caused by fraudulent card binding, theft of registered accounts and information leakage, so risk prevention and control on the cloud flash payment user side can never be relaxed.
The mobile payment market is highly competitive, and in an era where user experience is king, competitors such as WeChat Pay and Alipay pay great attention to the user experience. How to balance wind control strength against user disturbance, on the premise of protecting the security of user funds, so that normal users do not perceive the wind control and can pay safely and conveniently, has therefore become a subject requiring continuous exploration.
In the prior art, algorithms such as k-nearest neighbor, decision tree and neural network are often used to classify or cluster users according to risk characteristics: all users of a product are divided into several user groups according to the strength of their risk characteristics, strong wind control is applied to the high-risk groups and weak wind control to the low-risk groups, thereby realizing user risk classification.
The k-nearest neighbor algorithm is based on Euclidean distance: the closer two objects are, the more similar they are considered. A decision tree can learn from given samples to derive a classifier, which is then used to classify newly arriving objects. A neural network algorithm establishes a mechanism loosely analogous to the human brain: the network is generally designed with an input layer, hidden layers and an output layer, a computation rule is designed for each layer of neurons, and during training and use the information to be classified is fed into the network and processed layer by layer to output a classification result.
Among these prior-art classification algorithms, the k-nearest neighbor algorithm depends heavily on initial parameters, so the grouping result is strongly influenced by the experience of whoever sets the parameters; the decision tree algorithm is strongly influenced by the training samples and is prone to over-fitting; and complex neural network algorithms contain hidden layers whose black-box effect makes the algorithms poorly interpretable, while training a neural network places high demands on data volume and computing resources.
The existing classification algorithms also have the following shortcomings. First, they do not consider that the objects being classified may change dynamically while the wind control system is running: the characteristics of an object drift over time, so the object gradually deviates from its current class. In a risk prevention and control scenario, if the user risk classification system does not sense the user's feedback in time and adjust the strategy strength, the wind control strength corresponding to the user's current class no longer suits the user: either the risk exposure increases, or the wind control is too strong and harms the user's product experience. Second, the overall influence of the wind control system on the product is unstable. When a large number of users change in the same direction within a similar time window, their risk intervention strength collectively becomes stronger or weaker, and the system's impact on the product becomes uncontrollable. For example, during a marketing campaign the number of active users of a product rises as a whole; if the user rating standard stays unchanged, the number of users in a group at some risk level may surge during the campaign, sharply increasing the trigger volume of the risk policies at the corresponding intervention strength and thereby affecting the product as a whole.
To at least partially solve one or more of the above problems and other potential problems, the present disclosure provides at least one risk rating scheme based on a reinforcement learning model. The scheme collects explicit and implicit user feedback during product usage, periodically generates a user risk intervention strength transition strategy in combination with risk characteristics, and automatically adjusts each user's risk intervention strength according to that strategy and the user's current characteristic values. Because the transition strategy is learned from the states and characteristics of all users in the system and is continuously iterated to track the latest user state, the risk handling applied under the strategy better matches user needs and improves the user experience.
To facilitate understanding of the present embodiment, first, a risk rating method disclosed in an embodiment of the present disclosure is described in detail, where an execution subject of the risk rating method provided in the embodiment of the present disclosure is generally an electronic device with certain computing capability, and the electronic device includes, for example: a terminal device, which may be a User Equipment (UE), a mobile device, a User terminal, a handheld device, a vehicle-mounted device, a wearable device, or a server or other processing device. In some possible implementations, the risk rating method may be implemented by a processor invoking computer readable instructions stored in a memory.
Referring to fig. 1, a flowchart of a risk rating method provided by an embodiment of the present disclosure is shown, the method includes steps S101 to S104, where:
S101: responding to a current business operation instruction of a target user, and acquiring business operation information corresponding to the business operation instruction;
S102: determining a risk characteristic factor for evaluating the magnitude of the business risk and a user feedback factor for evaluating the magnitude of the user feedback risk based on the business operation information;
S103: inputting the initial risk level of the target user, the determined risk characteristic factor and the user feedback factor into a pre-trained reinforcement learning model, and determining a risk level change value output by the model;
S104: and updating the initial risk level of the target user based on the risk level change value to obtain an updated risk level.
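The following Python sketch illustrates how steps S101 to S104 could be wired together. It is not the disclosed implementation: the function names (extract_risk_factors, extract_feedback_factors, update_risk_level), the field names inside the dictionaries and the clipping of the level to [1, n] are all assumptions made for illustration, and the trained reinforcement learning model is treated as an opaque object with a predict method.

```python
# Illustrative sketch of steps S101-S104; all names and fields are hypothetical.
def extract_risk_factors(operation_info: dict) -> dict:
    # S102 (part 1): assumed behavior-risk and transaction-risk features.
    return {
        "behavior_risk": operation_info.get("page_click_anomaly", 0.0),
        "transaction_risk": operation_info.get("transfer_amount", 0.0) / 10000.0,
    }

def extract_feedback_factors(operation_info: dict) -> dict:
    # S102 (part 2): assumed explicit and implicit user-feedback features.
    return {
        "explicit_feedback": operation_info.get("reported_payee", 0.0),
        "implicit_feedback": operation_info.get("abnormal_dwell_time", 0.0),
    }

def update_risk_level(initial_level: int, operation_info: dict, model, n_levels: int = 5) -> int:
    """S101-S104: extract the two factors, query the trained model, apply the change value."""
    risk_factors = extract_risk_factors(operation_info)
    feedback_factors = extract_feedback_factors(operation_info)
    change = model.predict(initial_level, risk_factors, feedback_factors)  # S103
    return max(1, min(n_levels, initial_level + change))                   # S104
```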
In order to facilitate understanding of the risk rating method provided by the embodiments of the present disclosure, an application scenario of the method is first described in detail. The risk rating method in the embodiment of the disclosure is mainly applied to various financial fields, especially to the related financial business field with wind control requirements, and can relate to various business operation behaviors including registration, login, card binding, button clicking, coupon picking, page visiting, payment, account transfer, repayment and the like.
The relevant business operation instructions can be initiated by the target user on a specific client, for example, the target user performs fund-related business operations such as login, password recovery, card binding, payment and account transfer in a cloud flash payment APP. Different business operations correspond to different business operation information, various risk characteristics and user feedback data of the user can be calculated according to the business operation information, and then the risk level of the user is updated based on the latest action strategy in the pre-trained reinforcement learning model.
It should be noted that there may be a single target user, in which case the risk level of that one user is updated, or multiple target users, in which case the risk levels of the multiple users are updated at the same time; no specific limitation is made here. In addition, the business operation information of a target user may include not only the current operation information corresponding to the current request but also historical operation information corresponding to historical states. The historical operation information is bound to the target user and, as new operations are initiated, the operation information is updated and serves as the operation information for this user in subsequent processing.
In practical application, the receiving of the request and the extraction of the information can be specifically realized through the introduced wind control system, then, the wind control system calculates various risk characteristics and user feedback data of the user in real time according to a time window and the whole system, and the risk rating (namely the updated risk level) of the user is updated by combining with the latest action strategy.
In practical applications, the relevant features of the target user after updating the risk level can be used as input data for the next round of reinforcement learning model training and corresponding action strategy updating, so as to further improve the learning performance of the reinforcement learning model.
The corresponding wind control intervention strength value may be determined based on the updated risk level, e.g., the higher the updated risk level, the higher the corresponding strength value, or vice versa, such that the target wind control strategy corresponding to the target wind control intervention strength value is determined based on the correspondence between each wind control intervention strength value and each wind control strategy.
The wind control strategy may, for example, only remind the user of risks through a short message, combine a short-message reminder with face recognition, or be some other wind control strategy; it is not specifically limited here and can be determined according to the actual application scenario.
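As a concrete illustration of the correspondence between intervention strength values and wind control strategies, a minimal lookup could look like the sketch below. The level-to-strength mapping and the strategy texts are invented examples, not values taken from the disclosure.

```python
# Hypothetical level -> intervention strength -> wind control strategy lookup.
INTERVENTION_STRENGTH = {1: 0, 2: 1, 3: 2, 4: 3, 5: 4}  # higher risk level, stronger intervention

WIND_CONTROL_STRATEGY = {
    0: "no extra verification",
    1: "SMS risk reminder",
    2: "SMS verification code",
    3: "SMS verification code + face recognition",
    4: "block the transaction and require manual review",
}

def select_wind_control_strategy(updated_level: int) -> str:
    """Map the updated risk level to its target intervention strength, then to a strategy."""
    strength = INTERVENTION_STRENGTH[updated_level]
    return WIND_CONTROL_STRATEGY[strength]
```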
In the embodiment of the disclosure, the risk handling suggestion information in the target wind control strategy is fed back to the client of the target user. In practical applications it can also be returned to the cloud flash payment APP backend for use by the product side; for example, if the current transfer carries a transaction risk, the user can be prompted that the transfer is risky.
In the process of determining the risk level change value, firstly, the risk characteristic factor and the user feedback factor need to be determined based on the business operation information, and then the information is used as the input data of the pre-trained reinforcement learning model, and the model can output the risk level change value, for example, the risk level change value can be upgraded from the risk level 1 to the risk level 3, and the total change is 2.
The risk characteristic factors mainly include behavior risk features for evaluating the user's behavior risk and transaction risk features for evaluating the transaction risk. The behavior risk features may include behaviors in which the user visits or clicks specified App pages, as well as non-fund transaction behaviors such as password modification, device change and card unbinding; the transaction risk features may include the amount, time, frequency, counterparty and the like of fund transactions such as consumption and transfer.
In addition, the user feedback factors mainly include a user explicit feedback feature for evaluating the risk of direct user feedback and a user implicit feedback feature for evaluating the risk of indirect user feedback. The explicit feedback feature may include the user denying a transaction or reporting information provided by a payee or merchant; the implicit feedback feature may include information such as abnormal dwell or abnormal attempts in certain business links.
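One possible way to organize the two factor groups described above is sketched below; the class and field names are assumptions chosen for illustration only.

```python
from dataclasses import dataclass

@dataclass
class RiskCharacteristicFactor:
    # Behavior risk features: page visits/clicks, password modification,
    # device change, card unbinding and other non-fund operations.
    behavior_risk: float
    # Transaction risk features: amount, time, frequency, counterparty of fund transactions.
    transaction_risk: float

@dataclass
class UserFeedbackFactor:
    # Explicit feedback: the user denies a transaction, reports a payee or merchant.
    explicit_feedback: float
    # Implicit feedback: abnormal dwell time, abnormal attempts in business links.
    implicit_feedback: float
```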
For the updated risk level, a corresponding wind control strategy can then be adopted, so that more accurate wind control measures are provided and the user experience is further improved; that is, wind control is not missed when the risk level rises, and wind control is not added unnecessarily when the risk level falls.
In consideration of the key role of the pre-training of the reinforcement learning model on the risk rating, the following process for training the reinforcement learning model is mainly described, and specifically includes the following steps:
the method comprises the following steps of firstly, obtaining multiple initial risk levels corresponding to multiple users; each user belongs to one of a plurality of initial risk levels;
step two, aiming at each initial risk grade, determining risk characteristic factors and user feedback factors of each user under the initial risk grade; each risk characteristic factor is used for evaluating the business risk of the corresponding user, and each user feedback factor is used for evaluating the user feedback risk of the corresponding user;
and thirdly, taking the risk characteristic factors and the user feedback factors of the users under each initial risk level as input data of the reinforcement learning model to be trained, and performing at least one round of training on the reinforcement learning model to obtain a pre-trained reinforcement learning model.
Here, multiple initial risk levels are first determined; then, for each initial risk level, the risk characteristic factors and user feedback factors of each user at that level are determined; finally, these two kinds of factors are used as input data for training the reinforcement learning model, and the pre-trained reinforcement learning model is obtained through multiple rounds of loop iteration.
The process related to model training, namely the process of optimizing action strategy, can reach the convergence condition of the model through multiple rounds of loop iteration.
Firstly, aiming at each initial risk level, taking risk characteristic factors and user feedback factors of users under the initial risk level as input data of a reinforcement learning model to be trained, and determining updated risk levels and corresponding strategy reward values obtained after an agent in the reinforcement learning model executes next action according to a current action strategy.
Then, the following steps are executed in a loop until the model convergence condition is reached:
updating the current action strategy to obtain an updated action strategy under the condition that the model convergence condition is not reached based on the corresponding strategy reward value;
and taking the risk characteristic factors and the user feedback factors of each user under the updated risk level as input data of the reinforcement learning model to be trained, and determining the updated risk level and the corresponding strategy reward value obtained after the intelligent agent in the reinforcement learning model executes the next action according to the updated action strategy.
It can be known that, under the condition that the model convergence condition is not reached, the action strategy needs to be updated, so that the updated action strategy selects a risk level change value with a higher strategy reward value, and the action strategy iterates in sequence until the model convergence in the whole environment state is reached.
With the updating of the action strategy, the risk level change value with higher strategy reward value can be selected from a plurality of risk level change values, and the risk level updating is carried out based on the change value.
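A skeleton of this training loop is sketched below. It only mirrors the control flow described above; the helper callables step (returning the updated risk level and strategy reward value for the next action under a given action policy) and update_policy (returning an improved policy from the per-level rewards), as well as the default iteration cap and threshold, are assumptions for illustration.

```python
def train_action_policy(levels, factors_by_level, policy, step, update_policy,
                        max_iters=100, eps=1e-4):
    """Hypothetical skeleton of the training loop; `factors_by_level` is assumed to
    hold the risk characteristic and user feedback factors for every risk level."""
    prev_total = None
    for _ in range(max_iters):                      # convergence condition 1: iteration cap
        # Updated level and strategy reward for each level under the current action policy.
        results = {lvl: step(lvl, factors_by_level[lvl], policy) for lvl in levels}
        total = sum(reward for _, reward in results.values())
        if prev_total is not None and abs(total - prev_total) < eps:
            break                                   # convergence condition 2: reward sum stabilizes
        policy = update_policy(policy, results)     # improve the current action policy
        # The next loop takes the factors observed at the updated risk levels as input.
        levels = sorted({updated for updated, _ in results.values()})
        prev_total = total
    return policy
```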
In the embodiment of the present disclosure, the policy reward value corresponding to the next action executed according to the current action policy may be implemented by the following steps:
step one, determining the state transition probability and the action reward value for the agent in the reinforcement learning model transitioning to the updated risk level by executing the next action according to the current action strategy;
step two, determining the strategy reward value as the product of the state transition probability and the action reward value.
Here, a higher strategy reward value is obtained when the state transition probability and the action reward value are both higher; conversely, a smaller strategy reward value is obtained when the state transition probability or the action reward value is smaller.
To facilitate a further understanding of the above process, a detailed description may be provided below in conjunction with fig. 2.
As shown in fig. 2, it is assumed that there is an agent in an unknown environment. The unknown environment may refer to the environment in which all users of the cloud flash payment APP perform various operations such as registration, login, card binding, button clicking, coupon picking, page access, payment, account transfer and repayment, and in which the operation actions of each user cannot be predicted.
The agent continuously interacts with the unknown environment and, according to the environment state s ∈ S, takes an action a ∈ A on the environment in combination with the agent's action strategy π(a|s). The strategy π(a|s) is a function of the risk characteristic value of all users in state s, the feedback value f(s) of all users in state s, and the parameters μ1 and μ2 for the risk characteristics and the user feedback, respectively.
Behavior risk characteristics: behaviors in which the user visits or clicks specified App pages, and non-fund transaction behaviors such as password modification, device change and card unbinding.
Transaction risk characteristics: the amount, time, frequency, counterparty and the like of fund transactions such as consumption and transfer.
f(s) = f{user explicit feedback 1, user explicit feedback 2, …, user implicit feedback 1, user implicit feedback 2, …}
User explicit feedback: includes the user denying a transaction, or reporting information provided by a payee or merchant.
User implicit feedback: includes information such as abnormal dwell or abnormal attempts in certain business links.
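The disclosure does not give the concrete functional form that combines the risk characteristic value, f(s) and the parameters μ1 and μ2 into π(a|s). Purely as an illustration, the sketch below assumes a softmax over per-action scores built from a weighted combination of the two values; this form is an assumption, not the patented policy function.

```python
import math

def policy_distribution(action_values, risk_value, feedback_value, mu1, mu2):
    """Assumed illustrative form of pi(a|s): a weighted combination of the state's
    risk characteristic value and feedback value biases the level-change actions."""
    scores = [(mu1 * risk_value + mu2 * feedback_value) * a for a in action_values]
    z = sum(math.exp(sc) for sc in scores)
    return {a: math.exp(sc) / z for a, sc in zip(action_values, scores)}

# Example: a riskier overall state pushes probability mass toward level increases.
print(policy_distribution([-1, 0, 1], risk_value=0.8, feedback_value=0.5, mu1=1.0, mu2=0.5))
```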
Each time the agent performs an action, the environment moves from one state s to another state s′, and the agent obtains a reward r ∈ R from the environment; the agent takes actions with the goal of maximizing the cumulative reward V(s) and learns to generate the optimal action strategy π(s).
The main improvement of the risk rating method provided by the embodiment of the present disclosure is to optimize the action strategy function model: the risk characteristic factor and the user feedback factor are added into the state transition strategy function, the possible transition paths and reward values of the user state are set according to business requirements, and the value function is solved over multiple rounds until convergence, yielding the action strategy that is optimal for the current period.
Here, an existing risk rating classification algorithm is first used to extract user risk features and give each user an initial risk rating; the risk levels of the cloud flash payment APP are divided into levels 1 to n from low to high, and each user belongs to one of these levels.
Each risk level is taken as a system state, and the model comprises n states:
s ∈ {G_1, G_2, G_3, …, G_n}
The interaction between the agent and the environment generates a time sequence t = 1, 2, …, T. The sequence consists of alternating states, actions and rewards; during strategy execution the agent continuously accumulates environment information, learns the optimal strategy and decides the next action. Denoting the state, action and reward at time t as S_t, A_t and R_t respectively, the whole history from the initial time to time t is:
H_t = S_1, A_1, R_1, S_2, A_2, R_2, …, A_t, R_t, S_t
The current state S_t is a function of the whole information sequence so far, S_t = f(H_t). In this process, the action strategy that the agent follows in state s incorporates the risk characteristic factors and the user feedback factors: it is a function of the risk characteristic value of all users in state s, the feedback value f(s) of all users in state s, and the parameters μ1 and μ2 for the risk characteristics and the user feedback, respectively.
After action a is executed, the environment enters the next state s′ and the reward r is obtained; this is recorded as a state transition (s, a, s′, r). The optional actions in each state are:
A = {keep unchanged, rise 1 risk level, …, rise n levels, fall 1 level, …, fall n levels}
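The action set A can be encoded directly as integer risk-level change values; the clipping of the resulting level to the range [1, n] in the sketch below is an assumption added for illustration.

```python
def action_space(n_levels: int):
    """A = {keep unchanged, rise 1..n levels, fall 1..n levels} as level-change integers."""
    return list(range(-n_levels, n_levels + 1))

def apply_action(level: int, change: int, n_levels: int) -> int:
    """Apply a risk-level change value, keeping the result within [1, n]."""
    return max(1, min(n_levels, level + change))
```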
The rewards (i.e., the action reward values) obtained by each action in the different states are set according to business requirements; in the standard formulation the reward function is the expected reward R(s, a) = E[R_{t+1} | S_t = s, A_t = a].
The probability that, from state s, performing action a leads to the next state s′ and yields reward r is recorded using the transition function P:
P(s′, r | s, a) = P[S_{t+1} = s′, R_{t+1} = r | S_t = s, A_t = a]
The state transition probability is then:
P(s′ | s, a) = Σ_r P(s′, r | s, a)
Assuming that all states s and the strategy π in this process have the Markov property, the strategy π and the value function V_π(s) in state s depend only on the current state and are independent of the historical states. The value function is:
V_π(s) = E_π(R_{t+1} + γR_{t+2} + γ²R_{t+3} + … | S_t = s)
where γ ∈ [0, 1] is the discount (attenuation) coefficient, describing that rewards further in the future have an increasingly weaker impact on the agent.
Using the Bellman equation, the value function can be written as:
V_π(s) = Σ_a π(a|s) Σ_{s′,r} P(s′, r | s, a) [r + γ V_π(s′)]
Strategy evaluation is performed with dynamic programming: starting from an arbitrary state value function, the state value function is iteratively updated under the given strategy and the strategy is adjusted in time. After the k-th iteration, the state value function of round k+1 is computed as:
V_{π,k+1}(s) = Σ_a π(a|s) Σ_{s′,r} P(s′, r | s, a) [r + γ V_{π,k}(s′)]
until convergence, i.e. |V_{π,k+1}(s) − V_{π,k}(s)| < ε, at which point the optimal value function and the optimal strategy are calculated.
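The iteration above is standard policy evaluation by dynamic programming; a minimal sketch is given below. The dictionary-based encoding of π(a|s) and of P(s′, r | s, a) as (probability, next_state, reward) triples is an assumption made for illustration.

```python
def policy_evaluation(states, actions, policy, transitions,
                      gamma=0.9, eps=1e-6, max_iters=1000):
    """Iterate V_{k+1}(s) = sum_a pi(a|s) * sum_{s',r} P(s',r|s,a) * (r + gamma*V_k(s'))
    until max_s |V_{k+1}(s) - V_k(s)| < eps.

    policy[s][a] is pi(a|s); transitions[(s, a)] is a list of (p, s_next, r) triples.
    """
    v = {s: 0.0 for s in states}
    for _ in range(max_iters):
        v_next = {
            s: sum(
                policy[s][a] * sum(p * (r + gamma * v[s2])
                                   for p, s2, r in transitions[(s, a)])
                for a in actions
            )
            for s in states
        }
        converged = max(abs(v_next[s] - v[s]) for s in states) < eps
        v = v_next
        if converged:
            break
    return v
```

Alternating this evaluation step with a policy improvement step (choosing, in each state, the level-change action with the highest expected value) would then yield the optimal value function and optimal action strategy referred to above.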
The model convergence condition here may thus be that the difference between the strategy reward sum values of two consecutive loops is smaller than a preset difference (i.e., ε); in practical applications it may also be a condition such as the number of loop iterations reaching a preset number (e.g., 100), which is not limited here.
The optimal user risk intervention strength transition strategy generated in the previous period is fed into the risk system for real-time use, and the wind control system dynamically adjusts the risk intervention strength of all requesting users in the current period according to the strategy, realizing both a locally optimal user risk intervention strength and a globally optimal intervention strength of the wind control system.
In practical applications, a strategy update period θ and an update threshold μ may be set; the wind control system is restarted once every period θ, and also whenever μ becomes abnormal, that is, the process of updating the ratings and updating the action strategy is performed repeatedly. After each run of the system finishes, the agent supplies the set of optimal strategies it has learned to the wind control system.
In practical applications, an effect evaluation can be performed after each period of operation. The evaluation dimensions include the fraud transaction interception effectiveness A and the disturbance rate D; according to the evaluation results, the parameters μ1 and μ2 of the state quantities and state transition probabilities, as well as the reward parameters, are adjusted, so that the system is optimized towards continuously increasing the interception effectiveness A and continuously decreasing the disturbance rate D.
In summary, the risk rating method provided by the embodiment of the present disclosure has the following significant advantages:
First, the accuracy of the user risk rating is improved. Based on the reinforcement learning algorithm, user states, including the user's risk characteristics and user feedback data, are collected periodically, and an optimal state transition strategy is learned and generated from the change of the user states from t to t′.
Second, the risk intervention strength of each user in the next time period is adjusted dynamically based on the optimal strategy, so that the fit between the risk rating result of each time period and the user's latest state and risk characteristics is significantly improved, and the rating result is more accurate.
Third, the stability of risk transaction identification is improved. The transaction behaviors, risk characteristics and other conditions of all users of the product are analyzed periodically, so the user state transition strategy updated in each period is learned from the latest characteristics of the whole user population. This smooths, to a certain extent, the influence of periodic events such as product marketing and holiday effects on the wind control system, reduces abnormal fluctuations in the trigger volume of deployed risk policies caused by normal periodic events, and further strengthens the wind control system's resistance to various abnormal disturbances.
Fourth, the risk identification accuracy is improved and the disturbance rate to normal transactions is reduced. The method and device optimize and update the user state transition strategy; after a new strategy is generated, the risk ratings of all users in the system are updated for the current period, their risk intervention strengths are updated, user risk characteristics are calculated in real time, and each user's fund behavior is intervened with the matching risk intervention strength, achieving strong transaction intervention for high-risk users and weak transaction intervention for low-risk users.
In the description of the present specification, reference to a description of the term "some possible embodiments," "some embodiments," "examples," "specific examples," or "some examples" or the like means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present disclosure. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the various embodiments or examples and features of the various embodiments or examples described in this specification can be combined and combined by those skilled in the art without contradiction.
With regard to the method flow diagrams of the disclosed embodiments, certain operations are described as different steps performed in a certain order. Such flow diagrams are illustrative and not restrictive. Certain steps described herein may be grouped together and performed in a single operation, certain steps may be separated into sub-steps, and certain steps may be performed in an order different than presented herein. The various steps shown in the flowcharts may be implemented in any way by any circuit structure and/or tangible mechanism (e.g., by software running on a computer device, hardware (e.g., logical functions implemented by a processor or chip), etc., and/or any combination thereof).
It will be understood by those skilled in the art that in the method of the present invention, the order of writing the steps does not imply a strict order of execution and any limitations on the implementation, and the specific order of execution of the steps should be determined by their function and possible inherent logic.
Based on the same inventive concept, a risk rating device corresponding to the risk rating method is also provided in the embodiments of the present disclosure, and as the principle of solving the problem by the device in the embodiments of the present disclosure is similar to the risk rating method in the embodiments of the present disclosure, the implementation of the device may refer to the implementation of the method, and the repeated parts are not described again.
Referring to fig. 3, a schematic diagram of a risk rating apparatus provided in an embodiment of the present disclosure is shown, the apparatus includes: an acquisition module 301, a determination module 302 and an update module 303; wherein,
an obtaining module 301, configured to respond to a current service operation instruction of a target user, and obtain service operation information corresponding to the service operation instruction;
a determining module 302, configured to determine, based on the service operation information, a risk feature factor for evaluating the size of the service risk and a user feedback factor for evaluating the size of the user feedback risk; inputting the initial risk level of the target user, the determined risk characteristic factor and the user feedback factor into a pre-trained reinforcement learning model, and determining a risk level change value output by the model;
the updating module 303 is configured to update the initial risk level of the target user based on the risk level change value, so as to obtain an updated risk level.
By adopting the risk rating apparatus, the business operation information corresponding to a business operation instruction is acquired in response to the current business operation instruction of a target user; a risk characteristic factor for evaluating the business risk and a user feedback factor for evaluating the user feedback risk are then determined based on the business operation information; the change of the target user's risk level is predicted from this information with the pre-trained reinforcement learning model; and finally the initial risk level of the target user is updated based on the predicted risk level change value to obtain an updated risk level. As the user state is updated, the pre-trained reinforcement learning model adaptively adjusts the user's risk level, so that the updated risk level better fits the latest user state and the rating result is more accurate.
In one possible embodiment, the risk characteristic factor includes one or both of the following features:
a behavior risk feature for evaluating the behavior risk of the user, and a transaction risk feature for evaluating the transaction risk;
the user feedback factor includes one or both of the following features:
a user explicit feedback feature for evaluating the risk of direct feedback from the user, and a user implicit feedback feature for evaluating the risk of indirect feedback from the user.
In a possible embodiment, the above apparatus further comprises:
a training module 304, configured to train the reinforcement learning model according to the following steps:
acquiring multiple initial risk levels corresponding to multiple users; each user belongs to one of a plurality of initial risk levels;
determining risk characteristic factors and user feedback factors of each user under the initial risk level aiming at each initial risk level; each risk characteristic factor is used for evaluating the business risk of the corresponding user, and each user feedback factor is used for evaluating the user feedback risk of the corresponding user;
and taking the risk characteristic factors and the user feedback factors of each user under each initial risk level as input data of the reinforcement learning model to be trained, and carrying out at least one round of training on the reinforcement learning model to obtain the pre-trained reinforcement learning model.
In one possible implementation, the training module 304 is configured to perform at least one training round on the reinforcement learning model according to the following steps:
aiming at each initial risk level, taking risk characteristic factors and user feedback factors of each user under the initial risk level as input data of the reinforcement learning model to be trained, and determining updated risk levels and corresponding strategy reward values obtained after an agent in the reinforcement learning model executes next action according to a current action strategy;
and circularly executing the following steps until a model convergence condition is reached:
updating the current action strategy to obtain an updated action strategy under the condition that the model convergence condition is not reached based on the corresponding strategy reward value;
and taking the risk characteristic factors and the user feedback factors of all users at the updated risk level as input data of the reinforcement learning model to be trained, and determining the updated risk level and the corresponding strategy reward value obtained after an intelligent agent in the reinforcement learning model executes the next action according to the updated action strategy.
In one possible implementation, the training module 304 is configured to determine an updated risk level and a corresponding policy reward value obtained after an agent in the reinforcement learning model performs a next target action according to a current action policy, according to the following steps:
determining a risk grade change value selected from a plurality of risk grade change values after an agent in the reinforcement learning model executes the next action according to the current action strategy;
and updating the initial risk level based on the selected risk level change value to obtain an updated risk level.
In one possible implementation, the training module 304 is configured to determine an updated risk level and a corresponding policy reward value obtained after an agent in the reinforcement learning model performs a next action according to a current action policy, according to the following steps:
determining the state transition probability and action reward value of an agent in the reinforcement learning model from the next action to the updated risk level according to the current action strategy;
a policy reward value is determined based on a multiplication operation between the state transition probability and the action reward value.
In one possible embodiment, the model convergence condition comprises one of the following conditions:
the number of loop iterations reaches the preset number;
the difference between the strategy reward sum values corresponding to two consecutive loops is smaller than a preset difference; the strategy reward sum value corresponding to each loop is determined by the sum of the strategy reward values corresponding to the various risk levels when the next action is executed according to the corresponding action strategy.
In a possible embodiment, the above apparatus further comprises:
an intervention module 305, configured to determine a target wind control intervention strength value for the target user based on the updated risk level after obtaining the updated risk level; and determining a target wind control strategy corresponding to the target wind control intervention strength value based on the corresponding relation between each wind control intervention strength value and each wind control strategy.
In a possible implementation, the intervention module 305 is further configured to:
and feeding back the risk handling suggestion information in the target wind control strategy to a client of the target user.
It should be noted that the apparatus in the embodiment of the present disclosure may implement each process of the foregoing method embodiment, and achieve the same effect and function, which are not described herein again.
An embodiment of the present disclosure further provides an electronic device, as shown in fig. 4, which is a schematic structural diagram of the electronic device provided in the embodiment of the present disclosure, and the electronic device includes: a processor 401, a memory 402, and a bus 403. The memory 402 stores machine-readable instructions executable by the processor 401 (for example, execution instructions corresponding to the obtaining module 301, the determining module 302, and the updating module 303 in the apparatus in fig. 3, and the like), when the electronic device runs, the processor 401 and the memory 402 communicate through the bus 403, and when the machine-readable instructions are executed by the processor 401, the following processes are performed:
responding to a current business operation instruction of a target user, and acquiring business operation information corresponding to the business operation instruction;
determining a risk characteristic factor for evaluating the magnitude of the business risk and a user feedback factor for evaluating the magnitude of the user feedback risk based on the business operation information;
inputting the initial risk level of the target user, the determined risk characteristic factor and the user feedback factor into a pre-trained reinforcement learning model, and determining a risk level change value output by the model;
and updating the initial risk level of the target user based on the risk level change value to obtain an updated risk level.
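Read end to end, these four steps might be sketched as follows; the feature definitions, thresholds, and model interface are illustrative assumptions rather than the actual pre-trained reinforcement learning model.

```python
from dataclasses import dataclass

@dataclass
class OperationInfo:
    amount: float               # transaction amount in the business operation
    new_device: bool            # operation initiated from an unfamiliar device
    user_reported_fraud: bool   # explicit user feedback, e.g. a fraud report

def risk_characteristic_factor(op: OperationInfo) -> float:
    # Behaviour / transaction risk: unfamiliar device and large amounts score higher.
    return (0.5 if op.new_device else 0.0) + min(op.amount / 10_000.0, 0.5)

def user_feedback_factor(op: OperationInfo) -> float:
    # Explicit feedback dominates in this sketch; implicit feedback is omitted.
    return 1.0 if op.user_reported_fraud else 0.0

class DummyRatingModel:
    """Stand-in for the pre-trained reinforcement learning model."""
    def predict_change(self, level: int, risk: float, feedback: float) -> int:
        score = risk + feedback
        return 1 if score > 0.8 else (-1 if score < 0.2 else 0)

def updated_risk_level(op: OperationInfo, initial_level: int,
                       model: DummyRatingModel = DummyRatingModel()) -> int:
    change = model.predict_change(initial_level,
                                  risk_characteristic_factor(op),
                                  user_feedback_factor(op))
    return max(0, initial_level + change)   # apply the risk level change value

print(updated_risk_level(OperationInfo(9000.0, True, False), initial_level=1))  # -> 2
```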
The disclosed embodiments also provide a computer-readable storage medium having stored thereon a computer program, which, when executed by a processor, performs the steps of the risk rating method described in the above method embodiments. The storage medium may be a volatile or non-volatile computer-readable storage medium.
Embodiments of the present disclosure further provide a computer program product carrying program code, where the instructions included in the program code may be used to execute the steps of the risk rating method in the foregoing method embodiments; for details, reference may be made to the foregoing method embodiments, which are not repeated here.
The computer program product may be implemented by hardware, software, or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium; in another alternative embodiment, the computer program product is embodied in a software product, such as a Software Development Kit (SDK).
The embodiments in the disclosure are described in a progressive manner, and the same and similar parts among the embodiments can be referred to each other, and each embodiment focuses on differences from other embodiments. In particular, the description of the apparatus, device, and computer-readable storage medium embodiments is simplified because they are substantially similar to the method embodiments, and reference may be made to some descriptions of the method embodiments for related aspects.
The apparatuses, devices, and computer-readable storage media provided in the embodiments of the present disclosure correspond to the methods one to one, and therefore, the apparatuses, devices, and computer-readable storage media also have similar beneficial technical effects to the corresponding methods.
As will be appreciated by one of skill in the art, embodiments of the present disclosure may be provided as a method, apparatus (device or system), or computer-readable storage medium. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer-readable storage medium embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices or systems), and computer-readable storage media according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, Phase-change Random Access Memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device.

Further, while the operations of the disclosed methods are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be broken down into multiple steps.
While the spirit and principles of the present disclosure have been described with reference to several particular embodiments, it is to be understood that the present disclosure is not limited to the particular embodiments disclosed, nor does the division into aspects imply that features in those aspects cannot be combined to advantage; that division is adopted only for convenience of presentation. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.
Claims (12)
1. A method of risk rating, comprising:
responding to a current business operation instruction of a target user, and acquiring business operation information corresponding to the business operation instruction;
determining a risk characteristic factor for evaluating the magnitude of the business risk and a user feedback factor for evaluating the magnitude of the user feedback risk based on the business operation information;
inputting the initial risk level of the target user, the determined risk characteristic factor and the user feedback factor into a pre-trained reinforcement learning model, and determining a risk level change value output by the model;
and updating the initial risk level of the target user based on the risk level change value to obtain an updated risk level.
2. The method of claim 1, wherein the risk characteristic factor comprises one or both of the following features:
a behavior risk feature used for evaluating a user behavior risk, and a transaction risk feature used for evaluating a transaction risk;
and the user feedback factor comprises one or both of the following features:
a user explicit feedback feature used for evaluating a risk directly fed back by the user, and a user implicit feedback feature used for evaluating a risk indirectly fed back by the user.
3. The method of claim 1 or 2, wherein the reinforcement learning model is trained according to the following steps:
acquiring a plurality of initial risk levels corresponding to a plurality of users, each user belonging to one of the plurality of initial risk levels;
determining risk characteristic factors and user feedback factors of users under each initial risk level; each risk characteristic factor is used for evaluating the business risk of the corresponding user, and each user feedback factor is used for evaluating the user feedback risk of the corresponding user;
and taking the risk characteristic factors and the user feedback factors of each user under each initial risk level as input data of the reinforcement learning model to be trained, and carrying out at least one round of training on the reinforcement learning model to obtain a pre-trained reinforcement learning model.
4. The method of claim 3, wherein the performing at least one round of training on the reinforcement learning model comprises:
for each initial risk level, taking the risk characteristic factors and the user feedback factors of each user under the initial risk level as input data of the reinforcement learning model to be trained, and determining an updated risk level and a corresponding strategy reward value obtained after an agent in the reinforcement learning model executes a next action according to a current action strategy;
and cyclically executing the following steps until a model convergence condition is reached:
under the condition that it is determined, based on the corresponding strategy reward value, that the model convergence condition is not reached, updating the current action strategy to obtain an updated action strategy;
and taking the risk characteristic factors and the user feedback factors of all users under the updated risk level as input data of the reinforcement learning model to be trained, and determining an updated risk level and a corresponding strategy reward value obtained after the agent in the reinforcement learning model executes a next action according to the updated action strategy.
5. The method of claim 4, wherein the determining an updated risk level and a corresponding strategy reward value obtained after an agent in the reinforcement learning model executes a next action according to a current action strategy comprises:
determining a risk level change value selected from a plurality of risk level change values after an agent in the reinforcement learning model executes the next action according to the current action strategy;
and updating the initial risk level based on the selected risk level change value to obtain an updated risk level.
6. The method of claim 4 or 5, wherein the determining an updated risk level and a corresponding strategy reward value obtained after the agent in the reinforcement learning model executes the next action according to the current action strategy comprises:
determining a state transition probability and an action reward value for the agent in the reinforcement learning model transitioning to the updated risk level after executing the next action according to the current action strategy;
and determining the strategy reward value by multiplying the state transition probability by the action reward value.
7. The method according to claim 4 or 5, wherein the model convergence condition comprises one of the following conditions:
the number of loop iterations reaches a preset number;
the difference between the strategy reward sums corresponding to two consecutive loop iterations is smaller than a preset difference value; the strategy reward sum corresponding to each loop iteration is determined by the sum of the strategy reward values corresponding to the risk levels when the next action is executed according to the corresponding action strategy.
8. The method of claim 1, wherein after said obtaining an updated risk level, said method further comprises:
determining a target risk control intervention intensity value for the target user based on the updated risk level;
and determining a target risk control strategy corresponding to the target risk control intervention intensity value based on the correspondence between risk control intervention intensity values and risk control strategies.
9. The method of claim 8, further comprising:
feeding back the risk handling suggestion information in the target risk control strategy to a client of the target user.
10. A risk rating apparatus, comprising:
the acquisition module is used for responding to a current business operation instruction of a target user and acquiring business operation information corresponding to the business operation instruction;
the determining module is used for determining a risk characteristic factor for evaluating the magnitude of the business risk and a user feedback factor for evaluating the magnitude of the user feedback risk based on the business operation information; inputting the initial risk level of the target user, the determined risk characteristic factor and the user feedback factor into a pre-trained reinforcement learning model, and determining a risk level change value output by the model;
and the updating module is used for updating the initial risk level of the target user based on the risk level change value to obtain an updated risk level.
11. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory communicating over the bus when the electronic device is operating, the machine-readable instructions when executed by the processor performing the risk rating method of any of claims 1 to 9.
12. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, carries out the risk rating method according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211597645.9A CN115809917A (en) | 2022-12-12 | 2022-12-12 | Risk rating method and device, electronic equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211597645.9A CN115809917A (en) | 2022-12-12 | 2022-12-12 | Risk rating method and device, electronic equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115809917A true CN115809917A (en) | 2023-03-17 |
Family
ID=85485687
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211597645.9A Pending CN115809917A (en) | 2022-12-12 | 2022-12-12 | Risk rating method and device, electronic equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115809917A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117251851A (en) * | 2023-11-03 | 2023-12-19 | 广东齐思达信息科技有限公司 | Internet surfing behavior management auditing method |
CN117251851B (en) * | 2023-11-03 | 2024-05-14 | 广东齐思达信息科技有限公司 | Internet surfing behavior management auditing method |
CN117291649A (en) * | 2023-11-27 | 2023-12-26 | 云南电网有限责任公司信息中心 | Intensive marketing data processing method and system |
CN117291649B (en) * | 2023-11-27 | 2024-02-23 | 云南电网有限责任公司信息中心 | Intensive marketing data processing method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||