CN116151890A - Multi-risk insurance set recommendation control method and device based on game algorithm - Google Patents

Multi-risk insurance set recommendation control method and device based on game algorithm

Info

Publication number
CN116151890A
Authority
CN
China
Prior art keywords: insurance, recommended, action, network, actions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310397529.0A
Other languages
Chinese (zh)
Other versions
CN116151890B (en)
Inventor
王磊
漆舒汉
宋健
肖雁飞
吴颖楠
李娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310397529.0A priority Critical patent/CN116151890B/en
Publication of CN116151890A publication Critical patent/CN116151890A/en
Application granted granted Critical
Publication of CN116151890B publication Critical patent/CN116151890B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/02 Marketing; Price estimation or determination; Fundraising
    • G06Q30/0241 Advertisements
    • G06Q30/0251 Targeted advertisements
    • G06Q30/0269 Targeted advertisements based on user profile or attribute
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/042 Backward inferencing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/08 Insurance

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Accounting & Taxation (AREA)
  • Theoretical Computer Science (AREA)
  • Finance (AREA)
  • General Physics & Mathematics (AREA)
  • Development Economics (AREA)
  • Strategic Management (AREA)
  • Physics & Mathematics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Technology Law (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

The application provides a multi-risk insurance set recommendation control method, device, electronic device, and storage medium based on a game algorithm. The method comprises the steps of obtaining the observation information corresponding to each insurance and converting the observation information into an observation information vector; processing the observation information vector through a pre-trained macro decision network to obtain the recommended action of each insurance for a specific user; processing the observation information vector through a pre-trained micro decision network to obtain the annual recommended action (recommended number of years) of each insurance for the specific user; and splicing the recommended actions with the annual recommended actions to obtain the recommended control action of each insurance, wherein the recommended control action is used for controlling the recommendation of the target insurance with the corresponding number of years to the specific user. Through the macro decision network and the micro decision network, each insurance in the multi-risk insurance set can be given combined recommendation control with its corresponding number of years, which improves recommendation accuracy.

Description

Multi-risk insurance set recommendation control method and device based on game algorithm
Technical Field
The application relates to the field of insurance product recommendation, in particular to a multi-risk insurance set recommendation control method and device based on a game algorithm, electronic equipment and a storage medium.
Background
The particularities of insurance products mean that recommendation models for insurance differ considerably from those used in internet retail. Traditionally, insurance is sold after a professional sales consultant meets a client face to face, one on one, and thoroughly assesses the client's needs (by observing, listening, questioning, and probing) before recommending related insurance products, in the hope of closing a sale. The final result of this process therefore often depends on the professional level and business ability of the sales consultant or marketer. Currently, the related art has the following drawbacks:
(1) Rule-based, non-learning strategies select actions according to predefined behavior rules and mainly include expert systems and differential game (countermeasure) methods. However, expert systems can only solve known problems and still require human involvement, while differential game methods are computationally expensive and unsuitable for real-time decision-making;
(2) Artificial intelligence algorithms such as artificial immune systems, genetic algorithms, and approximate dynamic programming are also widely used in this field. These methods apply hierarchical coding schemes to define the behavior of multiple interrelated entities. However, the complex rule bases used by genetic algorithms depend heavily on subjective manual choices of coding scheme and genetic evolution mode and do not generalize well. Artificial immune systems are computationally expensive, converge slowly, and tend to produce overly uniform solutions. The number of discretized trajectory samples required by approximate dynamic programming in the early stage grows exponentially, and its computational load in the later stage affects response speed;
(3) Single-agent deep reinforcement learning algorithms have also been tried in this field; they use a value-approximation network to approximate the action values of an insurance in a continuous state space and select the current best strategy based on those action values. However, single-agent deep reinforcement learning algorithms are difficult to extend to multi-agent cooperative environments, so cooperation among multiple insurance products is hard to realize.
Disclosure of Invention
The main purpose of the embodiments of the present application is to provide a multi-risk insurance set recommendation control method and device, an electronic device, and a storage medium based on a game algorithm. Through the macro decision network and the micro decision network, each insurance in the multi-risk insurance set can be given combined recommendation control with its corresponding number of years, which improves recommendation accuracy.
In order to achieve the above objective, a first aspect of an embodiment of the present application provides a multi-risk insurance set recommendation control method based on a game algorithm, where the method includes:
obtaining observation information corresponding to each insurance, and converting the observation information into an observation information vector, wherein the observation information comprises insurance related information, user behavior information and user attribute information;
processing the observation information vector through a pre-trained macro decision network to obtain recommended actions of each insurance for a specific user, wherein the recommended actions comprise actions recommended to the specific user and actions not recommended to the specific user;
processing the observation information vector through a pre-trained microscopic decision network to obtain the annual recommended actions of each insurance for a specific user;
and splicing the recommended actions with the annual recommended actions to obtain recommended control actions of each insurance, wherein the recommended control actions are used for controlling the recommendation of the target insurance of the corresponding year to the specific user.
In some embodiments, after obtaining the observation information corresponding to each insurance, the method further includes:
detecting whether the observed information is changed according to a preset time period;
and updating the observation information when detecting that the observation information is changed.
In some embodiments, the macro decision network includes an action-value network of insurance, a hybrid network of insurance sets, and a super network, and the processing the observation information vector by the pre-trained macro decision network to obtain recommended actions of each insurance for a specific user includes:
acquiring the observation information vector through the action-value network of the insurance, and calculating to obtain a local action cost function of each insurance;
acquiring the local action cost function of each insurance through the mixed network of the insurance set, and calculating to obtain the joint action cost function of the insurance set;
and solving the joint action cost function of the insurance set through the super network, and calculating to obtain recommended actions of each insurance for a specific user.
In some embodiments, training the macro decision network comprises:
initializing an action-value network of the insurance, a hybrid network of an insurance set, and a super network;
acquiring the observation information vector through the action-value network of the insurance, and calculating to obtain a local action cost function of each insurance;
acquiring the local action cost function of each insurance through the mixed network of the insurance set, and calculating to obtain the joint action cost function of the insurance set;
providing non-negative parameters through the super-network such that the joint action cost function of the insurance set satisfies monotonicity constraints;
determining a loss function according to the combined action cost function of the insurance set and the rewarding information, wherein the rewarding information is feedback information responding to the recommended control action;
and according to the loss function, updating the network parameters through back propagation to train and obtain the macro decision network.
In some embodiments, the expression of the loss function determined according to the reward information and the joint action value function of the insurance set is:
$$\mathcal{L}(\theta)=\frac{1}{b}\sum_{i=1}^{b}\left[\left(y_i^{tot}-Q_{tot}(\boldsymbol{\tau},\boldsymbol{u};\theta)\right)^{2}\right]$$

wherein,

$$y^{tot}=r+\gamma\max_{\boldsymbol{u}'}Q_{tot}(\boldsymbol{\tau}',\boldsymbol{u}';\theta^{-})$$

in the formula, $b$ denotes the number of sample batches, $\mathcal{L}(\theta)$ denotes the loss function, $y^{tot}$ denotes the temporal-difference learning target, $Q_{tot}$ denotes the joint action value function of the insurance set, $\boldsymbol{\tau}$ denotes the joint state of the insurance set, $\boldsymbol{u}$ denotes the joint action of the insurance set, $\theta$ denotes the parameters of the neural network used to fit the joint action value function of the insurance set, $r$ denotes the instant reward, $\gamma$ denotes the discount factor, $\theta^{-}$ denotes the parameters of the target network, $\boldsymbol{\tau}'$ denotes a joint state of the insurance set different from $\boldsymbol{\tau}$, and $\boldsymbol{u}'$ denotes a joint action of the insurance set different from $\boldsymbol{u}$.
In some embodiments, the micro decision network comprises a feature extraction network and a multi-layer perceptron, and training the micro decision network comprises:
initializing the feature extraction network and the multi-layer perceptron;
collecting a plurality of annual recommended action tracks of each insurance oriented to a specific user as expert data, and calculating to obtain state distribution and state-annual recommended action distribution of an expert strategy;
extracting features of the observed information vector through the feature extraction network to obtain a first feature;
processing the first characteristics through the multi-layer perceptron to obtain the annual recommended actions of each insurance facing a specific user, and calculating the annual recommended action distribution of each insurance facing the specific user;
according to the state distribution, state-age recommended action distribution and age recommended action distribution of each insurance facing a specific user of the expert strategy, a behavior cloning method is used to minimize the difference of the age recommended action distribution between the expert strategy and each insurance, and network parameter training is updated to obtain a micro decision network.
In some embodiments, according to the state distribution and state-annual recommended action distribution of the expert policy and the annual recommended action distribution of each insurance for the specific user, a behavior cloning method is used to minimize the annual recommended action distribution difference between the expert policy and each insurance by the following formula:
$$\min_{\pi}\ \mathbb{E}_{s\sim d_{\pi_E}(s)}\left[D_{KL}\!\left(\pi_E(\cdot\mid s)\,\Vert\,\pi(\cdot\mid s)\right)\right]$$

wherein,

$$d_{\pi_E}(s)=\sum_{t=0}^{\infty}\gamma^{t}\,P\!\left(s_t=s\ \middle|\ s_0\sim\rho_0,\ a_t\sim\pi_E\right)$$

$$\rho_{\pi_E}(s,a)=\sum_{t=0}^{\infty}\gamma^{t}\,P\!\left(s_t=s,\ a_t=a\ \middle|\ s_0\sim\rho_0,\ a_t\sim\pi_E\right)$$

in the formula, $d_{\pi_E}(s)$ denotes the state distribution of the expert policy, $t$ denotes the time step, $\gamma^{t}$ denotes the discount factor at time step $t$, $d$ denotes a state distribution, $s$ denotes the state of an insurance, $a$ denotes the action of an insurance, $s_t$ denotes the state of the insurance at time $t$, $s_0$ denotes the initial state of the insurance, $\rho_0$ denotes the initial state distribution, $a_t$ denotes the action of the insurance at time $t$, $\pi_E$ denotes the expert policy, $\rho_{\pi_E}(s,a)$ denotes the state-annual recommended action distribution of the expert policy, $D_{KL}$ denotes the KL divergence, and $\pi$ denotes the learning policy.
To achieve the above object, a second aspect of the embodiments of the present application proposes a multi-risk insurance set recommendation control device based on a gaming algorithm, the device including:
the system comprises an acquisition module, a storage module and a display module, wherein the acquisition module is used for acquiring observation information corresponding to each insurance and converting the observation information into an observation information vector, and the observation information comprises insurance related information, user behavior information and user attribute information;
the first processing module is used for processing the observation information vector through a pre-trained macro decision network to obtain recommended actions of each insurance for a specific user, wherein the recommended actions comprise actions recommended to the specific user and actions not recommended to the specific user;
the second processing module is used for processing the observation information vector through a pre-trained micro decision network to obtain the annual recommended action of each insurance for a specific user;
and the splicing module is used for splicing the recommended actions with the annual recommended actions to obtain recommended control actions of each insurance, wherein the recommended control actions are used for controlling the recommendation of the target insurance of the corresponding number of years to the specific user.
To achieve the above object, a third aspect of the embodiments of the present application proposes an electronic device, which includes a memory and a processor, the memory storing a computer program, the processor implementing the method according to the first aspect when executing the computer program.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of the first aspect.
The application provides a multi-risk insurance set recommendation control method and device, an electronic device, and a storage medium based on a game algorithm. The control method comprises the following steps: obtaining the observation information corresponding to each insurance and converting the observation information into an observation information vector, wherein the observation information comprises insurance related information, user behavior information, and user attribute information; processing the observation information vector through a pre-trained macro decision network to obtain the recommended action of each insurance for a specific user, wherein the recommended actions comprise actions recommended to the specific user and actions not recommended to the specific user; processing the observation information vector through a pre-trained micro decision network to obtain the annual recommended action of each insurance for the specific user; and splicing the recommended actions with the annual recommended actions to obtain the recommended control action of each insurance, wherein the recommended control action is used for controlling the recommendation of the target insurance with the corresponding number of years to the specific user. Through the macro decision network and the micro decision network, each insurance in the multi-risk insurance set can be given combined recommendation control with its corresponding number of years, which improves recommendation accuracy.
Drawings
Fig. 1 is a flowchart of a multi-risk insurance set recommendation control method based on a gaming algorithm according to an embodiment of the present application.
Fig. 2 is a flowchart of steps for processing an observation information vector through a pre-trained macro decision network to obtain recommended actions of each insurance for a specific user according to an embodiment of the present application.
FIG. 3 is a schematic diagram of an action-value network of insurance provided in an embodiment of the present application.
Fig. 4 is a schematic structural diagram of a hybrid network of insurance sets provided in an embodiment of the present application.
Fig. 5 is a flowchart of steps for training a macro decision network provided by an embodiment of the present application.
Fig. 6 is a flowchart of steps for training a micro decision network provided by an embodiment of the present application.
Fig. 7 is a schematic structural diagram of a multi-risk insurance set recommendation control device based on a game algorithm according to an embodiment of the present application.
Fig. 8 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
In the embodiments of the present application, when related processing is required according to user information, user behavior data, user history data, user location information, and other data related to user identity or characteristics, permission or consent of the user is obtained first, and the collection, use, processing, and the like of the data comply with related laws and regulations and standards of related countries and regions. In addition, when the embodiment of the application needs to acquire the sensitive personal information of the user, the independent permission or independent consent of the user is acquired through a popup window or a jump to a confirmation page or the like, and after the independent permission or independent consent of the user is explicitly acquired, necessary user related data for enabling the embodiment of the application to normally operate is acquired.
With the development of unmanned and intelligent technologies, how to complete product recommendation tasks automatically and efficiently has become an important problem of great practical significance. The current mainstream approach to insurance product set recommendation is to have experts collect the information of all insurance products and, after manual or algorithmic processing for different customer characteristics, distribute recommendation strategies for the different insurance types to each insurance. The time consumed by this centralized processing increases with the number of insurance products, so the formulation of the insurance combination strategy is delayed considerably, which ultimately leads to poor results on the insurance product set recommendation task.
In recent years, with the rapid development of Deep Q-Learning, deep reinforcement learning has made great progress and breakthroughs and has been widely explored and applied in fields such as the control of robotic arms and robots, games such as Atari games and Go, multi-turn dialogue systems, and recommendation systems. Deep reinforcement learning trains agents with autonomous, continuous decision-making capability through trial and error and rewards. Therefore, each insurance in the insurance set is trained with a multi-agent reinforcement learning algorithm from deep reinforcement learning, realizing autonomous control of each insurance so that targeted recommendations can be made in simulation according to the characteristics of different users.
Existing multi-agent reinforcement learning algorithms are usually multi-agent actor-critic algorithms that are trained centrally and executed in a decentralized manner. In the simulation of a given task, the insurance set is composed of multiple insurance products with a cooperative "will", i.e., situations in which the user subjectively perceives an insurance combination, for example a bundled package that is more suitable for customers to purchase together, so the insurance products are recommended to users as a combination. When such an algorithm is applied to insurance product set control, the decision granularity under the centralized-training, decentralized-execution framework is too coarse, and recommendations cannot be made accurately for the characteristics of insurance users.
Meanwhile, current schemes for targeted recommendation based on user characteristics only present, in the form of a recommendation list, the several insurance products the user is most interested in; the products are neither recommended as a combination nor recommended together with a number of years.
Based on the above, the embodiments of the present application provide a multi-risk insurance set recommendation control method based on a game algorithm, in which the macro decision network and the micro decision network act as the two sides of a game, so that a corresponding insurance combination can be recommended according to user characteristics. Specifically, the recommended action of each insurance for a specific user is obtained through a pre-trained macro decision network, the annual recommended action of each insurance for the specific user is obtained through a pre-trained micro decision network, and after the game the recommended actions and the annual recommended actions are spliced to obtain the recommended control action of each insurance. The finally obtained recommended control actions are therefore more accurate, each insurance can be recommended in combination together with its number of years, and recommendation accuracy is improved.
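For illustration only, the following minimal sketch (not part of the original disclosure) shows how the macro decision and micro decision outputs could be spliced into one recommendation control action per insurance; the callables macro_net and micro_net, the observation encoding, and the tensor shapes are assumptions.

```python
import torch

def recommend_control_actions(obs_vectors, macro_net, micro_net):
    """Sketch: splice the macro decision (recommend / do not recommend) and the
    micro decision (recommended number of years) into one control action per
    insurance. obs_vectors is assumed to be a tensor of shape
    (num_insurances, obs_dim), one observation vector per insurance."""
    with torch.no_grad():
        recommend_flag = macro_net(obs_vectors).argmax(dim=-1)  # 0 = do not recommend, 1 = recommend
        years_action = micro_net(obs_vectors).argmax(dim=-1)    # index of the recommended number of years
    # Each control action is the pair (recommend?, number of years).
    return torch.stack([recommend_flag, years_action], dim=-1)
```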
Referring to fig. 1, fig. 1 is a flowchart of a multi-risk insurance set recommendation control method based on a gaming algorithm according to an embodiment of the present application, and the method in fig. 1 may include, but is not limited to, steps S101 to S104.
Step S101, obtaining the observation information corresponding to each insurance, and converting the observation information into an observation information vector, wherein the observation information comprises insurance related information, user behavior information and user attribute information.
In the embodiment of the application, an analog simulation platform for insurance set recommendation is first constructed using the insurance data and the user data corresponding to each insurance. The insurance data include the insurance content, participation conditions, participation methods, participation benefits, and so on corresponding to each insurance. The user data corresponding to an insurance include user behavior data and user attribute data. Specifically, the user behavior data corresponding to each insurance are all of a user's operations on each insurance product, such as browsing, clicking, playing, purchasing, searching, favoriting, liking, forwarding, adding to a shopping cart, and even sliding, dwell time at a certain position, fast forwarding, and the like. A user's operation behavior on each insurance product is feedback of the user's most genuine intent; it reflects the user's state of interest in each insurance product, and analyzing user behavior gives deep insight into the user's interest preferences. Depending on whether the behavior directly indicates the user's interest preference for the target insurance product, user behavior is generally classified into explicit behavior and implicit behavior. Explicit behavior directly indicates the user's interests, such as liking or scoring. Implicit behavior, while not directly representing the user's interests, can indirectly feed back changes in the user's interests; as long as the user does not score directly, operations such as clicking, playing, favoriting, commenting, and forwarding all count as implicit feedback. User attribute data, also called user demographic data, are attributes the user carries, such as age, gender, region, education, income, family composition, and occupation. These data are generally stable (e.g., gender) or change slowly (e.g., age). Humans are social beings: different attributes place users in different strata or social circles, which have different behavioral characteristics, life patterns, and preference characteristics, while users in the same circle have a certain similarity, and this similarity provides a certain guarantee for personalized recommendation. The simulation platform can directly acquire the observation information corresponding to each insurance, where the observation information includes insurance related information, user behavior information, and user attribute information. Here, the insurance related information refers to the insurance content, participation conditions, participation methods, participation benefits, and so on; the user behavior information is the user behavior data described above; and the user attribute information is the user attribute data described above. The simulation platform then screens and processes the observation information, converting the coarse-grained observation information into effective observation information vectors suitable for algorithm training.
By way of example, 100 insurance products are integrated together to form an insurance set. The simulation platform needs to acquire the insurance content, participation conditions, participation methods, participation benefits, and so on corresponding to each insurance in the insurance set, as well as the user behavior data and user attribute data corresponding to each insurance, thereby obtaining the observation information corresponding to each insurance. Because the obtained observation information contains repeated or useless information, it needs to be screened and processed, converting the coarse-grained observation information into effective observation information vectors suitable for algorithm training. For example, user 1 has a favoriting record, a large number of click records, browsing records, and so on for insurance A; this information characterizes user 1's interest in insurance A and needs to be processed and converted into an effective information vector suitable for algorithm training.
It should be noted that, since the combined recommendation of corresponding numbers of years for each insurance in the insurance set is made for user characteristics on the basis of the obtained observation information, the accuracy of the observation information ultimately affects the accuracy of the recommendation. Considering that the observation information, namely the insurance related information, the user behavior information, and the user attribute information, is relatively stable, i.e., generally does not change much over a short time, whether the observation information has changed is detected according to a preset time period, and the observation information is updated when a change is detected. For example, it is detected at regular intervals whether insurance related information such as the participation methods, participation conditions, or participation benefits has changed, and whether the user behavior information and user attribute information have changed, such as whether the user's operation behavior on certain insurance products has changed in the most recent period or whether the user's income has changed. When a change is detected, the changed information is used to update the previously acquired observation information. Regularly updating the observation information ensures the accuracy of subsequent recommendations.
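As a concrete (assumed) sketch of this periodic update check, the helper below polls an information source at a preset period and refreshes the cached observation only when a change is detected; fetch_observation, the cache object, and the period value are hypothetical.

```python
import time

def keep_observations_fresh(insurance_ids, fetch_observation, cache, period_seconds=3600):
    """Sketch: detect observation changes on a preset time period and update
    the cached observation information when a change is found."""
    while True:
        for ins_id in insurance_ids:
            latest = fetch_observation(ins_id)   # insurance info + user behavior/attribute info
            if cache.get(ins_id) != latest:      # change detected
                cache[ins_id] = latest           # update the observation information
        time.sleep(period_seconds)               # wait for the next detection period
```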
Step S102, the observed information vector is processed through a pre-trained macro decision network to obtain recommended actions of each insurance for the specific user, wherein the recommended actions comprise actions recommended to the specific user and actions not recommended to the specific user.
In the embodiment of the application, the recommended actions of each insurance for the specific user can be obtained by processing the observed information vector through the pre-trained macro decision network, wherein the recommended actions comprise actions recommended to the specific user and actions not recommended to the specific user.
By way of example, after the observation information vector is processed through the pre-trained macro decision network, because interest preferences of different users on each insurance product in the insurance set can be obtained from the observation information, recommended actions of each insurance for a specific user can be generated. For example, containing A, B, C, D, E insurance products in the insurance set, based on the observed information, it may be determined to recommend insurance a and insurance B to user 1, but not insurance C, insurance D, and insurance E. Insurance a and insurance C are recommended to user 2, while insurance B, insurance D, and insurance E are not recommended. Insurance D and insurance E are recommended to user 3, while insurance a, insurance B and insurance C are not recommended. I.e., each insurance product has a corresponding recommended action or non-recommended action for each user. As for insurance a, a recommended action is generated for a particular user 1, a recommended action is generated for a particular user 2, and an un-recommended action is generated for a particular user 3. Likewise, for insurance B, a recommended action would be generated for particular user 1, an un-recommended action would be generated for particular user 2, and an un-recommended action would be generated for particular user 3. In this way, the recommended actions of each insurance for a particular user can be generated by processing the observed information vector through a pre-trained macro decision network.
Referring to fig. 2, fig. 2 is a flowchart of steps provided in the embodiments of the present application for processing observation information vectors through a pre-trained macro decision network to obtain recommended actions of each insurance for a specific user, including, but not limited to, steps S201 to S203.
Step S201, obtaining an observation information vector through an action-value network of the insurance, and calculating to obtain a local action cost function of each insurance;
step S202, obtaining local action cost functions of each insurance through a mixed network of the insurance sets, and calculating to obtain joint action cost functions of the insurance sets;
and step S203, solving the combined action cost function of the insurance set through the super network, and calculating to obtain recommended actions of each insurance for a specific user.
In the embodiment of the application, the macro decision network comprises an action-value network for each insurance, a hybrid network of the insurance set, and a super network. Referring to fig. 3, fig. 3 is a schematic diagram of the action-value network of an insurance provided in an embodiment of the present application. In fig. 3, $Q_i(\tau_i,u_i)$ is the action value function of the $i$-th insurance, $\tau_i^{t}$ is the state information of the $i$-th insurance at time $t$, and $\tau_i^{t-1}$ is the state information of the $i$-th insurance at time $t-1$. The action value function represents the total reward obtained by performing an action (such as a recommended action or a non-recommended action) in the specific state of this insurance. For example, for insurance A, the reward value obtainable after performing the action of recommending to a specific user can be calculated, as can the reward value obtainable after performing the action of not recommending to the specific user. The state information indicates the current situation of the insurance, such as whether the insurance has already been recommended, the current insurance content, the current price, the current participation conditions, the current participation methods, the current participation benefits, and so on. The GRU is a gated recurrent unit; together with a multi-layer perceptron, it extracts features and constructs the action value function.
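A minimal PyTorch sketch of such a per-insurance action-value network is given below for illustration; the structure (linear feature layer, GRU cell, linear Q head) follows the description above, while the dimensions and names (obs_dim, hidden_dim) are assumptions.

```python
import torch
import torch.nn as nn

class InsuranceAgentQNet(nn.Module):
    """Per-insurance action-value network: an MLP layer extracts features, a GRU
    cell carries state information across time steps, and a linear head outputs
    one Q-value per action (recommend / do not recommend)."""
    def __init__(self, obs_dim: int, hidden_dim: int = 64, n_actions: int = 2):
        super().__init__()
        self.fc = nn.Linear(obs_dim, hidden_dim)
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)
        self.q_head = nn.Linear(hidden_dim, n_actions)

    def forward(self, obs_t, hidden_prev):
        x = torch.relu(self.fc(obs_t))       # features of the current observation
        hidden_t = self.gru(x, hidden_prev)  # state information carried from time t-1 to t
        return self.q_head(hidden_t), hidden_t
```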
In this embodiment of the present application, the action-value network of the insurance as shown in fig. 3 first obtains the observation information vector, and then calculates the local action cost function of each insurance according to the observation vector. For example, the insurance set contains A, B, C, D, E insurance products, and the action-value network of the insurance can calculate a local action value function of the insurance a according to the obtained observation information corresponding to the 5 insurance products, that is, calculate a reward value obtained by executing an action recommended to a specific user in the current state of the insurance a, and execute a reward value obtained by executing an action not recommended to the specific user. And then, according to the greedy strategy, the optimal action selection is carried out, and whether the recommendation of the insurance A to the specific user is selected or not can be determined. Likewise, a local action cost function of insurance B may be calculated, i.e., a prize value obtained by performing an action recommended to a particular user in the current state of insurance B, and a prize value obtained by performing an action not recommended to a particular user. And then, according to the greedy strategy, the optimal action selection is carried out, and whether the recommendation of the insurance B to the specific user is selected or not can be determined. In this manner, recommended actions for each insurance product for a particular user may be initially determined.
Referring to fig. 4, fig. 4 is a schematic structural diagram of the hybrid network of the insurance set provided in an embodiment of the present application. As shown in fig. 4, $Q_{tot}(\boldsymbol{\tau},\boldsymbol{u})$ is the joint action value function of the insurance set. The non-negative parameters of the hybrid network of the insurance set are provided by the super network so that monotonicity constraints are satisfied, namely:

$$\frac{\partial Q_{tot}(\boldsymbol{\tau},\boldsymbol{u})}{\partial Q_i(\tau_i,u_i)}\ge 0,\quad\forall i$$

After the monotonicity constraint is satisfied, taking the optimal joint action $\boldsymbol{u}^{*}$ with respect to the joint action value function of the insurance set is equivalent to each insurance agent taking the optimal action $u_i^{*}$ with respect to its local action value function, that is:

$$\arg\max_{\boldsymbol{u}}Q_{tot}(\boldsymbol{\tau},\boldsymbol{u})=\begin{pmatrix}\arg\max_{u_1}Q_1(\tau_1,u_1)\\ \vdots\\ \arg\max_{u_n}Q_n(\tau_n,u_n)\end{pmatrix}$$

wherein, $Q_{tot}(\boldsymbol{\tau},\boldsymbol{u})$ is the joint action value function of the insurance set, $Q_i(\tau_i,u_i)$ is the action value function of each insurance individual, $\tau_i$ is the observation information vector of the insurance individual, and $\boldsymbol{\tau}$ is the joint state of the insurance set. Here, the joint action means the set of actions that each insurance in the insurance set performs separately. The joint state represents the set of the current states of each insurance in the insurance set; for example, the current state of insurance A in the insurance set includes its current price, participation methods, participation benefits, and so on, and the current state of insurance B likewise includes its current price, participation methods, participation benefits, and so on. Through the joint action value function of the insurance set, the total reward obtained when each insurance in the insurance set performs its corresponding recommended action or non-recommended action can be calculated. Through the action value function of each insurance individual, the reward obtained when that insurance individual performs a recommended action or a non-recommended action can be calculated.
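The following sketch (an assumed implementation in the style of monotonic value function decomposition, not code from the patent) shows how a super network conditioned on the joint state can produce non-negative weights for the hybrid network, so that $Q_{tot}$ is monotonic in every local $Q_i$; names such as state_dim and embed_dim are hypothetical.

```python
import torch
import torch.nn as nn

class InsuranceMixingNet(nn.Module):
    """Hybrid (mixing) network: combines per-insurance Q-values into Q_tot with
    weights produced by hypernetworks conditioned on the joint state; torch.abs
    keeps the weights non-negative, enforcing the monotonicity constraint."""
    def __init__(self, n_agents: int, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.n_agents, self.embed_dim = n_agents, embed_dim
        self.hyper_w1 = nn.Linear(state_dim, n_agents * embed_dim)
        self.hyper_b1 = nn.Linear(state_dim, embed_dim)
        self.hyper_w2 = nn.Linear(state_dim, embed_dim)
        self.hyper_b2 = nn.Sequential(nn.Linear(state_dim, embed_dim),
                                      nn.ReLU(),
                                      nn.Linear(embed_dim, 1))

    def forward(self, agent_qs, state):
        # agent_qs: (batch, n_agents) local Q-values; state: (batch, state_dim) joint state
        b = agent_qs.size(0)
        w1 = torch.abs(self.hyper_w1(state)).view(b, self.n_agents, self.embed_dim)
        b1 = self.hyper_b1(state).view(b, 1, self.embed_dim)
        hidden = torch.relu(torch.bmm(agent_qs.view(b, 1, self.n_agents), w1) + b1)
        w2 = torch.abs(self.hyper_w2(state)).view(b, self.embed_dim, 1)
        b2 = self.hyper_b2(state).view(b, 1, 1)
        q_tot = torch.bmm(hidden, w2) + b2          # (batch, 1, 1)
        return q_tot.view(b, 1)
```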
In the embodiment of the application, the action-value network of each insurance can only determine the recommended action of a single insurance product for a specific user. However, since multiple insurance products are recommended as a group, the optimal combined recommended action of the combination scheme must be considered. Therefore, the joint action value function of the insurance set is further calculated through the hybrid network of the insurance set shown in fig. 4, and the joint action value function of the insurance set is then solved through the super network to calculate the recommended action of each insurance for the specific user.
Illustratively, the insurance-based action-value network calculates that the prize value earned by insurance a performing the recommended action to user 1 is greater, thus determining to recommend insurance a to user 1. Likewise, the prize value obtained by calculating that insurance B performs the action recommended to user 1 is larger, and thus it is determined to recommend insurance B to user 1. The prize value obtained by calculating that insurance C performs the action recommended to user 1 is larger, and thus it is determined to recommend insurance C to user 1. The prize value obtained by calculating the action that insurance D performs to recommend to user 1 is larger, and thus it is determined to recommend insurance D to user 1. The prize value obtained by calculating the action of the insurance E to be recommended to the user 1 is larger, and thus it is determined to recommend the insurance E to the user 1. That is, if only the result of the processing of the action-value network of insurance is based, then the combination of insurance a, insurance B, insurance C, insurance D, and insurance E would be recommended to user 1. But such a decision may not be reasonable. Thus, further processing through the hybrid network of insurance sets is required. That is, the hybrid network of the insurance set needs to determine the joint action cost function of the insurance set, and then calculate the optimal joint action according to the joint action cost function. For example, it is calculated that the obtained prize value is maximum only in the case where the insurance a and the insurance C are recommended to the user 1, and the insurance B, the insurance D, and the insurance E are not recommended to the user 1. At this time, the optimal combination action is an action of insurance a to perform recommendation to the user 1, an action of insurance B to perform recommendation not to the user 1, an action of insurance C to perform recommendation to the user 1, an action of insurance D to perform recommendation not to the user 1, and an action of insurance E to perform recommendation not to the user 1. At this time, the insurance combination scheme recommended to the user 1 is to recommend the user 1 to purchase both the insurance a and the insurance C in the insurance set. Based on this, recommended actions for each insurance for a particular user may be determined.
It should be noted that, the recommended insurance combination scheme is different based on different users. Of course, when the user data is substantially consistent, that is, the attribute information of the user is substantially consistent, the behavior data of the user on each insurance is substantially consistent, the recommended insurance combination scheme may be the same. In this embodiment, after the insurance a performs the corresponding recommended action to the user 1, the obtained reward may be represented by detecting whether the user 1 clicks or browses or purchases the insurance a. For example, after recommending the insurance a to the user 1, it is detected that the user 1 purchases the insurance a, and the prize value is assigned as a. And detecting that the user 1 clicks or browses the content of the insurance A, and assigning the reward value as B. And detecting that the user clicks to be no longer recommended, and assigning a reward value as C.
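The concrete reward values are not specified in the text; the table below is a hypothetical sketch of how the feedback described above (purchase, click or browse, "do not recommend again") could be mapped to the placeholder reward values A, B, and C.

```python
# Hypothetical mapping from observed user feedback on a recommended insurance
# to a scalar reward; the numeric values stand in for A, B and C above.
REWARD_TABLE = {
    "purchased": 1.0,            # reward value A: the user bought the insurance
    "clicked_or_browsed": 0.2,   # reward value B: the user viewed the insurance content
    "dismissed": -0.5,           # reward value C: the user clicked "do not recommend again"
}

def reward_from_feedback(feedback: str) -> float:
    """Return the reward corresponding to the detected user feedback."""
    return REWARD_TABLE.get(feedback, 0.0)   # neutral reward when no feedback is observed
```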
Referring to fig. 5, fig. 5 is a flowchart of steps for training a macro decision network provided in an embodiment of the present application, including but not limited to steps S501 to S506.
Step S501, initializing action-value network of insurance, hybrid network and super network of insurance set;
step S502, obtaining an observation information vector through an action-value network of the insurance, and calculating to obtain a local action cost function of each insurance;
Step S503, obtaining local action cost functions of each insurance through a mixed network of the insurance sets, and calculating to obtain joint action cost functions of the insurance sets;
step S504, providing non-negative parameters through a super network so that the joint action cost function of the insurance set meets monotonicity constraint;
step S505, determining a loss function according to the combined action cost function of the rewarding information and the insurance set, wherein the rewarding information is feedback information responding to the recommended control action;
step S506, according to the loss function, the macro decision network is obtained through the backward propagation updating network parameter training.
In the embodiment of the application, the macro decision network comprises the action-value network of each insurance, the hybrid network of the insurance set, and the super network. When training the macro decision network, the action-value network of each insurance, the hybrid network of the insurance set, and the super network need to be randomly initialized. It will be appreciated that training of the macro decision network is also performed on the analog simulation platform. Similarly, the simulation platform can acquire the observation information and reward information corresponding to each insurance, where the reward information is feedback information in response to the recommended control action. For example, after insurance A is recommended to user 1, whether user 1 clicks, browses, or purchases insurance A can be detected; this operation behavior is the feedback information, which is then mapped to the corresponding reward information. The simulation platform converts the observation information into effective observation information vectors, and the obtained effective observation information vectors are input into the action-value network of each insurance to obtain the local action value function of each insurance; the action-value network of each insurance then performs optimal action selection according to a greedy strategy.
After the local action value function of each insurance is obtained, it is further input into the hybrid network of the insurance set, and the joint action value function of the insurance set is obtained through calculation. A loss function is then constructed according to the reward information and the joint action value function of the insurance set, the network weights are updated through back propagation according to the loss function, and training is repeated until an end condition is reached, after which the trained macro decision network is output. The end condition includes reaching a preset number of iterations or convergence of the loss function. The constructed loss function is as follows:
$$\mathcal{L}(\theta)=\frac{1}{b}\sum_{i=1}^{b}\left[\left(y_i^{tot}-Q_{tot}(\boldsymbol{\tau},\boldsymbol{u};\theta)\right)^{2}\right]$$

in the formula, $b$ represents the number of sample batches sampled from the experience replay pool, $Q_{tot}(\boldsymbol{\tau},\boldsymbol{u};\theta)$ represents the joint action value function, $\boldsymbol{\tau}$ represents the joint state of the insurance set, $\boldsymbol{u}$ represents the joint action of the insurance set, $\theta$ represents the parameters of the neural network used to fit the joint action value function, and $y^{tot}$ represents the temporal-difference learning target, which is calculated from the instant reward and the optimal action value as follows:

$$y^{tot}=r+\gamma\max_{\boldsymbol{u}'}Q_{tot}(\boldsymbol{\tau}',\boldsymbol{u}';\theta^{-})$$

in the formula, $r$ denotes the instant reward, $\gamma$ denotes the discount factor used to calculate the cumulative discounted reward, $\theta^{-}$ denotes the parameters of the target network used to alleviate the over-estimation problem of the value function, $\boldsymbol{\tau}'$ denotes the next joint state of the insurance set (different from $\boldsymbol{\tau}$), and $\boldsymbol{u}'$ denotes the next joint action of the insurance set (different from $\boldsymbol{u}$). The target network is a neural network with the same structure as the action-value network and mitigates overly frequent fluctuation (updates) of the action-value network.
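For illustration, a minimal (assumed) training-step sketch of this loss is given below; how q_tot and the target-network value are produced, as well as the batch layout, are hypothetical and follow the description above.

```python
import torch
import torch.nn.functional as F

def macro_td_loss(q_tot, q_tot_target_next, reward, gamma=0.99):
    """Sketch of the macro decision network loss: mean squared error between
    Q_tot and the temporal-difference target y_tot computed with the target
    network parameters theta^-.

    q_tot:             (batch, 1) joint action values of the sampled joint actions
    q_tot_target_next: (batch, 1) max joint action values of the next joint state,
                       evaluated with the target network
    reward:            (batch, 1) instant rewards derived from user feedback
    """
    y_tot = reward + gamma * q_tot_target_next.detach()  # temporal-difference learning target
    return F.mse_loss(q_tot, y_tot)                      # averaged over the sampled batch
```

In a full training loop this loss would be back-propagated to update the agent networks, the hybrid network, and the super network, while the target network parameters are refreshed only periodically.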
It should be noted that, in the training process of the macro decision network, the recommendation effect may be determined by acquiring feedback information after each insurance performs a corresponding recommendation action to a specific user. For example, after the insurance a executes the recommended action to the user 1, the simulation platform may acquire corresponding feedback information, for example, it may be determined that the recommended effect corresponding to the recommended action in this iteration is better when it is detected that the user 1 collects the insurance a. After the recommended effect in a certain iteration number reaches the preset effect, the model can be determined to be trained.
It can be understood that in the training process of the macro decision network, each iteration, the simulated simulation environment provides different observation information corresponding to each insurance, so that the macro decision network can generate different recommended actions of each insurance for a specific user according to the different observation information. Thereby achieving the purpose of training.
In the embodiment of the application, the macro decision network is trained based on a monotonic value function decomposition algorithm, that is, the macro decision network is learned using the monotonic value function decomposition algorithm. The monotonic value function decomposition algorithm is a multi-agent deep reinforcement learning algorithm that can make decisions autonomously according to the environment state, which reduces the cost of manual decision-making and the risk of decision errors.
It can be understood that in the embodiment of the present application, the macro decision network mainly considers the recommendation types of the insurance sets, outputs the recommendation actions for the specific user, and performs the collaborative combination between the insurance from the overall recommendation policy.
And step S103, processing the observation information vector through a pre-trained micro decision network to obtain the age recommended action of each insurance for a specific user.
In the embodiment of the application, the observed information vector is processed through a pre-trained micro decision network to obtain the age recommended action of each insurance for a specific user. Specifically, the micro decision network comprises a feature extraction network and a multi-layer perceptron, and after the feature extraction network and the multi-layer perceptron are initialized, the observation information vector is input into the feature extraction network to obtain an implicit feature representation in the observation information. For example, some implicit actions in the user behavior information, while not directly representing the interests of the user, may indirectly feed back the interest changes of the user, including browsing, clicking, playing, collecting, commenting, forwarding, etc. The information may be extracted to obtain an implicit characteristic representation. The implicit feature representations are then input to a multi-tier perceptron to obtain age recommended actions for each insurance for the particular user. That is, more specific age recommended actions of each insurance for a specific user can be further obtained according to the implicit characteristic representation. For example, based on analysis of the observation information, the income of the user 1 is found to be higher, higher insurance expense can be borne, meanwhile, a plurality of insurance purchased by the user 1 is obtained from the insurance history purchase record of the user 1, and the purchase years corresponding to the insurance can be obtained, so that the insurance of the corresponding years can be recommended to the user 1 correspondingly. Such as recommending to user 1 to purchase insurance a for 5 years.
For example, after the observation information vector is processed through the pre-trained micro decision network, because interest preferences of different users on each insurance product in the insurance set and the insurance years possibly purchased can be obtained from the observation information, the years recommended action of each insurance for a specific user can be generated. For example, containing A, B, C, D, E total of 5 insurance products in the insurance set, based on the observation information, it can be determined that the age of the recommended insurance a to the user 1 is 5 years and the age of the insurance B is 10 years. The age of insurance a is recommended to user 2 as 3 years and the age of insurance C as 5 years. The user 3 is recommended to be insurance D for 10 years and insurance E for 2 years. I.e., each insurance product has a corresponding annual recommended action for each user. As for insurance a, a corresponding year recommendation action is generated for a particular user 1 and a corresponding year recommendation action is also generated for a particular user 2. In this way, the age recommended actions of each insurance for a particular user can be generated by processing the observed information vector through the pre-trained micro decision network.
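A minimal sketch of such a micro decision network is shown below for illustration; the candidate set of years, obs_dim, and the layer sizes are assumptions, while the structure (a feature extraction network followed by a multi-layer perceptron producing a distribution over recommended numbers of years) follows the description above.

```python
import torch
import torch.nn as nn

class MicroDecisionNet(nn.Module):
    """Feature extraction network plus multi-layer perceptron that outputs a
    probability distribution over candidate recommended numbers of years."""
    def __init__(self, obs_dim: int, n_year_options: int = 6, hidden_dim: int = 128):
        super().__init__()
        self.feature_net = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        self.mlp = nn.Sequential(nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
                                 nn.Linear(hidden_dim, n_year_options))

    def forward(self, obs_vector):
        features = self.feature_net(obs_vector)   # implicit feature representation
        logits = self.mlp(features)               # one logit per candidate number of years
        return torch.softmax(logits, dim=-1)      # distribution over recommended years
```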
Referring to fig. 6, fig. 6 is a flowchart of steps provided in an embodiment of the present application for training a micro decision network, including but not limited to steps S601 to S605.
Step S601, initializing a feature extraction network and a multi-layer perceptron;
step S602, collecting a plurality of annual recommended action tracks of each insurance oriented to a specific user as expert data, and calculating to obtain state distribution and state-annual recommended action distribution of an expert strategy;
step S603, extracting features of the observed information vector through a feature extraction network to obtain a first feature;
step S604, processing the first characteristics through a multi-layer perceptron to obtain the annual recommended actions of each insurance facing a specific user, and calculating the annual recommended action distribution of each insurance facing the specific user;
step S605, according to the state distribution, state-age recommended action distribution and age recommended action distribution of each insurance facing to a specific user, using a behavior cloning method to minimize the difference of the age recommended action distribution between the expert policy and each insurance, and updating the network parameters to train to obtain a micro decision network.
In an embodiment of the present application, the micro decision network includes a feature extraction network and a multi-layer perceptron. When training the micro decision network, the feature extraction network and the multi-layer perceptron need to be randomly initialized. It will be appreciated that the training of the micro decision network is also performed on an analog simulation platform. Firstly, a certain amount of high-quality age recommended action tracks of each insurance oriented to a specific user are collected as expert data, and state distribution and state-age recommended action distribution of expert strategies are calculated. The state distribution calculation formula of the expert strategy is as follows:
$$d_{\pi_E}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{P}(s_t = s)\ \middle|\ s_0 \sim \rho_0,\ a_t \sim \pi_E\right]$$

where $t$ denotes the time step, $\gamma^{t}$ denotes the discount factor at time $t$, $d$ denotes a state distribution, $s$ denotes the state of an insurance, $a$ denotes the action of an insurance, $s_t$ denotes the state of the insurance at time $t$, $s_0$ denotes the initial state of the insurance, $\rho_0$ denotes the initial state distribution, $a_t$ denotes the action of the insurance at time $t$, $\pi_E$ denotes the expert policy, and $d_{\pi_E}(s)$ denotes the state distribution of the expert policy, i.e. the cumulative discounted probability expectation of being in state $s$ under the condition that the initial state $s_0$ satisfies the distribution $\rho_0$ and the actions $a_t$ satisfy the expert policy $\pi_E$. Here, the state of an insurance may include the price of the insurance, the participation conditions of the insurance, the participation mode, the participation welfare, etc., and the action of an insurance includes an action of recommending it to a specific user and an action of not recommending it to the specific user.
The state-age recommended action distribution calculation formula of the expert strategy is as follows:
$$d_{\pi_E}(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{P}(s_t = s,\ a_t = a)\ \middle|\ s_0 \sim \rho_0,\ a_t \sim \pi_E\right]$$

where $d_{\pi_E}(s, a)$ denotes the state-age recommended action distribution of the expert policy, i.e. the cumulative discounted probability expectation of taking action $a$ in state $s$ under the condition that the initial state $s_0$ satisfies the distribution $\rho_0$ and the actions $a_t$ satisfy the expert policy $\pi_E$.
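As a purely illustrative aid (not part of the claimed method; the function name, trajectory encoding and discount value are assumptions), the two discounted distributions above can be estimated from the collected expert data by Monte-Carlo accumulation over the annual recommended action tracks:

```python
from collections import defaultdict

def discounted_occupancy(trajectories, gamma=0.99):
    """Monte-Carlo estimate of the expert policy's discounted state distribution
    d_piE(s) and state-action distribution d_piE(s, a) from expert trajectories.
    Each trajectory is a list of (state, action) pairs; states and actions are
    assumed hashable, e.g. tuples of discretised insurance attributes and terms."""
    d_state = defaultdict(float)
    d_state_action = defaultdict(float)
    for traj in trajectories:
        for t, (s, a) in enumerate(traj):
            w = gamma ** t                 # discount factor at time step t
            d_state[s] += w
            d_state_action[(s, a)] += w
    # Normalise so each estimated distribution sums to 1
    z_s = sum(d_state.values()) or 1.0
    z_sa = sum(d_state_action.values()) or 1.0
    return ({s: v / z_s for s, v in d_state.items()},
            {sa: v / z_sa for sa, v in d_state_action.items()})

# Example with two toy expert trajectories of (state, recommended term) pairs
expert_trajs = [[("s0", 5), ("s1", 10)], [("s0", 3), ("s2", 5)]]
d_s, d_sa = discounted_occupancy(expert_trajs, gamma=0.9)
```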
Similarly, the simulation platform acquires the observation information corresponding to each insurance in the insurance set. The observation information is converted into an effective observation information vector, which is input into the feature extraction network; the feature extraction network extracts features of the observation information vector to obtain the implicit feature representation of the observation information, the implicit feature representation is input into the multi-layer perceptron to obtain the annual recommended action of each insurance, and the annual recommended action distribution of each insurance is then calculated. Then, according to the state distribution and state-age recommended action distribution of the expert strategy and the annual recommended action distribution of each insurance for the specific user, a behavior cloning method is used to minimize the action distribution difference between the expert strategy and each insurance, and the network weights are updated and training is repeated to obtain the micro decision network. The objective function of the behavior cloning method is calculated as follows:
$$\min_{\pi}\ \mathbb{E}_{s \sim d_{\pi_E}(s)}\left[ D_{\mathrm{KL}}\!\left( \pi_E(\cdot \mid s)\ \middle\|\ \pi(\cdot \mid s) \right) \right]$$

where $D_{\mathrm{KL}}$ denotes the KL divergence, $\pi$ denotes the learning strategy, $d_{\pi_E}(s)$ denotes the state distribution of the expert policy, $d_{\pi_E}(s, a)$ denotes the state-age recommended action distribution of the expert policy (with $d_{\pi_E}(s, a) = d_{\pi_E}(s)\,\pi_E(a \mid s)$), $\pi_E$ denotes the expert policy, $s$ denotes the state of an insurance, and $a$ denotes the action of an insurance.
In the embodiment of the application, the micro decision network is trained by a behavior cloning method: the learning strategy is fitted to the expert strategy by minimizing the KL divergence, so that the two distributions are as similar as possible, thereby achieving the purpose of imitating the expert. Because behavior cloning belongs to imitation learning, more accurate control can be achieved by fitting the expert behavior tracks, so that the decisions made for each insurance are similar to those of the expert strategy.
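As a minimal sketch of one such behavior cloning update under these definitions (the stand-in network, tensor shapes and all names are assumptions, not the application's implementation): with one-hot expert year labels, minimizing the KL divergence between the expert strategy and the learning strategy reduces to a cross-entropy loss over expert (state, annual recommended action) pairs, which is what the code below optimizes.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Stand-in policy network with the same output shape as the micro decision sketch above:
# 32-dim observation -> 5 insurances x 5 candidate terms
net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 5 * 5))

def behavior_cloning_step(policy, optimizer, obs_batch, expert_year_batch):
    """One behavior cloning update: fit the learning strategy pi to the expert
    strategy pi_E over states drawn from the expert state distribution; with
    one-hot expert year labels the KL objective reduces to cross-entropy."""
    logits = policy(obs_batch).view(-1, 5, 5)        # (batch, insurances, year options)
    loss = F.cross_entropy(
        logits.flatten(0, 1),                        # merge batch and insurance dimensions
        expert_year_batch.flatten(),                 # expert term index per insurance
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
obs_batch = torch.randn(8, 32)                       # observation vectors from expert data
expert_year_batch = torch.randint(0, 5, (8, 5))      # expert term index for each of 5 insurances
behavior_cloning_step(net, optimizer, obs_batch, expert_year_batch)
```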
In the embodiment of the application, the micro decision network mainly considers the specific purchase term of each insurance product, outputs the annual recommended action for the specific user, and makes its recommendation decision from the perspective of a single insurance product.
Step S104, splicing the recommended actions and the annual recommended actions to obtain recommended control actions of each insurance, wherein the recommended control actions are used for controlling the recommendation of the target insurance of the corresponding age to the specific user.
In the embodiment of the application, the recommended action of each insurance for the specific user is generated through the macro decision network, and the annual recommended action of each insurance for the specific user is generated through the micro decision network. The two may conflict: for example, based on the macro decision network it is decided to recommend insurance A to user 1 and not to recommend insurance B to user 1, while based on the micro decision network it is decided to recommend purchasing insurance A for 5 years and insurance B for 2 years; both recommend insurance A, but the actions for insurance B conflict. Therefore, the two are spliced after game play. Specifically, the recommended actions generated by the macro decision network and the annual recommended actions generated by the micro decision network for the specific user are spliced after game play to obtain the recommended control action of each insurance, so that the target insurance of the corresponding term can be recommended to the specific user. For example, the macro decision network determines that insurance A and insurance C are recommended to user 1, while the micro decision network determines that insurance A is recommended to user 1 with a recommended purchase term of 5 years, insurance B with a recommended purchase term of 5 years, and insurance C with a recommended purchase term of 5 years. After game play and splicing of these two schemes, the final insurance combination recommendation scheme recommends insurance A to user 1 with a purchase term of 5 years and insurance C with a purchase term of 5 years. The recommended control action is then used to control the recommendation of insurance A for 5 years and insurance C for 5 years to user 1.
It will be appreciated that, since both the macro decision network and the micro decision network generate their recommended actions (or insurance combination recommendation schemes) based on the same observation information, the types of insurance recommended to a particular user in the two schemes do not differ greatly. For example, for user 1, the insurance combination recommendation scheme generated through the macro decision network may recommend insurance A and insurance C, while the scheme generated through the micro decision network may recommend insurance A, insurance B and insurance C, i.e. the two are similar. It would not happen that the macro decision network recommends insurance A and insurance C while the micro decision network recommends the completely different insurance B, insurance D and insurance E. Therefore, the recommended actions generated by the macro decision network and the annual recommended actions generated by the micro decision network for the specific user can be effectively spliced.
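As a simplified sketch of the splicing step (the dictionary representation and the rule of keeping only the insurances that the macro decision network recommends are one reasonable resolution assumed here, not the application's exact game procedure):

```python
def splice_actions(macro_actions, year_actions):
    """Splice the macro decision network's recommend / not-recommend action with the
    micro decision network's term recommendation for each insurance. Only insurances
    the macro network decides to recommend are kept, and each kept insurance carries
    the purchase term recommended by the micro network."""
    control_actions = {}
    for insurance, recommend in macro_actions.items():
        if recommend:                                  # 1 = recommend to the specific user
            control_actions[insurance] = year_actions[insurance]
    return control_actions

# Example from the description: the macro network recommends A and C to user 1,
# the micro network recommends 5-year terms for A, B and C
macro_actions = {"A": 1, "B": 0, "C": 1, "D": 0, "E": 0}
year_actions = {"A": 5, "B": 5, "C": 5, "D": 10, "E": 2}
print(splice_actions(macro_actions, year_actions))     # {'A': 5, 'C': 5}
```

With these inputs, only insurance A and insurance C survive the splice, each carrying the 5-year term from the micro decision network, which matches the example in the description above.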
Referring to fig. 7, the embodiment of the present application further provides a multi-risk insurance set recommendation control device 70 based on a game algorithm, which can implement the multi-risk insurance set recommendation control method, where the device includes:
The acquisition module 701 is configured to acquire observation information corresponding to each insurance, and convert the observation information into an observation information vector, where the observation information includes insurance related information, user behavior information, and user attribute information;
the first processing module 702 is configured to process the observation information vector through a pre-trained macro decision network, and obtain recommended actions of each insurance for a specific user, where the recommended actions include actions recommended to the specific user and actions not recommended to the specific user;
the second processing module 703 is configured to process the observation information vector through a pre-trained micro decision network, so as to obtain a year recommended action of each insurance for a specific user;
and the splicing module 704 is used for splicing the recommended actions and the annual recommended actions to obtain recommended control actions of each insurance, wherein the recommended control actions are used for controlling the recommendation of the target insurance of the corresponding age to the specific user.
The specific implementation of the multi-risk insurance set recommendation control device is basically the same as the specific embodiment of the multi-risk insurance set recommendation control method based on the game algorithm, and is not repeated here.
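Purely as an architectural sketch (the class, method and parameter names are hypothetical), the four modules of the device can be viewed as a thin pipeline around the two trained decision networks:

```python
class RecommendationControlDevice:
    """Sketch of the recommendation control device: the four modules mirror the
    method steps (acquire, macro decision, micro decision, splice). The callables
    passed in stand for the trained networks; names are illustrative only."""

    def __init__(self, acquire, macro_net, micro_net, splice):
        self.acquire = acquire      # acquisition module: observation -> observation vector
        self.macro_net = macro_net  # first processing module: macro decision network
        self.micro_net = micro_net  # second processing module: micro decision network
        self.splice = splice        # splicing module: combine the two kinds of actions

    def recommend(self, raw_observation, user_id):
        obs_vector = self.acquire(raw_observation)
        recommend_actions = self.macro_net(obs_vector, user_id)
        year_actions = self.micro_net(obs_vector, user_id)
        return self.splice(recommend_actions, year_actions)
```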
The embodiment of the application also provides electronic equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the multi-risk insurance set recommendation control method based on the game algorithm when executing the computer program. The electronic equipment can be any intelligent terminal including a tablet personal computer, a vehicle-mounted computer and the like.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application, where the electronic device includes:
the processor 801 may be implemented by a general-purpose CPU (central processing unit), a microprocessor, an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits, etc., and is configured to execute related programs to implement the technical solutions provided by the embodiments of the present application;
the memory 802 may be implemented in the form of a read-only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM). The memory 802 may store an operating system and other application programs; when the technical solutions provided in the embodiments of the present disclosure are implemented by software or firmware, the relevant program code is stored in the memory 802, and the processor 801 invokes it to execute the multi-risk insurance set recommendation control method based on a game algorithm of the embodiments of the present disclosure;
an input/output interface 803 for implementing information input and output;
the communication interface 804 is configured to implement communication interaction between the device and other devices, and may implement communication in a wired manner (e.g., USB, network cable, etc.), or may implement communication in a wireless manner (e.g., mobile network, WIFI, bluetooth, etc.);
A bus 805 that transfers information between the various components of the device (e.g., the processor 801, the memory 802, the input/output interface 803, and the communication interface 804);
wherein the processor 801, the memory 802, the input/output interface 803, and the communication interface 804 implement communication connection between each other inside the device through a bus 805.
The embodiment of the application also provides a storage medium, which is a computer readable storage medium, and the storage medium stores a computer program, and the computer program realizes the multi-risk insurance set recommendation control method based on the game algorithm when being executed by a processor.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and as those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the technical solutions shown in the figures do not constitute limitations of the embodiments of the present application, and may include more or fewer steps than shown, or may combine certain steps, or different steps.
The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "and/or" for describing the association relationship of the association object, the representation may have three relationships, for example, "a and/or B" may represent: only a, only B and both a and B are present, wherein a, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is merely a logical function division, and there may be another division manner in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including multiple instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing a program.
Preferred embodiments of the present application are described above with reference to the accompanying drawings, and thus do not limit the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A multi-risk insurance set recommendation control method based on a game algorithm is characterized by comprising the following steps:
obtaining observation information corresponding to each insurance, and converting the observation information into an observation information vector, wherein the observation information comprises insurance related information, user behavior information and user attribute information;
processing the observation information vector through a pre-trained macro decision network to obtain recommended actions of each insurance for a specific user, wherein the recommended actions comprise actions recommended to the specific user and actions not recommended to the specific user;
processing the observation information vector through a pre-trained microscopic decision network to obtain the annual recommended actions of each insurance for a specific user;
and splicing the recommended actions with the annual recommended actions to obtain recommended control actions of each insurance, wherein the recommended control actions are used for controlling the recommendation of the target insurance of the corresponding year to the specific user.
2. The method of claim 1, wherein after obtaining the observation information corresponding to each insurance, the method further comprises:
detecting whether the observed information is changed according to a preset time period;
And updating the observation information when detecting that the observation information is changed.
3. The method of claim 1, wherein the macro decision network comprises an action-value network of insurance, a hybrid network of insurance sets, and a super network, wherein the processing of the observed information vector by the pre-trained macro decision network results in recommended actions of each insurance for a particular user, comprising:
acquiring the observation information vector through the action-value network of the insurance, and calculating to obtain a local action cost function of each insurance;
acquiring the local action cost function of each insurance through the mixed network of the insurance set, and calculating to obtain the joint action cost function of the insurance set;
and solving the joint action cost function of the insurance set through the super network, and calculating to obtain recommended actions of each insurance for a specific user.
4. A method according to claim 3, wherein training the macro decision network comprises:
initializing an action-value network of the insurance, a hybrid network of an insurance set, and a super network;
acquiring the observation information vector through the action-value network of the insurance, and calculating to obtain a local action cost function of each insurance;
Acquiring the local action cost function of each insurance through the mixed network of the insurance set, and calculating to obtain the joint action cost function of the insurance set;
providing non-negative parameters through the super-network such that the joint action cost function of the insurance set satisfies monotonicity constraints;
determining a loss function according to the combined action cost function of the insurance set and the rewarding information, wherein the rewarding information is feedback information responding to the recommended control action;
and according to the loss function, obtaining the macro decision network through back propagation updating network parameter training.
5. The method of claim 4, wherein the expression of the loss function determined according to the combined action cost function of the insurance set and the rewarding information is:
$$L(\theta) = \sum_{i=1}^{b}\left[\left(y_i^{tot} - Q_{tot}(\boldsymbol{\tau}, \boldsymbol{u};\, \theta)\right)^{2}\right]$$

wherein,

$$y^{tot} = r + \gamma \max_{\boldsymbol{u}'} Q_{tot}(\boldsymbol{\tau}', \boldsymbol{u}';\, \theta^{-})$$

where $b$ denotes the number of sample batches, $L(\theta)$ denotes the loss function, $y^{tot}$ denotes the temporal-difference learning target, $Q_{tot}$ denotes the joint action cost function of the insurance set, $\boldsymbol{\tau}$ denotes the joint state of the insurance set, $\boldsymbol{u}$ denotes the joint action of the insurance set, $\theta$ denotes the parameters of the neural network used to fit the joint action cost function of the insurance set, $r$ denotes the instant reward, $\gamma$ denotes the discount factor, $\theta^{-}$ denotes the parameters of the target network, $\boldsymbol{\tau}'$ denotes a joint state of the insurance set different from $\boldsymbol{\tau}$, and $\boldsymbol{u}'$ denotes a joint action of the insurance set different from $\boldsymbol{u}$.
6. The method of claim 1, wherein the micro decision network comprises a feature extraction network and a multi-layer perceptron, and training the micro decision network comprises:
initializing the feature extraction network and the multi-layer perceptron;
collecting a plurality of annual recommended action tracks of each insurance oriented to a specific user as expert data, and calculating to obtain state distribution and state-annual recommended action distribution of an expert strategy;
extracting features of the observed information vector through the feature extraction network to obtain a first feature;
processing the first characteristics through the multi-layer perceptron to obtain the annual recommended actions of each insurance facing a specific user, and calculating the annual recommended action distribution of each insurance facing the specific user;
according to the state distribution, state-age recommended action distribution and age recommended action distribution of each insurance facing a specific user of the expert strategy, a behavior cloning method is used to minimize the difference of the age recommended action distribution between the expert strategy and each insurance, and network parameter training is updated to obtain a micro decision network.
7. The method of claim 6, wherein, according to the state distribution and state-age recommended action distribution of the expert policy and the age recommended action distribution of each insurance for the specific user, the age recommended action distribution difference between the expert policy and each insurance is minimized using a behavior cloning method by the following formula:
$$\min_{\pi}\ \mathbb{E}_{s \sim d_{\pi_E}(s)}\left[ D_{\mathrm{KL}}\!\left( \pi_E(\cdot \mid s)\ \middle\|\ \pi(\cdot \mid s) \right) \right]$$

wherein,

$$d_{\pi_E}(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{P}(s_t = s)\ \middle|\ s_0 \sim \rho_0,\ a_t \sim \pi_E\right]$$

$$d_{\pi_E}(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t}\, \mathbb{P}(s_t = s,\ a_t = a)\ \middle|\ s_0 \sim \rho_0,\ a_t \sim \pi_E\right]$$

where $d_{\pi_E}(s)$ denotes the state distribution of the expert policy, $t$ denotes the time step, $\gamma^{t}$ denotes the discount factor at time $t$, $d$ denotes a state distribution, $s$ denotes the state of an insurance, $a$ denotes the action of an insurance, $s_t$ denotes the state of the insurance at time $t$, $s_0$ denotes the initial state of the insurance, $\rho_0$ denotes the initial state distribution, $a_t$ denotes the action of the insurance at time $t$, $\pi_E$ denotes the expert policy, $d_{\pi_E}(s, a)$ denotes the state-age recommended action distribution of the expert policy, $D_{\mathrm{KL}}$ denotes the KL divergence, and $\pi$ denotes the learning strategy.
8. A multi-risk insurance set recommendation control device based on a game algorithm, the device comprising:
the system comprises an acquisition module, a storage module and a display module, wherein the acquisition module is used for acquiring observation information corresponding to each insurance and converting the observation information into an observation information vector, and the observation information comprises insurance related information, user behavior information and user attribute information;
The first processing module is used for processing the observation information vector through a pre-trained macro decision network to obtain recommended actions of each insurance for a specific user, wherein the recommended actions comprise actions recommended to the specific user and actions not recommended to the specific user;
the second processing module is used for processing the observation information vector through a pre-trained micro decision network to obtain the annual recommended action of each insurance for a specific user;
and the splicing module is used for splicing the recommended actions with the annual recommended actions to obtain recommended control actions of each insurance, wherein the recommended control actions are used for controlling the recommendation of the target insurance of the corresponding age to the specific user.
9. An electronic device comprising a memory storing a computer program and a processor implementing the method of any of claims 1 to 7 when the computer program is executed by the processor.
10. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method of any one of claims 1 to 7.