WO2021208720A1 - Method and apparatus for service allocation based on reinforcement learning - Google Patents

Method and apparatus for service allocation based on reinforcement learning Download PDF

Info

Publication number
WO2021208720A1
WO2021208720A1 · PCT/CN2021/083817 · CN2021083817W
Authority
WO
WIPO (PCT)
Prior art keywords
preset
participant
feature vector
participating terminal
value
Prior art date
Application number
PCT/CN2021/083817
Other languages
French (fr)
Chinese (zh)
Inventor
朱星华
王健宗
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021208720A1 publication Critical patent/WO2021208720A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06312Adjustment or analysis of established resource schedule, e.g. resource or task levelling, or dynamic rescheduling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • This application relates to the machine learning field of artificial intelligence, and in particular to a reinforcement-learning-based business allocation method, apparatus, device, and storage medium.
  • The inventor realized that, because the quality of the business information provided by different sub-enterprises varies, assessing each sub-enterprise's business contribution solely by the volume of business information it provides is not only unconvincing but also encourages a sub-enterprise, under the goal of maximizing its own interests, to obtain a large business contribution score with large amounts of low-quality business information. This has a strongly negative effect on the accuracy of the overall business allocation, resulting in low allocation accuracy.
  • This application provides a reinforcement-learning-based business allocation method, apparatus, device, and storage medium, which are used to improve the accuracy of business allocation.
  • The first aspect of this application provides a reinforcement-learning-based business allocation method, including:
  • obtaining feature vector information based on the institutional private data of multiple participating terminals, and predicting a selection probability for the feature vector information through a preset evaluator to obtain the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;
  • sampling the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal;
  • updating the model parameters of a preset business evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain an updated federated evaluation model, and calculating a reward value through the updated federated evaluation model, where the reward value indicates the cumulative return of the feature vector information to the updated federated evaluation model;
  • evaluating the value of the feature vector information through the preset evaluator and the reward value to obtain the participant contribution degree corresponding to each participating terminal; and
  • allocating business to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant business allocation information.
  • The second aspect of this application provides a reinforcement-learning-based business allocation device, including a memory, a processor, and a reinforcement-learning-based business allocation program stored in the memory and executable on the processor, where the processor implements the following steps when executing the program:
  • obtaining feature vector information based on the institutional private data of multiple participating terminals, and predicting a selection probability for the feature vector information through a preset evaluator to obtain the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;
  • sampling the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal;
  • updating the model parameters of a preset business evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain an updated federated evaluation model, and calculating a reward value through the updated federated evaluation model, where the reward value indicates the cumulative return of the feature vector information to the updated federated evaluation model;
  • evaluating the value of the feature vector information through the preset evaluator and the reward value to obtain the participant contribution degree corresponding to each participating terminal; and
  • allocating business to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant business allocation information.
  • The third aspect of this application provides a computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to execute the following steps:
  • obtaining feature vector information based on the institutional private data of multiple participating terminals, and predicting a selection probability for the feature vector information through a preset evaluator to obtain the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;
  • sampling the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal;
  • updating the model parameters of a preset business evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain an updated federated evaluation model, and calculating a reward value through the updated federated evaluation model, where the reward value indicates the cumulative return of the feature vector information to the updated federated evaluation model;
  • evaluating the value of the feature vector information through the preset evaluator and the reward value to obtain the participant contribution degree corresponding to each participating terminal; and
  • allocating business to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant business allocation information.
  • The fourth aspect of this application provides a reinforcement-learning-based business allocation apparatus, including:
  • a prediction module, configured to obtain feature vector information based on the institutional private data of multiple participating terminals and to predict a selection probability for the feature vector information through a preset evaluator, obtaining the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;
  • a sampling module, configured to sample the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal, obtaining the sampling gradient information corresponding to each participating terminal;
  • an update module, configured to update the model parameters of a preset business evaluation federated model according to the sampling gradient information corresponding to each participating terminal, obtaining an updated federated evaluation model, and to calculate a reward value through the updated federated evaluation model, where the reward value indicates the cumulative return of the feature vector information to the updated federated evaluation model;
  • an evaluation module, configured to evaluate the value of the feature vector information through the preset evaluator and the reward value, obtaining the participant contribution degree corresponding to each participating terminal; and
  • an allocation module, configured to allocate business to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal, obtaining the participant business allocation information.
  • In the technical solution provided by this application, the feature vector information based on the institutional private data of the multiple participating terminals is selectively adopted according to the selection probability output by the preset evaluator; the model parameters of the preset business evaluation federated model are updated according to the sampling gradient information corresponding to each participating terminal; the value of the feature vector information is evaluated through the preset evaluator and the reward value; and business is allocated to the multiple participating terminals accordingly. This reduces computational complexity, improves the accuracy and efficiency of participant contribution evaluation, and thereby improves the accuracy of business allocation.
  • FIG. 1 is a schematic diagram of an embodiment of the reinforcement-learning-based business allocation method in an embodiment of this application;
  • FIG. 2 is a schematic diagram of another embodiment of the reinforcement-learning-based business allocation method in an embodiment of this application;
  • FIG. 3 is a schematic diagram of an embodiment of the reinforcement-learning-based business allocation apparatus in an embodiment of this application;
  • FIG. 4 is a schematic diagram of another embodiment of the reinforcement-learning-based business allocation apparatus in an embodiment of this application;
  • FIG. 5 is a schematic diagram of an embodiment of the reinforcement-learning-based business allocation device in an embodiment of this application.
  • The embodiments of this application provide a reinforcement-learning-based business allocation method, apparatus, device, and storage medium, which improve the accuracy of business allocation.
  • An embodiment of the reinforcement-learning-based business allocation method in the embodiments of this application includes the following steps.
  • The execution subject of this application may be a reinforcement-learning-based business allocation apparatus, or a terminal or a server; it is not specifically limited here. The embodiments of this application are described with the server as the execution subject.
  • The multiple participating terminals include, but are not limited to, terminals or servers corresponding to medical institutions, insurance institutions, and financial institutions, for example, terminals or servers corresponding to units within financial institutions, banks, or other finance-related institutions or centers.
  • The multiple participating terminals can correspond to institutions of the same type (for example, all to financial institutions) or of different types; for example, with three participating terminals, one may correspond to a financial institution, another to an insurance institution, and the third to a medical institution.
  • Institutional private data is the non-shared, encrypted private data of a participating terminal, for example, the private medical data of medical institutions or the private data of the various financial businesses of financial institutions.
  • The feature vector information can be the gradient information of the model parameters obtained when each participating terminal performs gradient descent on a model using its institutional private data, or the proportion of each participating terminal's institutional private data, within the private data of all institutions, used for processing a certain business.
  • It can be understood that the server constructs the preset business evaluation federated model in advance based on the private data of the multiple participating institutions. The server receives the feature vector information based on institutional private data sent by the multiple participating terminals and calls the preset evaluator.
  • The preset evaluator is a deep neural network that includes a proportional selection algorithm. The fitness of each piece of feature vector information is calculated through the proportional selection algorithm, and from the fitness the cumulative probability of each piece of feature vector information being inherited to the next-generation population is calculated. A uniformly distributed random number in the interval [0, 1] is generated, feature vector information is selected according to the random number, and the probability values of the feature vector information selected for each participating terminal are normalized to obtain the selection probability corresponding to each participating terminal, as sketched below.
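  • As an illustration only, the following Python sketch shows one way such a proportional (roulette-wheel) selection step can work; the fitness scores and the function name are hypothetical, since the text does not fix how fitness is computed.

```python
import numpy as np

def roulette_select(fitness, rng=None):
    """A minimal sketch of proportional (roulette-wheel) selection:
    normalize hypothetical fitness scores, accumulate them, and select
    with a uniformly distributed random number in [0, 1]."""
    rng = rng or np.random.default_rng()
    fitness = np.asarray(fitness, dtype=float)
    probs = fitness / fitness.sum()        # per-item selection probability
    cumulative = np.cumsum(probs)          # cumulative inheritance probability
    r = rng.uniform(0.0, 1.0)              # uniform random number in [0, 1]
    return probs, int(np.searchsorted(cumulative, r))

probs, chosen = roulette_select([0.8, 2.4, 4.0, 3.3])
```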
  • The preset evaluator is used to evaluate the gradient value provided by each participating terminal. The gradient value is the degree to which the gradient information contributes to model training, or the degree to which each participating terminal's institutional private data contributes, within the preset business evaluation federated model, to a certain business processing direction over the private data of all institutions.
  • In one embodiment, the feature vector information is gradient information, and the preset sampler can sample it in several ways, as sketched below. The server can call a deterministic algorithm in the preset sampler to generate a pseudo-random number sequence in [0, 1] for the gradient information and sample the gradient information randomly according to that sequence, obtaining the sampling gradient information corresponding to each participating terminal. The server can instead call the preset sampler to split all gradient information, or each participating terminal's gradient information, into multiple segments according to its time order and extract data from each segment according to the selection probability corresponding to each participating terminal. The server can also call the preset sampler to classify all gradient information into one or more preset categories, randomly extract gradient information from each category according to each participating terminal's selection probability, and combine the extracted gradient information to obtain the sampling gradient information corresponding to each participating terminal.
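  • A minimal Python sketch of the category-based (stratified) variant follows; the random category assignment and the array shapes are assumptions made for the example.

```python
import numpy as np

def stratified_sample(gradients, select_prob, num_categories=4, seed=0):
    """A minimal sketch: classify gradient records into categories, draw
    from each category in proportion to the participant's selection
    probability, and recombine the draws."""
    rng = np.random.default_rng(seed)
    categories = rng.integers(0, num_categories, size=len(gradients))
    sampled = []
    for c in range(num_categories):
        idx = np.flatnonzero(categories == c)
        if len(idx) == 0:
            continue
        k = max(1, int(round(select_prob * len(idx))))
        sampled.append(gradients[rng.choice(idx, size=k, replace=False)])
    return np.concatenate(sampled)

sampled = stratified_sample(np.random.randn(100, 8), select_prob=0.3)
```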
  • In one embodiment, the server normalizes the sampling gradient information corresponding to each participating terminal to obtain comprehensive gradient information, and continuously adjusts and updates the model parameters of the preset business evaluation federated model according to the comprehensive gradient information, thereby obtaining the updated federated evaluation model.
  • The server can also use a preset attention mechanism to perform an attention calculation on the sampling gradient information corresponding to each participating terminal, obtaining attention gradient information corresponding to each participating terminal, and then adjust and update the model parameters of the preset business evaluation federated model according to that attention gradient information. This preserves the characteristics of all the sampling gradient information while weighting the parameter update to a certain degree, yielding the updated federated evaluation model; a sketch follows.
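  • The sketch below shows one plausible form of this aggregation-and-update step; the softmax attention weighting and the learning rate are assumptions, since the text does not specify the attention mechanism.

```python
import numpy as np

def federated_update(params, participant_grads, attention_scores, lr=0.01):
    """A minimal sketch: weight each participant's sampled gradient with
    softmax attention, aggregate into a comprehensive gradient, and take
    one gradient-descent step on the federated model parameters."""
    w = np.exp(attention_scores - np.max(attention_scores))
    w /= w.sum()                                            # attention weights
    comprehensive = sum(wi * g for wi, g in zip(w, participant_grads))
    return params - lr * comprehensive                      # updated parameters
```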
  • In one embodiment, the server receives the institutional private data of the multiple participating terminals, divides it into an institutional data verification set according to a preset ratio, and obtains the sampling gradient information corresponding to the verification set through the sampling process described above, thereby obtaining the verification set data of the feature vector information. The updated federated evaluation model is verified with the verification set data to obtain a verified federated evaluation model; the verification loss value of that model and the moving-average loss value over a preset time period are calculated, and the difference between the verification loss value and the moving-average loss value gives the reward value. The reward value can be the total return of the preset business evaluation federated model during training (which may include a revenue signal or an information gain); it includes positive and negative reward values and is used to obtain the optimal strategy and/or the best path.
  • The reward value can be the cumulative return of the best strategy and/or best path with which the preset business evaluation federated model processes the private data of all institutions for a certain business, for example, the cumulative return of the best strategy for allocating insurance order amount data items.
  • In one embodiment, the server obtains a business contribution influence factor and evaluates the business contribution value of the feature vector information through the preset evaluator, the reward value, and the business contribution influence factor, obtaining the participant contribution degree corresponding to each participating terminal.
  • The participant contribution degree can measure the contribution of a participating institution's private data to the training of the preset business evaluation federated model, or its contribution to the model's effective and accurate processing of the private data of all institutions.
  • The business contribution influence factor is a factor used in calculating the business contribution degree. For example, when the business is training an insurance order amount prediction model based on federated learning, the influence factors include the accuracy of the insurance orders, the information provided, and the importance of the business type.
  • Business is allocated to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal, obtaining the participant business allocation information.
  • The server can judge whether the participant contribution degree corresponding to each participating terminal is less than a preset target value. If so, the participating terminal is removed and does not take part in the business allocation; if not, business is allocated to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal, obtaining the participant business allocation information (see the sketch below). The participant business allocation information may be income distribution, reward distribution, and/or a priority setting based on the contribution of the institutional private data.
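  • A minimal sketch of this judge-and-allocate step, assuming a hypothetical preset target value and a simple proportional split:

```python
def allocate(contributions, total, target=0.05):
    """A minimal sketch: remove participants whose contribution degree is
    below the preset target value, then split `total` among the rest in
    proportion to their contribution degrees."""
    kept = {p: c for p, c in contributions.items() if c >= target}
    if not kept:
        return {p: 0.0 for p in contributions}
    s = sum(kept.values())
    return {p: total * kept[p] / s if p in kept else 0.0 for p in contributions}
```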
  • In the embodiment of this application, the feature vector information based on the institutional private data of the multiple participating terminals is selectively adopted according to the selection probability output by the preset evaluator; the model parameters of the preset business evaluation federated model are updated according to the sampling gradient information corresponding to each participating terminal; the value of the feature vector information is evaluated through the preset evaluator and the reward value; and business is allocated to the multiple participating terminals accordingly. This reduces computational complexity, improves the accuracy and efficiency of participant contribution evaluation, and further improves the accuracy of business allocation.
  • Another embodiment of the reinforcement-learning-based business allocation method in the embodiments of this application includes the following steps.
  • Obtain feature vector information based on the institutional private data of multiple participating terminals, and predict a selection probability for the feature vector information through the preset evaluator to obtain the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal.
  • Specifically, the server sends a model gradient calculation instruction to each of the multiple participating terminals, so that each participating terminal obtains its institutional private data according to the instruction, trains a preset participant allocation model with that data, and calculates the parameter gradient of the trained model through a preset gradient descent algorithm, obtaining the feature vector information corresponding to each participating terminal.
  • The institutional private data includes at least one of the private medical data of medical institutions, the private financial business data of financial institutions, and the private insurance data of insurance institutions. The server receives the feature vector information sent by each participating terminal and calculates a selection probability for it through the gradient value function in the preset evaluator, obtaining the selection probability corresponding to each participating terminal, as sketched below.
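  • A minimal sketch of such a gradient value function; the one-layer sigmoid form and the toy dimensions are assumptions, since the text does not specify the function's architecture.

```python
import numpy as np

def h(gradient_vec, W, b):
    """A minimal sketch of a gradient value function: a hypothetical
    one-layer network that maps a participant's gradient vector to a
    selection probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(W @ gradient_vec + b)))   # sigmoid output

W, b = np.random.randn(4), 0.0            # toy parameters for a 4-dim gradient
prob = h(np.random.randn(4), W, b)        # selection probability for one terminal
```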
  • For example, suppose the multiple participating terminals are terminal 1 corresponding to financial institution 1, terminal 2 corresponding to financial institution 2, and terminal 3 corresponding to financial institution 3, and the server sends model gradient calculation instructions to terminals 1, 2, and 3.
  • Terminal 1 extracts the corresponding financial business private data 1 from its database according to the model gradient calculation instruction, inputs it into the preset participant allocation model 1, performs business allocation processing (that is, training) on the data through the model, and calculates the parameter gradient of the trained model through the preset gradient descent algorithm, obtaining the feature vector information 1 corresponding to terminal 1. Terminal 1 then sends feature vector information 1 to the server; a sketch of this participant-side step follows.
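  • The sketch below illustrates the participant-side step for a hypothetical linear participant allocation model trained with mean-squared error; the model form, loss, and hyperparameters are assumptions.

```python
import numpy as np

def local_feature_vector(w, X, y, lr=0.1, steps=5):
    """A minimal sketch: train the local allocation model on the
    institutional private data (X, y) by gradient descent, then return
    the final parameter gradient as the feature vector information."""
    def grad(w):
        return 2 * X.T @ (X @ w - y) / len(y)   # mean-squared-error gradient
    for _ in range(steps):
        w = w - lr * grad(w)                    # local training step
    return grad(w)                              # sent to the server

fv = local_feature_vector(np.zeros(8), np.random.randn(50, 8), np.random.randn(50))
```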
  • Through the multinomial distribution algorithm in the preset sampler, a selection vector corresponding to each participating terminal is computed from the feature vector information according to the selection probability corresponding to each participating terminal; the feature vector information includes the model gradient information of each participating terminal.
  • When the selection vector corresponding to a participating terminal equals a preset value (preferably 1), the model gradient information of that participating terminal is sampled randomly, systematically, or in a stratified manner according to the selection vector, obtaining the sampling gradient information corresponding to that participating terminal; otherwise the model gradient information is not sampled, and the above process of obtaining the selection vectors is repeated. A sketch of this selection-vector step follows.
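  • A minimal sketch: each participating terminal's 0/1 selection vector is drawn from its selection probability (a Bernoulli draw, the two-outcome case of a multinomial); the terminal names are illustrative.

```python
import numpy as np

def selection_vectors(select_probs, seed=0):
    """A minimal sketch: draw a 0/1 selection vector per participating
    terminal; model gradients are sampled only where the vector is 1."""
    rng = np.random.default_rng(seed)
    return {p: int(rng.binomial(1, q)) for p, q in select_probs.items()}

s = selection_vectors({"terminal_1": 0.9, "terminal_2": 0.4, "terminal_3": 0.7})
```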
  • According to the sampling gradient information corresponding to each participating terminal, the model parameters of the preset business evaluation federated model are updated to obtain the updated federated evaluation model, and the reward value is calculated through the updated federated evaluation model; the reward value indicates the cumulative return of the feature vector information to the updated federated evaluation model.
  • Specifically, the server normalizes the sampling gradient information corresponding to each participating terminal to obtain comprehensive gradient information, and continuously adjusts and updates the model parameters of the preset business evaluation federated model according to the comprehensive gradient information, thereby obtaining the updated federated evaluation model.
  • The server can also use the preset attention mechanism to perform an attention calculation on the sampling gradient information corresponding to each participating terminal, obtaining the attention gradient information corresponding to each participating terminal, and then adjust and update the model parameters of the preset business evaluation federated model according to that attention gradient information. This preserves the characteristics of all the sampling gradient information while weighting the parameter update to a certain degree, yielding the updated federated evaluation model.
  • The server obtains the verification set data of the feature vector information, verifies the verification set data through the updated federated evaluation model to obtain a verification result, calculates the verification loss value of the verification result and the moving-average loss value of the preset time period, and calculates the difference between the verification loss value and the moving-average loss value to obtain the reward value.
  • Specifically, when the server receives the feature vector information of the institutional private data of the multiple participating terminals, it divides the feature vector information into verification set data according to a preset ratio, verifies the verification set data through the updated federated evaluation model to obtain the verification result, and calculates the verification loss value of the verification result through a preset loss value calculation formula of the form

$$l_v = \frac{1}{M}\sum_{k=1}^{M} \ell\big(f_\theta(x_k^v),\, y_k^v\big)$$

  where $l_v$ indicates the verification loss value, the superscript $v$ indicates that a data item belongs to the verification set (that is, to the verification result), $M$ indicates the number of data items in the verification set data, $k$ indexes the $k$-th data item, $f_\theta$ represents the updated federated evaluation model, $x$ represents the input data (the verification set data), and $y$ represents the corresponding label of the verification set data.
  • The server calculates the moving-average loss value for the preset time period through a preset formula of the form

$$\delta = \frac{1}{T}\sum_{t=1}^{T} l_v^{(t)}$$

  where $l_v$ represents the verification loss value, $T$ represents the moving-average window length of the preset time period, and $\delta$ represents the moving-average benchmark of the preset time period. The verification loss value is subtracted from the moving-average loss value to obtain the reward value, $r = \delta - l_v$, as sketched below.
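  • A minimal sketch of this reward computation, assuming a hypothetical window length:

```python
from collections import deque

class RewardTracker:
    """A minimal sketch: the reward is the moving-average loss benchmark
    minus the current verification loss, so a verification loss below the
    recent average yields a positive reward."""
    def __init__(self, window=10):
        self.losses = deque(maxlen=window)          # window length T

    def reward(self, verification_loss):
        benchmark = (sum(self.losses) / len(self.losses)
                     if self.losses else verification_loss)
        self.losses.append(verification_loss)
        return benchmark - verification_loss        # r = delta - l_v
```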
  • The value of the feature vector information is evaluated through the preset evaluator and the reward value, obtaining the participant contribution degree corresponding to each participating terminal.
  • Specifically, the server calculates the loss function of the preset evaluator from the reward value and the selection vectors through a preset Monte Carlo policy gradient algorithm; trains the preset evaluator through that loss function until the loss function converges, obtaining the target evaluator; and evaluates the value of the feature vector information through the target evaluator, obtaining the participant contribution degree corresponding to each participating terminal.
  • The server calculates the loss function of the preset evaluator from the reward value and the selection vectors through the calculation formula of the Monte Carlo policy gradient (REINFORCE) algorithm, which takes the form

$$l_h = -\, r \cdot \frac{1}{N}\sum_{i=1}^{N}\Big[s_i \log h_\phi(g_i) + (1 - s_i)\log\big(1 - h_\phi(g_i)\big)\Big]$$

  where $r$ represents the reward value, $N$ represents the number of selection vectors, $i$ indexes the $i$-th selection vector, $s_i$ represents the selection vector, $g_i$ represents the sampling gradient information, and $h_\phi(g_i)$ represents the gradient value function.
  • The server updates the trainable parameter $\phi$ of the preset evaluator through the loss function of the preset evaluator, thereby training the preset evaluator and obtaining the target evaluator. The update formula for the trainable parameter takes the form

$$\phi \leftarrow \phi - \alpha \nabla_\phi l_h$$

  where $\alpha$ represents the learning rate and $l_h$ represents the loss function of the preset evaluator; a sketch of one such update follows.
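  • A minimal sketch of one REINFORCE-style update of the evaluator parameters, assuming hypothetical callables `h` (the gradient value function) and `grad_h` (its derivative with respect to the parameters):

```python
import numpy as np

def reinforce_update(phi, grads, selections, r, h, grad_h, alpha=0.01):
    """A minimal sketch: accumulate the policy gradient of the selection
    log-likelihood, scale by the reward, and descend the evaluator loss."""
    dphi = np.zeros_like(phi)
    for g_i, s_i in zip(grads, selections):
        p = h(phi, g_i)                              # selection probability
        # d/dphi [s*log p + (1-s)*log(1-p)] = (s - p)/(p*(1-p)) * dp/dphi
        dphi += (s_i - p) / (p * (1.0 - p)) * grad_h(phi, g_i)
    grad_loss = -r * dphi / len(grads)               # gradient of l_h
    return phi - alpha * grad_loss                   # phi <- phi - alpha * grad
```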
  • The server evaluates the value of the feature vector information through the target evaluator, obtaining the participant contribution degree corresponding to each participating terminal.
  • Business is allocated to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal, obtaining the participant business allocation information.
  • Specifically, the server obtains the contribution percentage value of the participant contribution degree corresponding to each participating terminal and judges whether the contribution percentage value is greater than a preset threshold. If it is, the preset distribution strategy is invoked to allocate business to the multiple participating terminals, obtaining the participant business allocation information; if the contribution percentage value is less than or equal to the preset threshold, business is allocated to the multiple participating terminals according to the contribution percentage value, obtaining the participant business allocation information.
  • The server judges whether the participant contribution degree corresponding to each participating terminal is less than a preset contribution degree. If so, the participant contribution degrees below the preset contribution degree are eliminated, and the ratio of each remaining participating terminal's contribution degree to the sum of the remaining contribution degrees is calculated, giving the contribution percentage value of each participating terminal; if not, the ratio of each participating terminal's contribution degree to the total contribution degree is calculated, giving the contribution percentage value of each participating terminal.
  • For example, take the business allocation to be remuneration allocation, so that the participant business allocation information is the participants' remuneration allocation information. Suppose the participant contribution degrees are 0.03 (participating terminal 1), 0.24 (participating terminal 2), 0.40 (participating terminal 3), and 0.33 (participating terminal 4); the preset contribution degree is 0.20; the preset threshold is 0.40; and the total remuneration to be distributed is 1,000,000. The contribution of 0.03 is then eliminated, giving participant business allocation information 1 (participating terminal 1 is paid 0 yuan), and the remuneration of the remaining terminals is allocated according to 0.24 (participating terminal 2), 0.40 (participating terminal 3), and 0.33 (participating terminal 4). Under the preset distribution strategy, in addition to the amount distributed from the total according to the contribution percentage value, an extra amount is allocated, for example an additional 100,000. A sketch of this computation follows.
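  • A minimal sketch of the worked example above; the bonus rule (an extra amount when a contribution percentage value exceeds the preset threshold) is one plausible reading of the strategy described.

```python
def remuneration(contribs, total=1_000_000, min_contrib=0.20,
                 threshold=0.40, bonus=100_000):
    """A minimal sketch: drop contributions below the preset contribution
    degree, pay the rest proportionally, and add a bonus where the
    contribution percentage value exceeds the preset threshold."""
    kept = {p: c for p, c in contribs.items() if c >= min_contrib}
    shares = {p: c / sum(kept.values()) for p, c in kept.items()}
    pay = {p: 0.0 for p in contribs}
    for p, share in shares.items():
        pay[p] = total * share + (bonus if share > threshold else 0.0)
    return pay

print(remuneration({"t1": 0.03, "t2": 0.24, "t3": 0.40, "t4": 0.33}))
```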
  • After the server allocates business to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal and obtains the participant business allocation information, it also obtains abnormal information of the participant business allocation information, updates the participant business allocation information according to the abnormal information, and optimizes the strategy for determining the selection probability corresponding to each participating terminal.
  • Specifically, the server encrypts the participant business allocation information and sends it to the auditing side, which decrypts and audits it. If abnormal information exists in the participant business allocation information, the abnormal information is fed back to the server, and the server matches a corresponding optimization mechanism to the abnormal information.
  • The optimization mechanism includes optimization algorithms, optimization strategies, optimized execution processes, and optimized execution scripts.
  • The abnormal information in the participant business allocation information is corrected (updated) through the optimization mechanism, and the strategy for determining the selection probability corresponding to each participating terminal is likewise optimized through it. The determination strategy includes the model selection, model calculation, and feature vector information selection used for the selection probability corresponding to each participating terminal. Updating and optimizing this determination strategy improves the convenience, accuracy, and efficiency of the calculations in the execution of the reinforcement-learning-based business allocation method, thereby improving the accuracy of business allocation.
  • In the embodiment of this application, the feature vector information based on the institutional private data of the multiple participating terminals is selectively adopted according to the selection probability output by the preset evaluator; the model parameters of the preset business evaluation federated model are updated according to the sampling gradient information corresponding to each participating terminal; the value of the feature vector information is evaluated through the preset evaluator and the reward value; and business is allocated to the multiple participating terminals accordingly. This reduces computational complexity, improves the accuracy and efficiency of participant contribution evaluation, and further improves the accuracy of business allocation.
  • Referring to FIG. 3, an embodiment of the reinforcement-learning-based business allocation apparatus in the embodiments of this application includes:
  • the prediction module 301, configured to obtain feature vector information based on the institutional private data of multiple participating terminals and to predict a selection probability for the feature vector information through a preset evaluator, obtaining the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;
  • the sampling module 302, configured to sample the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal, obtaining the sampling gradient information corresponding to each participating terminal;
  • the update module 303, configured to update the model parameters of the preset business evaluation federated model according to the sampling gradient information corresponding to each participating terminal, obtaining the updated federated evaluation model, and to calculate the reward value through the updated federated evaluation model, where the reward value indicates the cumulative return of the feature vector information to the updated federated evaluation model;
  • the evaluation module 304, configured to evaluate the value of the feature vector information through the preset evaluator and the reward value, obtaining the participant contribution degree corresponding to each participating terminal;
  • the allocation module 305, configured to allocate business to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal, obtaining the participant business allocation information.
  • Each module in the above reinforcement-learning-based business allocation apparatus corresponds to a step in the above embodiment of the reinforcement-learning-based business allocation method; its functions and implementation are not repeated here.
  • In the embodiment of this application, the feature vector information based on the institutional private data of the multiple participating terminals is selectively adopted according to the selection probability output by the preset evaluator; the model parameters of the preset business evaluation federated model are updated according to the sampling gradient information corresponding to each participating terminal; the value of the feature vector information is evaluated through the preset evaluator and the reward value; and business is allocated to the multiple participating terminals accordingly. This reduces computational complexity, improves the accuracy and efficiency of participant contribution evaluation, and further improves the accuracy of business allocation.
  • Referring to FIG. 4, another embodiment of the reinforcement-learning-based business allocation apparatus in the embodiments of this application includes:
  • the prediction module 301, configured to obtain feature vector information based on the institutional private data of multiple participating terminals and to predict a selection probability for the feature vector information through a preset evaluator, obtaining the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;
  • the sampling module 302, configured to sample the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal, obtaining the sampling gradient information corresponding to each participating terminal;
  • The sampling module 302 specifically includes:
  • the calculation unit 3021, configured to compute, through the multinomial distribution algorithm in the preset sampler, a selection vector corresponding to each participating terminal from the feature vector information according to the selection probability corresponding to each participating terminal, where the feature vector information includes the model gradient information of each participating terminal;
  • the sampling unit 3022, configured to sample the model gradient information of each participating terminal according to its selection vector when that selection vector equals a preset value, obtaining the sampling gradient information corresponding to each participating terminal;
  • the update module 303, configured to update the model parameters of the preset business evaluation federated model according to the sampling gradient information corresponding to each participating terminal, obtaining the updated federated evaluation model, and to calculate the reward value through the updated federated evaluation model, where the reward value indicates the cumulative return of the feature vector information to the updated federated evaluation model;
  • the evaluation module 304, configured to evaluate the value of the feature vector information through the preset evaluator and the reward value, obtaining the participant contribution degree corresponding to each participating terminal;
  • the allocation module 305, configured to allocate business to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal, obtaining the participant business allocation information.
  • Optionally, the evaluation module 304 can also be specifically configured to: calculate the loss function of the preset evaluator from the reward value and the selection vectors through the preset Monte Carlo policy gradient algorithm; train the preset evaluator through that loss function until the loss function converges, obtaining the target evaluator; and evaluate the value of the feature vector information through the target evaluator, obtaining the participant contribution degree corresponding to each participating terminal.
  • Optionally, the prediction module 301 can also be specifically configured to: send a model gradient calculation instruction to each of the multiple participating terminals, so that each participating terminal obtains its institutional private data according to the instruction, trains the preset participant allocation model with that data, and calculates the parameter gradient of the trained model through the preset gradient descent algorithm, obtaining the feature vector information corresponding to each participating terminal, where the institutional private data includes at least one of the private medical data of medical institutions, the private financial business data of financial institutions, and the private insurance data of insurance institutions; and to receive the feature vector information corresponding to each participating terminal and calculate its selection probability through the gradient value function in the preset evaluator, obtaining the selection probability corresponding to each participating terminal.
  • Optionally, the update module 303 can also be specifically configured to: obtain the verification set data of the feature vector information; verify the verification set data through the updated federated evaluation model to obtain the verification result; calculate the verification loss value of the verification result and the moving-average loss value of the preset time period; and calculate the difference between the verification loss value and the moving-average loss value to obtain the reward value.
  • Optionally, the allocation module 305 can also be specifically configured to: obtain the contribution percentage value of the participant contribution degree corresponding to each participating terminal and judge whether it is greater than the preset threshold; if the contribution percentage value is greater than the preset threshold, invoke the preset distribution strategy to allocate business to the multiple participating terminals, obtaining the participant business allocation information; and if the contribution percentage value is less than or equal to the preset threshold, allocate business to the multiple participating terminals according to the contribution percentage value, obtaining the participant business allocation information.
  • Optionally, the reinforcement-learning-based business allocation apparatus further includes:
  • the update optimization module 306, configured to obtain abnormal information of the participant business allocation information, update the participant business allocation information according to the abnormal information, and optimize the strategy for determining the selection probability corresponding to each participating terminal.
  • Each module and unit in the above reinforcement-learning-based business allocation apparatus corresponds to a step in the above embodiment of the reinforcement-learning-based business allocation method; their functions and implementation are not repeated here.
  • In the embodiment of this application, the feature vector information based on the institutional private data of the multiple participating terminals is selectively adopted according to the selection probability output by the preset evaluator; the model parameters of the preset business evaluation federated model are updated according to the sampling gradient information corresponding to each participating terminal; the value of the feature vector information is evaluated through the preset evaluator and the reward value; and business is allocated to the multiple participating terminals accordingly. This reduces computational complexity, improves the accuracy and efficiency of participant contribution evaluation, and further improves the accuracy of business allocation.
  • FIG. 5 is a schematic structural diagram of a reinforcement-learning-based business allocation device provided by an embodiment of this application. The reinforcement-learning-based business allocation device 500 may differ considerably depending on configuration or performance, and may include one or more processors (central processing units, CPU) 510 (for example, one or more processors), a memory 520, and one or more storage media 530 (for example, one or more mass storage devices) storing application programs 533 or data 532.
  • The memory 520 and the storage medium 530 may be short-term storage or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the business allocation device 500. The processor 510 may be configured to communicate with the storage medium 530 and execute the series of instruction operations in the storage medium 530 on the business allocation device 500.
  • The reinforcement-learning-based business allocation device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, for example, Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on.
  • A person skilled in the art can understand that the structure shown in FIG. 5 does not constitute a limitation on the reinforcement-learning-based business allocation device, which may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.
  • The computer-readable storage medium may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. Instructions are stored in the computer-readable storage medium, and when the instructions are run on a computer, the computer executes the steps of the reinforcement-learning-based business allocation method.
  • The computer-readable storage medium may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function, and so on, and the data storage area may store data created according to the use of blockchain nodes, and so on.
  • The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks associated by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. The blockchain can include the underlying blockchain platform, the platform product service layer, and the application service layer.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part contributing to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods in the various embodiments of this application.
  • The aforementioned storage media include media that can store program code, such as USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, and optical disks.

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Data Mining & Analysis (AREA)
  • Operations Research (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Educational Administration (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Development Economics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A method and an apparatus for service allocation based on reinforcement learning, a device, and a storage medium, relating to the technical field of artificial intelligence, and used to improve the accuracy of service allocation. The method for service allocation based on reinforcement learning comprises: performing selection probability prediction on feature vector information that is based on the private institutional data of a plurality of participants to obtain a selection probability; by means of the selection probability, sampling the feature vector information to obtain sample gradient information; on the basis of the sample gradient information, updating model parameters of a pre-set federated service evaluation model to obtain an updated federated evaluation model, and by means of the updated federated evaluation model, calculating a reward value; by means of a pre-set evaluator and the reward value, performing value evaluation on the feature vector information to obtain participant contributions; and on the basis of the participant contributions, performing service allocation to the plurality of participants to obtain participant service allocation information. In addition, the present method further relates to blockchain technology, and the private institutional data can be stored in a blockchain.

Description

Reinforcement-learning-based business allocation method, apparatus, device, and storage medium

This application claims priority to the Chinese patent application filed with the Chinese Patent Office on November 19, 2020, with application number 202011298673.1 and the invention title "Reinforcement-learning-based business allocation method, apparatus, device and storage medium", the entire contents of which are incorporated into this application by reference.

Technical field

This application relates to the machine learning field of artificial intelligence, and in particular to a reinforcement-learning-based business allocation method, apparatus, device, and storage medium.

Background

With the development of Internet technology and business, the management, control, and invocation of business information has become a key focus of enterprises and institutions; for example, business is allocated to each sub-enterprise according to the business information provided by all the sub-enterprises belonging to an enterprise.

At present, in order to better allocate business among sub-enterprises according to the business information provided by all the sub-enterprises of an enterprise, the business contribution of each sub-enterprise is generally assessed according to the data volume of the business information it provides, and business is allocated to the sub-enterprises according to the assessed business contribution.

The inventor realized that, because the quality of the business information provided by different sub-enterprises varies, assessing each sub-enterprise's business contribution solely by the volume of business information it provides is not only unconvincing but also encourages a sub-enterprise, under the goal of maximizing its own interests, to obtain a large business contribution score with large amounts of low-quality business information. This has a strongly negative effect on the accuracy of the overall business allocation, resulting in low allocation accuracy.

Summary of the invention

This application provides a reinforcement-learning-based business allocation method, apparatus, device, and storage medium, which are used to improve the accuracy of business allocation.
The first aspect of this application provides a reinforcement-learning-based business allocation method, including:

obtaining feature vector information based on the institutional private data of multiple participating terminals, and predicting a selection probability for the feature vector information through a preset evaluator to obtain the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;

sampling the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal;

updating the model parameters of a preset business evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain an updated federated evaluation model, and calculating a reward value through the updated federated evaluation model, where the reward value indicates the cumulative return of the feature vector information to the updated federated evaluation model;

evaluating the value of the feature vector information through the preset evaluator and the reward value to obtain the participant contribution degree corresponding to each participating terminal; and

allocating business to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant business allocation information.
The second aspect of this application provides a reinforcement-learning-based business allocation device, including a memory, a processor, and a reinforcement-learning-based business allocation program stored in the memory and executable on the processor, where the processor implements the following steps when executing the program:

obtaining feature vector information based on the institutional private data of multiple participating terminals, and predicting a selection probability for the feature vector information through a preset evaluator to obtain the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;

sampling the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal;

updating the model parameters of a preset business evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain an updated federated evaluation model, and calculating a reward value through the updated federated evaluation model, where the reward value indicates the cumulative return of the feature vector information to the updated federated evaluation model;

evaluating the value of the feature vector information through the preset evaluator and the reward value to obtain the participant contribution degree corresponding to each participating terminal; and

allocating business to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant business allocation information.
A third aspect of the present application provides a computer-readable storage medium storing computer instructions that, when run on a computer, cause the computer to perform the following steps:

obtaining feature vector information based on institutional private data of multiple participating terminals, and performing selection probability prediction on the feature vector information through a preset evaluator to obtain the selection probability corresponding to each participating terminal, the preset evaluator being used to evaluate the gradient value of the feature vector information provided by each participating terminal;

sampling the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain sampling gradient information corresponding to each participating terminal;

updating model parameters of a preset service evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain an updated federated evaluation model, and calculating a reward value through the updated federated evaluation model, the reward value being used to indicate the cumulative return of the feature vector information to the updated federated evaluation model;

performing value evaluation on the feature vector information through the preset evaluator and the reward value to obtain a participant contribution degree corresponding to each participating terminal;

performing service allocation on the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant service allocation information.
A fourth aspect of the present application provides a reinforcement-learning-based service allocation apparatus, including:

a prediction module, configured to obtain feature vector information based on institutional private data of multiple participating terminals and perform selection probability prediction on the feature vector information through a preset evaluator to obtain the selection probability corresponding to each participating terminal, the preset evaluator being used to evaluate the gradient value of the feature vector information provided by each participating terminal;

a sampling module, configured to sample the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain sampling gradient information corresponding to each participating terminal;

an updating module, configured to update model parameters of a preset service evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain an updated federated evaluation model, and to calculate a reward value through the updated federated evaluation model, the reward value being used to indicate the cumulative return of the feature vector information to the updated federated evaluation model;

an evaluation module, configured to perform value evaluation on the feature vector information through the preset evaluator and the reward value to obtain a participant contribution degree corresponding to each participating terminal;

an allocation module, configured to perform service allocation on the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant service allocation information.
In the technical solutions provided by the present application, the feature vector information based on the institutional private data of the multiple participating terminals is selectively adopted according to the selection probabilities output by the preset evaluator; the model parameters of the preset service evaluation federated model are updated according to the sampling gradient information corresponding to each participating terminal; the value of the feature vector information is evaluated through the preset evaluator and the reward value; and services are allocated to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal. This reduces computational complexity, improves both the accuracy and the efficiency of participant contribution evaluation, and thereby improves the accuracy of service allocation.
Description of the Drawings

FIG. 1 is a schematic diagram of an embodiment of the reinforcement-learning-based service allocation method in an embodiment of the present application;

FIG. 2 is a schematic diagram of another embodiment of the reinforcement-learning-based service allocation method in an embodiment of the present application;

FIG. 3 is a schematic diagram of an embodiment of the reinforcement-learning-based service allocation apparatus in an embodiment of the present application;

FIG. 4 is a schematic diagram of another embodiment of the reinforcement-learning-based service allocation apparatus in an embodiment of the present application;

FIG. 5 is a schematic diagram of an embodiment of the reinforcement-learning-based service allocation device in an embodiment of the present application.
Detailed Description

The embodiments of the present application provide a reinforcement-learning-based service allocation method, apparatus, device, and storage medium, which improve the accuracy of service allocation.

For ease of understanding, the specific procedure of the embodiments of the present application is described below. Referring to FIG. 1, an embodiment of the reinforcement-learning-based service allocation method in the embodiments of the present application includes:
101. Obtain feature vector information based on institutional private data of multiple participating terminals, and perform selection probability prediction on the feature vector information through a preset evaluator to obtain the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal.

It can be understood that the execution subject of the present application may be a reinforcement-learning-based service allocation apparatus, or a terminal or a server, which is not specifically limited here. The embodiments of the present application are described with a server as the execution subject.

Here, the multiple participating terminals include, but are not limited to, terminals or servers corresponding to medical institutions, insurance institutions, and financial institutions, for example, terminals or servers corresponding to business units within financial institutions, banks, or other finance-related institutions or centers. The multiple participating terminals may correspond to institutions of the same type, for example, all terminals or servers of financial institutions; they may also correspond to institutions of different types, for example, with three participating terminals, one corresponding to a financial institution, another to an insurance institution, and the remaining one to a medical institution.

Institutional private data is the non-shared, encrypted private data of a participating terminal, for example, the private medical data of a medical institution or the private data of the various financial services of a financial institution. The feature vector information may be the gradient information of the model parameters obtained when a participating terminal performs gradient descent on a model using its institutional private data, or it may be the proportion that the participating terminal's institutional private data occupies when the preset service evaluation federated model processes a certain service over the private data of all institutions.

The server constructs the preset service evaluation federated model in advance based on the institutional private data of the multiple participating terminals. The server receives the feature vector information based on institutional private data sent by each of the multiple participating terminals and invokes the preset evaluator; the preset evaluator is a deep neural network that includes a fitness-proportional selection algorithm. Through the fitness-proportional selection algorithm, the server calculates the fitness of each piece of feature vector information; calculates, from these fitness values, the cumulative probability of each piece of feature vector information being inherited into the next-generation population; generates a uniformly distributed random number in the interval [0, 1]; selects feature vector information according to that random number; and normalizes the probability values of the multiple pieces of feature vector information selected for each participating terminal, obtaining the selection probability corresponding to each participating terminal. The preset evaluator is used to evaluate the gradient value provided by each participant, where the gradient value is the degree to which the gradient information contributes to model training, or the degree to which each participant's institutional private data contributes to a given service-processing direction of the preset service evaluation federated model over the private data of all institutions.
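As a rough illustration of the fitness-proportional selection step described above, the following Python sketch (an assumption for illustration only; the fitness function, the number of draws, and the normalization scheme are not fixed by the application) derives normalized selection probabilities from per-item fitness scores through roulette-wheel draws:

```python
import numpy as np

def roulette_selection_probs(fitness, num_draws=100, rng=None):
    """Fitness-proportional (roulette-wheel) selection sketch.

    fitness: 1-D array of non-negative fitness scores, one per piece of
    feature vector information. Returns the normalized frequency with which
    each piece is selected, used as its selection probability.
    """
    rng = rng or np.random.default_rng(0)
    p = fitness / fitness.sum()              # per-piece probability
    cum = np.cumsum(p)                       # cumulative probability bands
    counts = np.zeros_like(p)
    for _ in range(num_draws):
        u = rng.uniform(0.0, 1.0)            # uniform random number in [0, 1]
        idx = min(int(np.searchsorted(cum, u)), p.size - 1)
        counts[idx] += 1
    return counts / counts.sum()             # normalized selection probabilities
```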
102. Sample the feature vector information through the preset sampler and the selection probability corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal.

Here, the feature vector information is gradient information. The server may invoke a deterministic algorithm in the preset sampler to generate a pseudo-random number sequence in [0, 1] for the gradient information and randomly sample the sequence according to the selection probability corresponding to each participating terminal, obtaining the sampling gradient information corresponding to each participating terminal. The server may instead invoke the preset sampler to split all of the gradient information, or the gradient information corresponding to each participating terminal, into multiple shares in chronological order and extract the corresponding data from each share according to each participating terminal's selection probability, obtaining the sampling gradient information corresponding to each participating terminal. The server may also invoke the preset sampler to classify all of the gradient information into one or more preset categories, obtaining gradient information for multiple categories; randomly extract the corresponding gradient information from each category according to each participating terminal's selection probability; and combine the extracted gradient information across categories to obtain the sampling gradient information corresponding to each participating terminal.
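The chronological or category-based variant, drawing from each group in proportion to a participant's selection probability, could look like the following sketch (the group labels, array shapes, and rounding rule are illustrative assumptions):

```python
import numpy as np

def stratified_gradient_sample(gradients, groups, select_prob, rng=None):
    """Draw a fraction `select_prob` of gradient records from each group.

    gradients: array of shape (n, d), one gradient record per row.
    groups:    length-n array of group labels (e.g. time buckets or categories).
    """
    rng = rng or np.random.default_rng(0)
    picked = []
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        k = max(1, int(round(select_prob * idx.size)))  # at least one per group
        picked.append(rng.choice(idx, size=k, replace=False))
    picked = np.concatenate(picked)
    return gradients[picked]                            # sampled gradient information
```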
103. Update the model parameters of the preset service evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain the updated federated evaluation model, and calculate the reward value through the updated federated evaluation model, where the reward value is used to indicate the cumulative return of the feature vector information to the updated federated evaluation model.

The server normalizes the sampling gradient information corresponding to each participating terminal to obtain comprehensive gradient information, and continuously adjusts and updates the model parameters of the preset service evaluation federated model according to the comprehensive gradient information, thereby obtaining the updated federated evaluation model. Alternatively, the server may apply a preset attention mechanism to the sampling gradient information corresponding to each participating terminal to obtain attention gradient information for each participating terminal, and use that attention gradient information to continuously adjust and update the model parameters of the preset service evaluation federated model; this preserves the characteristics of all the sampling gradient information while allowing the update to be weighted toward the more important gradients, yielding the updated federated evaluation model.
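A minimal sketch of the aggregate-then-update path, assuming the per-participant sampled gradients have been flattened to a common shape (the uniform default weights and the learning rate are assumptions; attention scores could be passed in as `weights`):

```python
import numpy as np

def federated_update(params, sampled_grads, lr=0.01, weights=None):
    """Aggregate per-participant sampled gradients and apply one update.

    sampled_grads: array of shape (num_participants, d), one aggregated
    gradient per participating terminal. `weights` may carry attention
    scores; if omitted, participants are weighted uniformly.
    """
    n = sampled_grads.shape[0]
    w = np.full(n, 1.0 / n) if weights is None else weights / weights.sum()
    combined = (w[:, None] * sampled_grads).sum(axis=0)  # comprehensive gradient
    return params - lr * combined                        # updated model parameters
```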
When receiving the institutional private data of the multiple participating terminals, the server divides the institutional private data into an institutional-data verification set according to a preset ratio and, following the procedure for obtaining sampling gradient information described above, obtains the sampling gradient information corresponding to the institutional-data verification set, thereby obtaining the verification set data of the feature vector information. The server verifies the updated federated evaluation model through the verification set data to obtain a verified federated evaluation model, calculates the verification loss value of the verified federated evaluation model and the moving-average loss value over a preset time period, and computes the difference between the verification loss value and the moving-average loss value to obtain the reward value. The reward value may be the total return (which may include a revenue signal or an information gain) of the preset service evaluation federated model during training; it includes positive and negative reward values and is used to obtain an optimal policy and/or an optimal path. For example, the reward value may be the cumulative return of the optimal policy and/or optimal path for the preset service evaluation federated model to perform a certain service over the private data of all institutions, such as the cumulative return of the optimal policy for allocating insurance-order-amount data items.
104. Perform value evaluation on the feature vector information through the preset evaluator and the reward value to obtain the participant contribution degree corresponding to each participating terminal.

The server obtains service contribution influence factors and evaluates the service contribution value of the feature vector information through the preset evaluator, the reward value, and the service contribution influence factors, obtaining the participant contribution degree corresponding to each participating terminal. The participant contribution degree may be the contribution of each participating terminal's institutional private data to the training of the preset service evaluation federated model, or its contribution to the effectiveness and accuracy with which the preset service evaluation federated model processes the private data of all institutions. For example, when the preset service evaluation federated model makes accurate predictions of financial revenue data (that is, the institutional private data of the participating terminals) over a preset period, the participant contribution degree reflects the role each participant's institutional private data plays and its contribution to that accuracy. A service contribution influence factor is a factor that affects the calculation of the service contribution degree; taking as an example the service of training a federated-learning-based prediction model for insurance-order-amount data, the service contribution influence factors include the accuracy of the insurance orders, the organization of the information provided, and the importance of the service type.
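The application does not fix how the evaluator output, the reward value, and the influence factors are combined; purely as an assumed illustration, one plausible reading is a weighted normalization of the evaluator's value scores:

```python
import numpy as np

def contribution_degrees(value_scores, influence_factors):
    """Hypothetical combination of evaluator value scores and influence factors.

    value_scores:      per-participant value scores from the trained evaluator.
    influence_factors: per-participant service contribution influence factors.
    """
    raw = np.asarray(value_scores) * np.asarray(influence_factors)
    return raw / raw.sum()  # contribution degrees summing to 1
```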
105. Allocate services to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain the participant service allocation information.

The server may judge whether the participant contribution degree corresponding to each participating terminal is less than a preset target value. If so, that participating terminal is excluded and does not take part in the service allocation; if not, services are allocated to the multiple participating terminals according to their participant contribution degrees, obtaining the participant service allocation information. The participant service allocation information may be a revenue allocation, a reward allocation, and/or a priority setting based on the contribution degree of the institutional private data.

In this embodiment of the present application, the feature vector information based on the institutional private data of the multiple participating terminals is selectively adopted according to the selection probabilities output by the preset evaluator; the model parameters of the preset service evaluation federated model are updated according to the sampling gradient information corresponding to each participating terminal; the value of the feature vector information is evaluated through the preset evaluator and the reward value; and services are allocated to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal. This reduces computational complexity, improves both the accuracy and the efficiency of participant contribution evaluation, and thereby improves the accuracy of service allocation.
Referring to FIG. 2, another embodiment of the reinforcement-learning-based service allocation method in the embodiments of the present application includes:

201. Obtain feature vector information based on institutional private data of multiple participating terminals, and perform selection probability prediction on the feature vector information through a preset evaluator to obtain the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal.

Specifically, the server sends a model gradient calculation instruction to each of the multiple participating terminals, so that each participating terminal obtains its institutional private data according to the instruction. Each participating terminal trains a preset participant allocation model with its institutional private data and calculates the parameter gradients of the trained preset participant allocation model through a preset gradient descent algorithm, obtaining the feature vector information corresponding to that participating terminal; the institutional private data includes at least one of the private medical data of a medical institution, the private financial-service data of a financial institution, and the private insurance data of an insurance institution. The server receives the feature vector information corresponding to each participating terminal and performs a selection probability calculation on it through the gradient value function in the preset evaluator, obtaining the selection probability corresponding to each participating terminal.
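A minimal sketch of the participant-side step, local training followed by extraction of the parameter gradients that serve as feature vector information, might look as follows (the model, the MSE loss, and the tensor shapes are illustrative assumptions):

```python
import torch
from torch import nn

def local_feature_vector(private_x, private_y, model):
    """Participant side: one gradient-descent pass on institutional private
    data; the resulting flattened parameter gradient is the feature vector
    information sent to the server."""
    loss_fn = nn.MSELoss()
    model.zero_grad()
    loss = loss_fn(model(private_x), private_y)
    loss.backward()                                        # populate parameter gradients
    grads = [p.grad.detach().flatten() for p in model.parameters()]
    return torch.cat(grads)                                # flattened gradient delta
```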
For example, taking institutional private data that is private financial-service data: the multiple participating terminals are terminal 1 corresponding to financial institution 1, terminal 2 corresponding to financial institution 2, and terminal 3 corresponding to financial institution 3. The server sends a model gradient calculation instruction to each of terminal 1, terminal 2, and terminal 3. Terminal 1 extracts the corresponding private financial-service data 1 from its database according to the instruction, inputs private financial-service data 1 into preset participant allocation model 1, performs service allocation processing (i.e., training) on the data through preset participant allocation model 1, and calculates the parameter gradients of the trained model through the preset gradient descent algorithm, obtaining feature vector information 1 corresponding to terminal 1. Terminal 1 sends feature vector information 1 to the server; after receiving it, the server inputs the feature vector information into the gradient value function and computes selection probability 1 corresponding to terminal 1. The gradient value function is $w = h_\phi(\delta)$, where $w$ denotes the selection probability, $h_\phi(\delta)$ is the gradient value function, $\delta$ denotes the feature vector information, and $\phi$ denotes the trainable parameters. Selection probability 2 corresponding to terminal 2 and selection probability 3 corresponding to terminal 3 are obtained in the same way.
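The form of the gradient value function $h_\phi$ is not further specified; a common choice, shown here as an assumption, is a small multilayer perceptron with a sigmoid output so that $w$ falls in (0, 1):

```python
import torch
from torch import nn

def make_evaluator(grad_dim, hidden=64):
    """Sketch of the preset evaluator h_phi: maps a flattened gradient
    (feature vector information) to a selection probability in (0, 1)."""
    return nn.Sequential(
        nn.Linear(grad_dim, hidden),
        nn.ReLU(),
        nn.Linear(hidden, 1),
        nn.Sigmoid(),
    )

# Usage sketch: one selection probability per participating terminal.
# evaluator = make_evaluator(grad_dim=delta.numel())
# w = evaluator(delta.unsqueeze(0)).squeeze()  # w = h_phi(delta)
```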
202. Through a multinomial-distribution-based algorithm in the preset sampler, calculate the feature vector information according to the selection probability corresponding to each participating terminal to obtain the selection vector corresponding to each participating terminal, where the feature vector information includes the model gradient information of each participating terminal.

Through the probability formula of the multinomial-distribution-based algorithm in the preset sampler, the server calculates, for the model gradient information of each participating terminal and according to its selection probability, the selection vector $\zeta = [\zeta_1, \zeta_2, \zeta_3, \ldots, \zeta_n]$, where $\zeta_i \in \{0, 1\}$ and $P(\zeta_i = 1) = w$, with $P$ denoting the probability value and $w$ the selection probability.
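Drawing each component of the selection vector as an independent Bernoulli variable with success probability $w$ can be sketched as follows (the use of torch and the vector length are assumptions):

```python
import torch

def draw_selection_vector(w, n):
    """Selection vector: zeta_i in {0, 1} with P(zeta_i = 1) = w.

    w: selection probability of one participating terminal.
    n: number of components of the selection vector.
    """
    return torch.bernoulli(torch.full((n,), float(w)))  # e.g. tensor([1., 0., 1., ...])
```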
203. When the selection vector corresponding to a participating terminal takes the preset value, sample the model gradient information of that participating terminal according to its selection vector to obtain the sampling gradient information corresponding to each participating terminal.

In this embodiment, the preset value is preferably 1. The server judges whether the selection vector corresponding to each participating terminal takes the preset value (that is, $\zeta_i = 1$). If so, it performs random, systematic, or stratified sampling of that participating terminal's model gradient information according to its selection vector, obtaining the sampling gradient information corresponding to each participating terminal. If not, the model gradient information of that participating terminal is not sampled; instead, the steps for obtaining the selection vector corresponding to each participating terminal are executed in a loop until the newly obtained selection vector takes the preset value ($\zeta_i = 1$), after which the model gradient information of each participating terminal is sampled according to the re-obtained selection vectors, obtaining the sampling gradient information corresponding to each participating terminal.
204. Update the model parameters of the preset service evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain the updated federated evaluation model, and calculate the reward value through the updated federated evaluation model, where the reward value is used to indicate the cumulative return of the feature vector information to the updated federated evaluation model.

The server normalizes the sampling gradient information corresponding to each participating terminal to obtain comprehensive gradient information, and continuously adjusts and updates the model parameters of the preset service evaluation federated model according to the comprehensive gradient information, thereby obtaining the updated federated evaluation model. Alternatively, the server may apply a preset attention mechanism to the sampling gradient information corresponding to each participating terminal to obtain attention gradient information for each participating terminal, and use that attention gradient information to continuously adjust and update the model parameters of the preset service evaluation federated model, obtaining the updated federated evaluation model; this preserves the characteristics of all the sampling gradient information while weighting the update appropriately.

Specifically, the server obtains the verification set data of the feature vector information, verifies the verification set data through the updated federated model to obtain a verification result, calculates the verification loss value of the verification result and the moving-average loss value over a preset time period, and computes the difference between the verification loss value and the moving-average loss value to obtain the reward value.
When receiving the feature vector information of the institutional private data of the multiple participating terminals, the server divides the feature vector information of the institutional private data into verification set data according to a preset ratio, verifies the verification set data through the updated federated model to obtain the verification result, and calculates the verification loss value of the verification result through a preset loss-value calculation formula, which takes the form

$$ l_v = \frac{1}{M} \sum_{k=1}^{M} \mathcal{L}\big(f_\theta(x_k^v),\, y_k^v\big), $$

where $l_v$ denotes the verification loss value; the superscript $v$ indicates that the data belong to the verification set, i.e., the verification result; $M$ denotes the number of data items in the verification set data; $k$ indexes the $k$-th data item; $\mathcal{L}$ denotes the required loss function, which may be the mean square error (MSE) function, the root-mean-square error (RMSE) function, the cross-entropy loss function, or the like; $f_\theta$ denotes the updated federated evaluation model; $x$ denotes the input data, i.e., the verification set data; and $y$ denotes the corresponding label of the verification set data. The server then calculates the moving-average loss value over the preset time period through a preset formula, which takes the form

$$ \Delta = \frac{1}{T} \sum_{t' = t - T}^{t - 1} l_v^{(t')}, $$

where $T$ denotes the moving-average window length of the preset time period and $\Delta$ denotes the moving-average benchmark of the preset time period. The reward value is obtained by subtracting the moving-average loss value from the verification loss value, i.e., $r = l_v - \Delta$.
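A small sketch of this reward computation, under the assumption that a history of per-round verification loss values is kept:

```python
import numpy as np

def reward_value(loss_history, T):
    """Reward r = l_v - Delta: current verification loss minus the
    moving-average benchmark over the last T rounds (positive and
    negative rewards both occur)."""
    l_v = loss_history[-1]                    # current verification loss
    delta = np.mean(loss_history[-T - 1:-1])  # moving average over window T
    return l_v - delta
```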
205. Perform value evaluation on the feature vector information through the preset evaluator and the reward value to obtain the participant contribution degree corresponding to each participating terminal.

Specifically, the server performs a loss-function calculation on the reward value and the selection vectors through a preset Monte Carlo policy gradient algorithm to obtain the loss function of the preset evaluator; trains the preset evaluator through this loss function until the loss function converges, obtaining a target evaluator; and performs value evaluation on the feature vector information through the target evaluator, obtaining the participant contribution degree corresponding to each participating terminal.
The server performs the loss-function calculation on the reward value and the selection vectors through the calculation formula of the Monte Carlo policy gradient (REINFORCE) algorithm to obtain the loss function of the preset evaluator; the formula takes the form

$$ l_h(\phi) = -\frac{r}{N} \sum_{i=1}^{N} \Big[ s_i \log h_\phi(\delta_i) + (1 - s_i) \log\big(1 - h_\phi(\delta_i)\big) \Big], $$

where $r$ denotes the reward value, $N$ denotes the number of selection vectors, $i$ indexes the $i$-th selection vector, $s_i$ denotes the selection vector, $\delta_i$ denotes the sampling gradient information, and $h_\phi(\delta_i)$ denotes the gradient value function. The server updates the trainable parameters $\phi$ of the preset evaluator through the loss function of the preset evaluator, thereby training the preset evaluator and obtaining the target evaluator; the update formula for the trainable parameters $\phi$ is

$$ \phi \leftarrow \phi - \beta \nabla_\phi l_h, $$

where $\beta$ denotes the learning rate and $l_h$ denotes the loss function of the preset evaluator. Through the target evaluator, the server performs value evaluation on the feature vector information and obtains the participant contribution degree corresponding to each participating terminal.
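A compact sketch of one REINFORCE update of the evaluator (the choice of optimizer, the epsilon clamp for numerical safety, and the tensor shapes are assumptions):

```python
import torch

def reinforce_step(evaluator, optimizer, deltas, s, r, eps=1e-8):
    """One REINFORCE update of the evaluator's trainable parameters phi.

    deltas: (N, d) sampling gradient information.
    s:      (N,) selection vector entries in {0, 1}.
    r:      scalar reward value.
    """
    w = evaluator(deltas).squeeze(-1).clamp(eps, 1 - eps)
    log_pi = s * torch.log(w) + (1 - s) * torch.log(1 - w)
    loss = -r * log_pi.mean()  # l_h(phi), minimized by the optimizer
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()           # with plain SGD this is phi <- phi - beta * grad(l_h)
    return loss.item()
```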
206. Allocate services to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain the participant service allocation information.

Specifically, the server obtains the contribution share of the participant contribution degree corresponding to each participating terminal and judges whether the contribution share is greater than a preset threshold. If the contribution share is greater than the preset threshold, a preset allocation policy is invoked to allocate services to the multiple participating terminals, obtaining the participant service allocation information; if the contribution share is less than or equal to the preset threshold, services are allocated to the multiple participating terminals according to the contribution share, obtaining the participant service allocation information.

Here, the server judges whether the participant contribution degree corresponding to each participating terminal is less than a preset contribution degree. If so, the participant contribution degrees below the preset contribution degree are removed, and the ratio of each remaining participating terminal's contribution degree to the sum of the remaining participant contribution degrees is calculated, yielding the contribution share of each participating terminal; if not, the ratio of each participating terminal's contribution degree to the total participant contribution degree is calculated, yielding the contribution share of each participating terminal.

For example, take the service allocation to be a payment allocation, with the participant service allocation information being participant payment allocation information; the payment allocation corresponds to the contribution of each participating terminal's feature vector information to the update of the updated federated evaluation model. Suppose the participant contribution degrees are 0.03 (participating terminal 1), 0.24 (participating terminal 2), 0.40 (participating terminal 3), and 0.33 (participating terminal 4); the preset contribution degree is 0.20; the preset threshold is 0.40; and the total payment amount is 1,000,000. The value 0.03 is removed, giving participant service allocation information 1 (participating terminal 1 receives a payment of 0). The contribution shares corresponding to 0.24 (terminal 2), 0.40 (terminal 3), and 0.33 (terminal 4) are then 0.247, 0.412, and 0.340, respectively. Only 0.412 is greater than the preset threshold of 0.40, so the preset allocation policy is invoked for participating terminal 3 (1,000,000 × 0.412 + 100,000 = 512,000), giving participant service allocation information 3 (participating terminal 3 receives a payment of 512,000); this preset allocation policy allocates an extra 100,000 in addition to the amount corresponding to the contribution share of the total payment. Participating terminal 2 and participating terminal 4 are then allocated according to their contribution shares (terminal 2: 0.247 × 1,000,000 = 247,000; terminal 4: 0.340 × 1,000,000 = 340,000), giving participant service allocation information 2 (participating terminal 2 receives a payment of 247,000) and participant service allocation information 4 (participating terminal 4 receives a payment of 340,000).
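The allocation arithmetic of this example can be reproduced with a short sketch (function and variable names are illustrative; the bonus and thresholds mirror the example above):

```python
def allocate_payments(contribs, total=1_000_000, preset_contrib=0.20,
                      threshold=0.40, bonus=100_000):
    """Threshold-and-share payment allocation from the worked example."""
    kept = {k: v for k, v in contribs.items() if v >= preset_contrib}
    s = sum(kept.values())
    shares = {k: v / s for k, v in kept.items()}  # contribution shares
    payments = {k: 0.0 for k in contribs}         # removed terminals receive 0
    for k, share in shares.items():
        payments[k] = share * total + (bonus if share > threshold else 0.0)
    return payments

# allocate_payments({"t1": 0.03, "t2": 0.24, "t3": 0.40, "t4": 0.33})
# -> t1: 0, t2: ~247,000, t3: ~512,000 (share 0.412 plus bonus), t4: ~340,000
```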
Specifically, after allocating services to the multiple participating terminals according to their participant contribution degrees and obtaining the participant service allocation information, the server also obtains anomaly information about the participant service allocation information, updates the participant service allocation information according to the anomaly information, and optimizes the strategy for determining the selection probability corresponding to each participating terminal.

The server encrypts the participant service allocation information and sends it to an audit terminal, which decrypts and audits it. If the participant service allocation information contains anomaly information, the anomaly information is fed back to the server, and the server matches a corresponding optimization mechanism to it; the optimization mechanism includes an optimization algorithm, an optimization strategy, an optimization execution process, and an optimization execution script. Through the optimization mechanism, the server corrects (updates) the anomaly information in the participant service allocation information and optimizes the strategy for determining the selection probability corresponding to each participating terminal, where the determination strategy includes the model selection, the model calculation, and the selection of feature vector information for each participating terminal's selection probability. Updating the participant service allocation information according to the anomaly information and optimizing the selection-probability determination strategy improves the computational convenience, accuracy, and efficiency of the execution of the reinforcement-learning-based service allocation method, thereby improving the accuracy of service allocation.

In this embodiment of the present application, the feature vector information based on the institutional private data of the multiple participating terminals is selectively adopted according to the selection probabilities output by the preset evaluator; the model parameters of the preset service evaluation federated model are updated according to the sampling gradient information corresponding to each participating terminal; the value of the feature vector information is evaluated through the preset evaluator and the reward value; and services are allocated to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal. This reduces computational complexity, improves both the accuracy and the efficiency of participant contribution evaluation, and thereby improves the accuracy of service allocation.
The reinforcement-learning-based service allocation method in the embodiments of the present application has been described above; the reinforcement-learning-based service allocation apparatus in the embodiments of the present application is described below. Referring to FIG. 3, an embodiment of the reinforcement-learning-based service allocation apparatus in the embodiments of the present application includes:

a prediction module 301, configured to obtain feature vector information based on the institutional private data of multiple participating terminals and perform selection probability prediction on the feature vector information through a preset evaluator to obtain the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;

a sampling module 302, configured to sample the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal;

an updating module 303, configured to update the model parameters of the preset service evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain the updated federated evaluation model, and to calculate the reward value through the updated federated evaluation model, where the reward value is used to indicate the cumulative return of the feature vector information to the updated federated evaluation model;

an evaluation module 304, configured to perform value evaluation on the feature vector information through the preset evaluator and the reward value to obtain the participant contribution degree corresponding to each participating terminal;

an allocation module 305, configured to allocate services to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain the participant service allocation information.

The function implementation of each module in the above reinforcement-learning-based service allocation apparatus corresponds to the steps in the embodiments of the reinforcement-learning-based service allocation method described above, and their functions and implementation processes are not repeated here.

In this embodiment of the present application, the feature vector information based on the institutional private data of the multiple participating terminals is selectively adopted according to the selection probabilities output by the preset evaluator; the model parameters of the preset service evaluation federated model are updated according to the sampling gradient information corresponding to each participating terminal; the value of the feature vector information is evaluated through the preset evaluator and the reward value; and services are allocated to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal. This reduces computational complexity, improves both the accuracy and the efficiency of participant contribution evaluation, and thereby improves the accuracy of service allocation.
Referring to FIG. 4, another embodiment of the reinforcement-learning-based service allocation apparatus in the embodiments of the present application includes:

a prediction module 301, configured to obtain feature vector information based on the institutional private data of multiple participating terminals and perform selection probability prediction on the feature vector information through a preset evaluator to obtain the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;

a sampling module 302, configured to sample the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal;

where the sampling module 302 specifically includes:

a calculation unit 3021, configured to calculate the feature vector information through a multinomial-distribution-based algorithm in the preset sampler and according to the selection probability corresponding to each participating terminal, obtaining the selection vector corresponding to each participating terminal, where the feature vector information includes the model gradient information of each participating terminal;

a sampling unit 3022, configured to, when the selection vector corresponding to a participating terminal takes the preset value, sample the model gradient information of that participating terminal according to its selection vector to obtain the sampling gradient information corresponding to each participating terminal;

an updating module 303, configured to update the model parameters of the preset service evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain the updated federated evaluation model, and to calculate the reward value through the updated federated evaluation model, where the reward value is used to indicate the cumulative return of the feature vector information to the updated federated evaluation model;

an evaluation module 304, configured to perform value evaluation on the feature vector information through the preset evaluator and the reward value to obtain the participant contribution degree corresponding to each participating terminal;

an allocation module 305, configured to allocate services to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain the participant service allocation information.
Optionally, the evaluation module 304 may be further specifically configured to:

perform a loss-function calculation on the reward value and the selection vectors through a preset Monte Carlo policy gradient algorithm to obtain the loss function of the preset evaluator;

train the preset evaluator through the loss function of the preset evaluator until the loss function converges, obtaining the target evaluator;

perform value evaluation on the feature vector information through the target evaluator to obtain the participant contribution degree corresponding to each participating terminal.

Optionally, the prediction module 301 may be further specifically configured to:

send a model gradient calculation instruction to each of the multiple participating terminals, so that each participating terminal obtains its institutional private data according to the model gradient calculation instruction;

train the preset participant allocation model through the institutional private data of the participating terminal, and calculate the parameter gradients of the trained preset participant allocation model through the preset gradient descent algorithm to obtain the feature vector information corresponding to each participating terminal, where the institutional private data includes at least one of the private medical data of a medical institution, the private financial-service data of a financial institution, and the private insurance data of an insurance institution;

receive the feature vector information corresponding to each participating terminal sent by each participating terminal, and perform a selection probability calculation on the feature vector information corresponding to each participating terminal through the gradient value function in the preset evaluator to obtain the selection probability corresponding to each participating terminal.
Optionally, the updating module 303 may be further specifically configured to:

obtain the verification set data of the feature vector information, and verify the verification set data through the updated federated model to obtain the verification result;

calculate the verification loss value of the verification result and the moving-average loss value over the preset time period;

compute the difference between the verification loss value and the moving-average loss value to obtain the reward value.

Optionally, the allocation module 305 may be further specifically configured to:

obtain the contribution share of the participant contribution degree corresponding to each participating terminal, and judge whether the contribution share is greater than the preset threshold;

if the contribution share is greater than the preset threshold, invoke the preset allocation policy to allocate services to the multiple participating terminals, obtaining the participant service allocation information;

if the contribution share is less than or equal to the preset threshold, allocate services to the multiple participating terminals according to the contribution share, obtaining the participant service allocation information.
可选的,基于强化学习的业务分配装置,还包括:Optionally, a service distribution device based on reinforcement learning also includes:
更新优化模块306,用于获取参与者业务分配信息的异常信息,根据异常信息对参与者业务分配信息进行更新,并对各参与端对应的选择概率的确定策略进行优化。The update optimization module 306 is used to obtain abnormal information of the participant's business allocation information, update the participant's business allocation information according to the abnormal information, and optimize the determination strategy of the selection probability corresponding to each participant.
上述基于强化学习的业务分配装置中各模块和各单元的功能实现与上述基于强化学习的业务分配方法实施例中各步骤相对应,其功能和实现过程在此处不再一一赘述。The function realization of each module and each unit in the above-mentioned reinforcement learning-based service distribution device corresponds to each step in the above-mentioned embodiment of the above-mentioned service distribution method based on reinforcement learning, and the functions and realization processes thereof will not be repeated here.
In the embodiments of the present application, the feature vector information derived from the institutional private data of multiple participating terminals is selectively adopted according to the selection probabilities output by the preset evaluator; the model parameters of the preset service evaluation federated model are updated according to the sampling gradient information corresponding to each participating terminal; the feature vector information is value-evaluated through the preset evaluator and the reward value; and service allocation is performed on the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal. This reduces computational complexity, improves the accuracy and efficiency of participant contribution evaluation, and thereby improves the accuracy of service allocation.
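The selective-sampling and evaluator-training steps summarized here (detailed in claims 2 and 3 below) can be sketched as follows. Modeling each participating terminal's selection as an independent Bernoulli draw from its selection probability is a simplification of the multinomial-distribution-based sampling named in the claims, and the REINFORCE-style loss is one common form of the Monte Carlo policy gradient; both are assumptions about one plausible realization.

import torch

def sample_selection_vector(selection_probs: torch.Tensor) -> torch.Tensor:
    """Preset-sampler step: draw a 0/1 selection vector; a terminal's
    model gradient is adopted only when its entry equals 1."""
    return torch.bernoulli(selection_probs)

def evaluator_loss(selection_probs: torch.Tensor,
                   selection: torch.Tensor,
                   reward: float) -> torch.Tensor:
    """REINFORCE-style (Monte Carlo policy gradient) loss: scale the
    log-likelihood of the sampled selection vector by the reward, so
    selections that improved the federated model become more likely."""
    eps = 1e-8
    log_likelihood = (selection * torch.log(selection_probs + eps)
                      + (1 - selection)
                      * torch.log(1 - selection_probs + eps)).sum()
    return -reward * log_likelihood

Training the preset evaluator by minimizing this loss until convergence would yield the target evaluator used for the contribution evaluation.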
Figures 3 and 4 above describe the reinforcement-learning-based service allocation apparatus in the embodiments of the present application in detail from the perspective of modular functional entities; the reinforcement-learning-based service allocation device in the embodiments of the present application is described in detail below from the perspective of hardware processing.
FIG. 5 is a schematic structural diagram of a reinforcement-learning-based service allocation device provided by an embodiment of the present application. The reinforcement-learning-based service allocation device 500 may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPU) 510 (for example, one or more processors), a memory 520, and one or more storage media 530 (for example, one or more mass storage devices) storing application programs 533 or data 532. The memory 520 and the storage medium 530 may provide transient or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the reinforcement-learning-based service allocation device 500. Furthermore, the processor 510 may be configured to communicate with the storage medium 530 and to execute the series of instruction operations in the storage medium 530 on the reinforcement-learning-based service allocation device 500.
The reinforcement-learning-based service allocation device 500 may further include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, or FreeBSD. Those skilled in the art will understand that the device structure shown in FIG. 5 does not constitute a limitation on the reinforcement-learning-based service allocation device, which may include more or fewer components than shown, a combination of certain components, or a different arrangement of components.
The present application further provides a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to perform the steps of the reinforcement-learning-based service allocation method.
Furthermore, the computer-readable storage medium may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function, and the like, and the data storage area may store data created from the use of blockchain nodes, and the like.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association through cryptographic methods, each data block containing a batch of network transaction information used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may comprise an underlying blockchain platform, a platform product service layer, and an application service layer.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or any other medium capable of storing program code.
The above embodiments are only used to illustrate the technical solutions of the present application and not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

  1. A reinforcement-learning-based service allocation method, wherein the reinforcement-learning-based service allocation method comprises:
    obtaining feature vector information based on institutional private data of multiple participating terminals, and performing selection probability prediction on the feature vector information through a preset evaluator to obtain a selection probability corresponding to each participating terminal, wherein the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;
    sampling the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain sampling gradient information corresponding to each participating terminal;
    updating model parameters of a preset service evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain an updated federated evaluation model, and calculating a reward value through the updated federated evaluation model, wherein the reward value is used to indicate the cumulative return value of the feature vector information to the updated federated evaluation model;
    performing value evaluation on the feature vector information through the preset evaluator and the reward value to obtain a participant contribution degree corresponding to each participating terminal; and
    performing service allocation on the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant service allocation information.
  2. The reinforcement-learning-based service allocation method according to claim 1, wherein the sampling the feature vector information through the preset sampler and the selection probability corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal comprises:
    calculating the feature vector information through a multinomial-distribution-based algorithm in the preset sampler, according to the selection probability corresponding to each participating terminal, to obtain a selection vector corresponding to each participating terminal, wherein the feature vector information includes model gradient information of each participating terminal; and
    when the selection vector corresponding to each participating terminal is a preset value, sampling the model gradient information of each participating terminal according to the selection vector corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal.
  3. The reinforcement-learning-based service allocation method according to claim 2, wherein the performing value evaluation on the feature vector information through the preset evaluator and the reward value to obtain the participant contribution degree corresponding to each participating terminal comprises:
    performing loss function calculation on the reward value and the selection vector through a preset Monte Carlo policy gradient algorithm to obtain a loss function of the preset evaluator;
    training the preset evaluator through the loss function of the preset evaluator until the loss function converges, to obtain a target evaluator; and
    performing value evaluation on the feature vector information through the target evaluator to obtain the participant contribution degree corresponding to each participating terminal.
  4. The reinforcement-learning-based service allocation method according to claim 1, wherein the obtaining feature vector information based on institutional private data of multiple participating terminals, and performing selection probability prediction on the feature vector information through a preset evaluator to obtain a selection probability corresponding to each participating terminal comprises:
    sending model gradient calculation instructions to the multiple participating terminals respectively, so that each participating terminal obtains institutional private data of the participating terminal according to the model gradient calculation instruction;
    training a preset participant allocation model through the institutional private data of the participating terminals, and calculating, through a preset gradient descent algorithm, parameter gradients of the trained preset participant allocation model to obtain the feature vector information corresponding to each participating terminal, wherein the institutional private data includes at least one of medical private data of medical institutions, financial business private data of financial institutions, and insurance private data of insurance institutions; and
    receiving the feature vector information corresponding to each participating terminal sent by each participating terminal, and performing selection probability calculation on the feature vector information corresponding to each participating terminal through a gradient value function in the preset evaluator to obtain the selection probability corresponding to each participating terminal.
  5. The reinforcement-learning-based service allocation method according to claim 1, wherein the calculating a reward value through the updated federated evaluation model comprises:
    obtaining validation set data of the feature vector information, and validating the validation set data through the updated federated evaluation model to obtain a validation result;
    calculating a validation loss value of the validation result, and a moving average loss value over a preset period; and
    performing difference calculation on the validation loss value and the moving average loss value to obtain the reward value.
  6. The reinforcement-learning-based service allocation method according to claim 1, wherein the performing service allocation on the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant service allocation information comprises:
    obtaining a contribution proportion of the participant contribution degree corresponding to each participating terminal, and determining whether the contribution proportion is greater than a preset threshold;
    if the contribution proportion is greater than the preset threshold, invoking a preset allocation policy to perform service allocation on the multiple participating terminals, to obtain first participant service allocation information; and
    if the contribution proportion is less than or equal to the preset threshold, performing service allocation on the multiple participating terminals according to the contribution proportion, to obtain second participant service allocation information.
  7. The reinforcement-learning-based service allocation method according to any one of claims 1-6, wherein after the performing service allocation on the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant service allocation information, the method further comprises:
    obtaining anomaly information of the participant service allocation information, updating the participant service allocation information according to the anomaly information, and optimizing a policy for determining the selection probability corresponding to each participating terminal.
  8. A reinforcement-learning-based service allocation device, wherein the reinforcement-learning-based service allocation device comprises: a memory, a processor, and a reinforcement-learning-based service allocation program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the reinforcement-learning-based service allocation program:
    obtaining feature vector information based on institutional private data of multiple participating terminals, and performing selection probability prediction on the feature vector information through a preset evaluator to obtain a selection probability corresponding to each participating terminal, wherein the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;
    sampling the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain sampling gradient information corresponding to each participating terminal;
    updating model parameters of a preset service evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain an updated federated evaluation model, and calculating a reward value through the updated federated evaluation model, wherein the reward value is used to indicate the cumulative return value of the feature vector information to the updated federated evaluation model;
    performing value evaluation on the feature vector information through the preset evaluator and the reward value to obtain a participant contribution degree corresponding to each participating terminal; and
    performing service allocation on the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant service allocation information.
  9. The reinforcement-learning-based service allocation device according to claim 8, wherein when the processor executes the reinforcement-learning-based service allocation program to implement the sampling of the feature vector information through the preset sampler and the selection probability corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal, the following steps are included:
    calculating the feature vector information through a multinomial-distribution-based algorithm in the preset sampler, according to the selection probability corresponding to each participating terminal, to obtain a selection vector corresponding to each participating terminal, wherein the feature vector information includes model gradient information of each participating terminal; and
    when the selection vector corresponding to each participating terminal is a preset value, sampling the model gradient information of each participating terminal according to the selection vector corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal.
  10. The reinforcement-learning-based service allocation device according to claim 9, wherein when the processor executes the reinforcement-learning-based service allocation program to implement the performing of value evaluation on the feature vector information through the preset evaluator and the reward value to obtain the participant contribution degree corresponding to each participating terminal, the following steps are included:
    performing loss function calculation on the reward value and the selection vector through a preset Monte Carlo policy gradient algorithm to obtain a loss function of the preset evaluator;
    training the preset evaluator through the loss function of the preset evaluator until the loss function converges, to obtain a target evaluator; and
    performing value evaluation on the feature vector information through the target evaluator to obtain the participant contribution degree corresponding to each participating terminal.
  11. The reinforcement-learning-based service allocation device according to claim 8, wherein when the processor executes the reinforcement-learning-based service allocation program to implement the obtaining of feature vector information based on institutional private data of multiple participating terminals, and the performing of selection probability prediction on the feature vector information through the preset evaluator to obtain the selection probability corresponding to each participating terminal, the following steps are included:
    sending model gradient calculation instructions to the multiple participating terminals respectively, so that each participating terminal obtains institutional private data of the participating terminal according to the model gradient calculation instruction;
    training a preset participant allocation model through the institutional private data of the participating terminals, and calculating, through a preset gradient descent algorithm, parameter gradients of the trained preset participant allocation model to obtain the feature vector information corresponding to each participating terminal, wherein the institutional private data includes at least one of medical private data of medical institutions, financial business private data of financial institutions, and insurance private data of insurance institutions; and
    receiving the feature vector information corresponding to each participating terminal sent by each participating terminal, and performing selection probability calculation on the feature vector information corresponding to each participating terminal through a gradient value function in the preset evaluator to obtain the selection probability corresponding to each participating terminal.
  12. The reinforcement-learning-based service allocation device according to claim 8, wherein when the processor executes the reinforcement-learning-based service allocation program to implement the calculating of the reward value through the updated federated evaluation model, the following steps are included:
    obtaining validation set data of the feature vector information, and validating the validation set data through the updated federated evaluation model to obtain a validation result;
    calculating a validation loss value of the validation result, and a moving average loss value over a preset period; and
    performing difference calculation on the validation loss value and the moving average loss value to obtain the reward value.
  13. The reinforcement-learning-based service allocation device according to claim 8, wherein when the processor executes the reinforcement-learning-based service allocation program to implement the performing of service allocation on the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant service allocation information, the following steps are included:
    obtaining a contribution proportion of the participant contribution degree corresponding to each participating terminal, and determining whether the contribution proportion is greater than a preset threshold;
    if the contribution proportion is greater than the preset threshold, invoking a preset allocation policy to perform service allocation on the multiple participating terminals, to obtain first participant service allocation information; and
    if the contribution proportion is less than or equal to the preset threshold, performing service allocation on the multiple participating terminals according to the contribution proportion, to obtain second participant service allocation information.
  14. The reinforcement-learning-based service allocation device according to any one of claims 8-13, wherein after the processor executes the reinforcement-learning-based service allocation program to implement the performing of service allocation on the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant service allocation information, the following step is included:
    obtaining anomaly information of the participant service allocation information, updating the participant service allocation information according to the anomaly information, and optimizing a policy for determining the selection probability corresponding to each participating terminal.
  15. A computer-readable storage medium storing computer instructions, wherein, when the computer instructions are run on a computer, the computer is caused to perform the following steps:
    obtaining feature vector information based on institutional private data of multiple participating terminals, and performing selection probability prediction on the feature vector information through a preset evaluator to obtain a selection probability corresponding to each participating terminal, wherein the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;
    sampling the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain sampling gradient information corresponding to each participating terminal;
    updating model parameters of a preset service evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain an updated federated evaluation model, and calculating a reward value through the updated federated evaluation model, wherein the reward value is used to indicate the cumulative return value of the feature vector information to the updated federated evaluation model;
    performing value evaluation on the feature vector information through the preset evaluator and the reward value to obtain a participant contribution degree corresponding to each participating terminal; and
    performing service allocation on the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant service allocation information.
  16. The computer-readable storage medium according to claim 15, wherein when the computer instructions are executed to implement the sampling of the feature vector information through the preset sampler and the selection probability corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal, the following steps are included:
    calculating the feature vector information through a multinomial-distribution-based algorithm in the preset sampler, according to the selection probability corresponding to each participating terminal, to obtain a selection vector corresponding to each participating terminal, wherein the feature vector information includes model gradient information of each participating terminal; and
    when the selection vector corresponding to each participating terminal is a preset value, sampling the model gradient information of each participating terminal according to the selection vector corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal.
  17. The computer-readable storage medium according to claim 16, wherein when the computer instructions are executed to implement the performing of value evaluation on the feature vector information through the preset evaluator and the reward value to obtain the participant contribution degree corresponding to each participating terminal, the following steps are included:
    performing loss function calculation on the reward value and the selection vector through a preset Monte Carlo policy gradient algorithm to obtain a loss function of the preset evaluator;
    training the preset evaluator through the loss function of the preset evaluator until the loss function converges, to obtain a target evaluator; and
    performing value evaluation on the feature vector information through the target evaluator to obtain the participant contribution degree corresponding to each participating terminal.
  18. The computer-readable storage medium according to claim 15, wherein when the computer instructions are executed to implement the obtaining of feature vector information based on institutional private data of multiple participating terminals, and the performing of selection probability prediction on the feature vector information through the preset evaluator to obtain the selection probability corresponding to each participating terminal, the following steps are included:
    sending model gradient calculation instructions to the multiple participating terminals respectively, so that each participating terminal obtains institutional private data of the participating terminal according to the model gradient calculation instruction;
    training a preset participant allocation model through the institutional private data of the participating terminals, and calculating, through a preset gradient descent algorithm, parameter gradients of the trained preset participant allocation model to obtain the feature vector information corresponding to each participating terminal, wherein the institutional private data includes at least one of medical private data of medical institutions, financial business private data of financial institutions, and insurance private data of insurance institutions; and
    receiving the feature vector information corresponding to each participating terminal sent by each participating terminal, and performing selection probability calculation on the feature vector information corresponding to each participating terminal through a gradient value function in the preset evaluator to obtain the selection probability corresponding to each participating terminal.
  19. The computer-readable storage medium according to claim 15, wherein when the computer instructions are executed to implement the calculating of the reward value through the updated federated evaluation model, the following steps are included:
    obtaining validation set data of the feature vector information, and validating the validation set data through the updated federated evaluation model to obtain a validation result;
    calculating a validation loss value of the validation result, and a moving average loss value over a preset period; and
    performing difference calculation on the validation loss value and the moving average loss value to obtain the reward value.
  20. A reinforcement-learning-based service allocation apparatus, wherein the reinforcement-learning-based service allocation apparatus comprises:
    a prediction module, configured to obtain feature vector information based on institutional private data of multiple participating terminals, and perform selection probability prediction on the feature vector information through a preset evaluator to obtain a selection probability corresponding to each participating terminal, wherein the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;
    a sampling module, configured to sample the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain sampling gradient information corresponding to each participating terminal;
    an update module, configured to update model parameters of a preset service evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain an updated federated evaluation model, and calculate a reward value through the updated federated evaluation model, wherein the reward value is used to indicate the cumulative return value of the feature vector information to the updated federated evaluation model;
    an evaluation module, configured to perform value evaluation on the feature vector information through the preset evaluator and the reward value to obtain a participant contribution degree corresponding to each participating terminal; and
    an allocation module, configured to perform service allocation on the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant service allocation information.
PCT/CN2021/083817 2020-11-19 2021-03-30 Method and apparatus for service allocation based on reinforcement learning WO2021208720A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011298673.1A CN112381428B (en) 2020-11-19 2020-11-19 Service distribution method, device, equipment and storage medium based on reinforcement learning
CN202011298673.1 2020-11-19

Publications (1)

Publication Number Publication Date
WO2021208720A1 true WO2021208720A1 (en) 2021-10-21

Family

ID=74585220

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/083817 WO2021208720A1 (en) 2020-11-19 2021-03-30 Method and apparatus for service allocation based on reinforcement learning

Country Status (2)

Country Link
CN (1) CN112381428B (en)
WO (1) WO2021208720A1 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114021464A (en) * 2021-11-09 2022-02-08 京东科技信息技术有限公司 Data processing method, device and storage medium
CN114819197A (en) * 2022-06-27 2022-07-29 杭州同花顺数据开发有限公司 Block chain alliance-based federal learning method, system, device and storage medium
CN115296927A (en) * 2022-09-28 2022-11-04 山东省计算中心(国家超级计算济南中心) Block chain-based federal learning credible fusion excitation method and system
CN116451593A (en) * 2023-06-14 2023-07-18 北京邮电大学 Reinforced federal learning dynamic sampling method and equipment based on data quality evaluation
CN117009095A (en) * 2023-10-07 2023-11-07 湘江实验室 Privacy data processing model generation method, device, terminal equipment and medium
CN117076113A (en) * 2023-08-17 2023-11-17 重庆理工大学 Industrial heterogeneous equipment multi-job scheduling method based on federal learning
CN117172338A (en) * 2023-11-02 2023-12-05 数据空间研究院 Contribution evaluation method in longitudinal federal learning scene
CN117952185A (en) * 2024-03-15 2024-04-30 中国科学技术大学 Financial field large model training method and system based on multidimensional data evaluation

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112381428B (en) * 2020-11-19 2023-09-19 平安科技(深圳)有限公司 Service distribution method, device, equipment and storage medium based on reinforcement learning
CN117575291B (en) * 2024-01-15 2024-05-10 湖南科技大学 Federal learning data collaborative management method based on edge parameter entropy

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110490335A (en) * 2019-08-07 2019-11-22 深圳前海微众银行股份有限公司 A kind of method and device calculating participant's contribution rate
US20200074236A1 (en) * 2018-08-31 2020-03-05 Hitachi, Ltd. Reward function generation method and computer system
CN110910158A (en) * 2019-10-08 2020-03-24 深圳逻辑汇科技有限公司 Federal learning revenue allocation method and system
CN112381428A (en) * 2020-11-19 2021-02-19 平安科技(深圳)有限公司 Business allocation method, device, equipment and storage medium based on reinforcement learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110782042B (en) * 2019-10-29 2022-02-11 深圳前海微众银行股份有限公司 Method, device, equipment and medium for combining horizontal federation and vertical federation
CN111611610B (en) * 2020-04-12 2023-05-30 西安电子科技大学 Federal learning information processing method, system, storage medium, program, and terminal

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200074236A1 (en) * 2018-08-31 2020-03-05 Hitachi, Ltd. Reward function generation method and computer system
CN110490335A (en) * 2019-08-07 2019-11-22 深圳前海微众银行股份有限公司 A kind of method and device calculating participant's contribution rate
CN110910158A (en) * 2019-10-08 2020-03-24 深圳逻辑汇科技有限公司 Federal learning revenue allocation method and system
CN112381428A (en) * 2020-11-19 2021-02-19 平安科技(深圳)有限公司 Business allocation method, device, equipment and storage medium based on reinforcement learning

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114021464A (en) * 2021-11-09 2022-02-08 京东科技信息技术有限公司 Data processing method, device and storage medium
CN114819197A (en) * 2022-06-27 2022-07-29 杭州同花顺数据开发有限公司 Block chain alliance-based federal learning method, system, device and storage medium
CN114819197B (en) * 2022-06-27 2023-07-04 杭州同花顺数据开发有限公司 Federal learning method, system, device and storage medium based on blockchain alliance
CN115296927A (en) * 2022-09-28 2022-11-04 山东省计算中心(国家超级计算济南中心) Block chain-based federal learning credible fusion excitation method and system
CN116451593A (en) * 2023-06-14 2023-07-18 北京邮电大学 Reinforced federal learning dynamic sampling method and equipment based on data quality evaluation
CN116451593B (en) * 2023-06-14 2023-11-14 北京邮电大学 Reinforced federal learning dynamic sampling method and equipment based on data quality evaluation
CN117076113A (en) * 2023-08-17 2023-11-17 重庆理工大学 Industrial heterogeneous equipment multi-job scheduling method based on federal learning
CN117009095A (en) * 2023-10-07 2023-11-07 湘江实验室 Privacy data processing model generation method, device, terminal equipment and medium
CN117009095B (en) * 2023-10-07 2024-01-02 湘江实验室 Privacy data processing model generation method, device, terminal equipment and medium
CN117172338A (en) * 2023-11-02 2023-12-05 数据空间研究院 Contribution evaluation method in longitudinal federal learning scene
CN117172338B (en) * 2023-11-02 2024-02-02 数据空间研究院 Contribution evaluation method in longitudinal federal learning scene
CN117952185A (en) * 2024-03-15 2024-04-30 中国科学技术大学 Financial field large model training method and system based on multidimensional data evaluation

Also Published As

Publication number Publication date
CN112381428B (en) 2023-09-19
CN112381428A (en) 2021-02-19

Similar Documents

Publication Publication Date Title
WO2021208720A1 (en) Method and apparatus for service allocation based on reinforcement learning
Liu et al. Gtg-shapley: Efficient and accurate participant contribution evaluation in federated learning
CN112465626B (en) Combined risk assessment method based on client classification aggregation and related equipment
CN108364195B (en) User retention probability prediction method and device, prediction server and storage medium
CN114297722B (en) Privacy protection asynchronous federal sharing method and system based on block chain
US7366680B1 (en) Project management system and method for assessing relationships between current and historical projects
CN112799708B (en) Method and system for jointly updating business model
CN111105240A (en) Resource-sensitive combined financial fraud detection model training method and detection method
Byanjankar Predicting credit risk in Peer-to-Peer lending with survival analysis
Eddy et al. Credit scoring models: Techniques and issues
Byanjankar et al. Data‐driven optimization of peer‐to‐peer lending portfolios based on the expected value framework
CN115049011A (en) Method and device for determining contribution degree of training member model of federal learning
US20200034831A1 (en) Combining explicit and implicit feedback in self-learning fraud detection systems
US20140344020A1 (en) Competitor pricing strategy determination
Aggarwal et al. A Structural Analysis of Bitcoin Cash's Emergency Difficulty Adjustment Algorithm
CN112308623A (en) High-quality client loss prediction method and device based on supervised learning and storage medium
Dorner et al. Incentivizing honesty among competitors in collaborative learning and optimization
CN116596659A (en) Enterprise intelligent credit approval method, system and medium based on big data wind control
Zhao et al. Addressing budget allocation and revenue allocation in data market environments using an adaptive sampling algorithm
CN111382909A (en) Rejection inference method based on survival analysis model expansion bad sample and related equipment
CN115713389A (en) Financial product recommendation method and device
US20140344021A1 (en) Reactive competitor price determination using a competitor response model
US20140344022A1 (en) Competitor response model based pricing tool
CN114820160A (en) Loan interest rate estimation method, device, equipment and readable storage medium
US20110295767A1 (en) Inverse solution for structured finance

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21788006

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21788006

Country of ref document: EP

Kind code of ref document: A1