CN112381428B - Service distribution method, device, equipment and storage medium based on reinforcement learning - Google Patents

Service distribution method, device, equipment and storage medium based on reinforcement learning

Info

Publication number
CN112381428B
CN112381428B (application CN202011298673.1A)
Authority
CN
China
Prior art keywords
preset
information
feature vector
participant
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011298673.1A
Other languages
Chinese (zh)
Other versions
CN112381428A
Inventor
朱星华
王健宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011298673.1A priority Critical patent/CN112381428B/en
Publication of CN112381428A publication Critical patent/CN112381428A/en
Priority to PCT/CN2021/083817 priority patent/WO2021208720A1/en
Application granted granted Critical
Publication of CN112381428B publication Critical patent/CN112381428B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q 10/00 Administration; Management
    • G06Q 10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q 10/063 Operations research, analysis or management
    • G06Q 10/0631 Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q 10/06312 Adjustment or analysis of established resource schedule, e.g. resource or task levelling, or dynamic rescheduling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Abstract

The invention relates to the technical field of artificial intelligence, and provides a reinforcement learning-based service allocation method, apparatus, device, and storage medium for improving the accuracy of service allocation. The reinforcement learning-based service allocation method comprises the following steps: performing selection probability prediction on feature vector information that is based on the institution private data of a plurality of participating terminals, to obtain selection probabilities; sampling the feature vector information according to the selection probabilities to obtain sampling gradient information; updating model parameters of a preset business evaluation federal model according to the sampling gradient information to obtain an updated federal evaluation model, and calculating a reward value through the updated federal evaluation model; performing value evaluation on the feature vector information through a preset evaluator and the reward value to obtain the contribution degree of each participant; and performing service allocation to the plurality of participating terminals according to the contribution degrees of the participants to obtain participant service allocation information. In addition, the invention also relates to blockchain technology: the institution private data can be stored in a blockchain.

Description

Service distribution method, device, equipment and storage medium based on reinforcement learning
Technical Field
The present invention relates to the machine learning field of artificial intelligence, and in particular to a reinforcement learning-based service allocation method, apparatus, device, and storage medium.
Background
With the development of Internet technology and services, how enterprises and institutions manage, control, and invoke service information has become an important concern, for example: allocating business to each sub-enterprise according to the service information provided by all sub-enterprises belonging to the enterprise.
Currently, in order to better allocate business among the sub-enterprises according to the service information they provide, the service contribution degree of a sub-enterprise is generally assessed from the data volume of the service information it provides, and business is then allocated to the sub-enterprises according to the assessed contribution degrees.
Because the quality of the service information provided by different sub-enterprises varies, evaluating each sub-enterprise's contribution solely from the data volume of the information it provides is not persuasive. It also easily leads a sub-enterprise, guided by the goal of maximizing its own benefit, to obtain a larger contribution degree by submitting a large amount of low-quality service information, which adversely affects the overall allocation and results in low accuracy of service allocation.
Disclosure of Invention
The invention provides a service distribution method, device, equipment and storage medium based on reinforcement learning, which are used for improving the accuracy of service distribution.
The first aspect of the present invention provides a service allocation method based on reinforcement learning, including:
acquiring feature vector information based on mechanism private data of a plurality of participating terminals, and predicting the selection probability of the feature vector information through a preset evaluator to obtain the corresponding selection probability of each participating terminal, wherein the preset evaluator is used for evaluating the gradient value of the feature vector information provided by each participating terminal;
sampling the feature vector information through a preset sampler and the selection probabilities corresponding to the participating terminals, to obtain sampling gradient information corresponding to each participating terminal;
updating model parameters of a preset business evaluation federation model according to sampling gradient information corresponding to each participating end to obtain an updated federation evaluation model, and calculating a reward value through the updated federation evaluation model, wherein the reward value is used for indicating a return accumulated value of the feature vector information for the updated federation evaluation model;
performing value evaluation on the feature vector information through the preset evaluator and the reward value to obtain the contribution degree of the participants corresponding to each participant;
and carrying out service distribution on the plurality of participant terminals according to the contribution degrees of the participants corresponding to the participant terminals to obtain participant service distribution information.
Optionally, in a first implementation manner of the first aspect of the present invention, the sampling the feature vector information through a preset sampler and the selection probabilities corresponding to each of the participating ends to obtain sampling gradient information corresponding to each of the participating ends includes:
calculating the feature vector information according to the selection probability corresponding to each participation end by presetting a polynomial-based distribution algorithm in a sampler to obtain a selection vector corresponding to each participation end, wherein the feature vector information comprises model gradient information of each participation end;
and when the selection vector corresponding to each participation terminal is a preset value, sampling the model gradient information of each participation terminal according to the selection vector corresponding to each participation terminal to obtain sampling gradient information corresponding to each participation terminal.
Optionally, in a second implementation manner of the first aspect of the present invention, the performing, by using the preset evaluator and the reward value, value evaluation on the feature vector information to obtain a contribution degree of a participant corresponding to each participant includes:
carrying out loss function calculation on the reward value and the selection vector through a preset Monte Carlo policy gradient algorithm to obtain a loss function of the preset evaluator;
training the preset evaluator through the loss function of the preset evaluator until the loss function converges to obtain a target evaluator;
and performing value evaluation on the feature vector information through the target evaluator to obtain the contribution degree of the participants corresponding to each participant.
Optionally, in a third implementation manner of the first aspect of the present invention, the obtaining feature vector information of private data of a mechanism based on a plurality of participating ends, and performing selection probability prediction on the feature vector information through a preset evaluator, to obtain a selection probability corresponding to each participating end, includes:
respectively sending model gradient calculation instructions to a plurality of participating terminals so that each participating terminal obtains mechanism private data of the participating terminal according to the model gradient calculation instructions;
training a preset participant distribution model through the mechanism private data of the participant, and calculating the parameter gradient of the trained preset participant distribution model through a preset gradient descent algorithm to obtain feature vector information corresponding to each participant, wherein the mechanism private data comprises at least one of medical private data of a medical institution, financial business private data of a financial institution and insurance private data of an insurance institution;
and receiving the feature vector information corresponding to each participation terminal sent by each participation terminal, and carrying out selection probability calculation on the feature vector information corresponding to each participation terminal through a gradient cost function in a preset evaluator to obtain the selection probability corresponding to each participation terminal.
Optionally, in a fourth implementation manner of the first aspect of the present invention, the calculating, by the updated federal assessment model, a reward value includes:
acquiring verification set data of the feature vector information, and verifying the verification set data through the updated federal model to obtain a verification result;
calculating a verification loss value of the verification result and a moving average loss value of a preset period;
and carrying out difference value calculation on the verification loss value and the moving average loss value to obtain a reward value.
Optionally, in a fifth implementation manner of the first aspect of the present invention, the performing service allocation on the multiple participating ends according to the contribution degrees of the participants corresponding to the participating ends to obtain participant service allocation information includes:
acquiring a contribution degree occupation ratio of the contribution degrees of the participants corresponding to each participant terminal, and judging whether the contribution degree occupation ratio is larger than a preset threshold value;
if the contribution ratio is larger than a preset threshold value, a preset allocation strategy is called, and service allocation is carried out on the multiple participating ends to obtain participant service allocation information;
and if the contribution ratio is smaller than or equal to a preset threshold, carrying out service distribution on the plurality of participation terminals according to the contribution ratio to obtain participant service distribution information.
Optionally, in a sixth implementation manner of the first aspect of the present invention, after performing service allocation on the plurality of participant terminals according to the contribution degrees of the participants corresponding to the participant terminals to obtain the service allocation information of the participants, the method further includes:
obtaining abnormal information of the participant service allocation information, updating the participant service allocation information according to the abnormal information, and optimizing a determination strategy of the selection probability corresponding to each participant terminal.
The second aspect of the present invention provides a service allocation apparatus based on reinforcement learning, including:
the prediction module is used for acquiring feature vector information of the private data of the mechanism based on a plurality of participation terminals, and predicting the selection probability of the feature vector information through a preset evaluator to obtain the corresponding selection probability of each participation terminal, wherein the preset evaluator is used for evaluating the gradient value of the feature vector information provided by each participation terminal;
the sampling module is used for sampling the feature vector information through a preset sampler and the selection probability corresponding to each participation end to obtain sampling gradient information corresponding to each participation end;
the updating module is used for updating the model parameters of the preset business evaluation federal model according to the sampling gradient information corresponding to each participating end to obtain an updated federal evaluation model, and calculating a reward value through the updated federal evaluation model, wherein the reward value is used for indicating the accumulated value of the return of the feature vector information to the updated federal evaluation model;
the evaluation module is used for carrying out value evaluation on the feature vector information through the preset evaluator and the rewarding value to obtain the contribution degree of the participants corresponding to each participant;
and the distribution module is used for carrying out service distribution on the plurality of the participation terminals according to the contribution degrees of the participants corresponding to the participation terminals to obtain the service distribution information of the participants.
Optionally, in a first implementation manner of the second aspect of the present invention, the sampling module includes:
the computing unit is used for computing the feature vector information according to the selection probability corresponding to each participation end through a polynomial-based distribution algorithm in a preset sampler to obtain the selection vector corresponding to each participation end, wherein the feature vector information comprises model gradient information of each participation end;
and the sampling unit is used for sampling the model gradient information of each participating end according to the selection vector corresponding to each participating end when the selection vector corresponding to each participating end is a preset value, so as to obtain the sampling gradient information corresponding to each participating end.
Optionally, in a second implementation manner of the second aspect of the present invention, the evaluation module is specifically configured to:
carrying out loss function calculation on the reward value and the selection vector through a preset Monte Carlo policy gradient algorithm to obtain a loss function of the preset evaluator;
training the preset evaluator through the loss function of the preset evaluator until the loss function converges to obtain a target evaluator;
and performing value evaluation on the feature vector information through the target evaluator to obtain the contribution degree of the participants corresponding to each participant.
Optionally, in a third implementation manner of the second aspect of the present invention, the prediction module is specifically configured to:
respectively sending model gradient calculation instructions to a plurality of participating terminals so that each participating terminal obtains mechanism private data of the participating terminal according to the model gradient calculation instructions;
training a preset participant distribution model through the mechanism private data of the participant, and calculating the parameter gradient of the trained preset participant distribution model through a preset gradient descent algorithm to obtain feature vector information corresponding to each participant, wherein the mechanism private data comprises at least one of medical private data of a medical institution, financial business private data of a financial institution and insurance private data of an insurance institution;
and receiving the feature vector information corresponding to each participation terminal sent by each participation terminal, and carrying out selection probability calculation on the feature vector information corresponding to each participation terminal through a gradient cost function in a preset evaluator to obtain the selection probability corresponding to each participation terminal.
Optionally, in a fourth implementation manner of the second aspect of the present invention, the update module is specifically configured to:
acquiring verification set data of the feature vector information, and verifying the verification set data through the updated federal model to obtain a verification result;
calculating a verification loss value of the verification result and a moving average loss value of a preset period;
and carrying out difference value calculation on the verification loss value and the moving average loss value to obtain a reward value.
Optionally, in a fifth implementation manner of the second aspect of the present invention, the allocation module is specifically configured to:
acquiring a contribution degree occupation ratio of the contribution degrees of the participants corresponding to each participant terminal, and judging whether the contribution degree occupation ratio is larger than a preset threshold value;
if the contribution ratio is larger than a preset threshold value, a preset allocation strategy is called, and service allocation is carried out on the multiple participating ends to obtain participant service allocation information;
and if the contribution ratio is smaller than or equal to a preset threshold, carrying out service distribution on the plurality of participation terminals according to the contribution ratio to obtain participant service distribution information.
Optionally, in a sixth implementation manner of the second aspect of the present invention, the reinforcement learning-based service allocation device further includes:
and the updating and optimizing module is used for acquiring the abnormal information of the participant service allocation information, updating the participant service allocation information according to the abnormal information, and optimizing the determination strategy of the selection probability corresponding to each participant terminal.
A third aspect of the present invention provides a reinforcement learning-based service distribution device, including: a memory and at least one processor, the memory having instructions stored therein; the at least one processor invokes the instructions in the memory to cause the reinforcement learning-based service distribution device to perform the reinforcement learning-based service distribution method described above.
A fourth aspect of the present invention provides a computer readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the reinforcement learning based service allocation method described above.
According to the technical scheme provided by the invention, feature vector information based on the institution private data of a plurality of participating terminals is acquired, and selection probability prediction is performed on the feature vector information through a preset evaluator to obtain the selection probability corresponding to each participating terminal; the feature vector information is sampled through a preset sampler and the selection probability corresponding to each participating terminal to obtain sampling gradient information corresponding to each participating terminal; model parameters of a preset business evaluation federal model are updated according to the sampling gradient information corresponding to each participating terminal to obtain an updated federal evaluation model, and a reward value is calculated through the updated federal evaluation model; value evaluation is performed on the feature vector information through the preset evaluator and the reward value to obtain the contribution degree of the participant corresponding to each participating terminal; and service allocation is performed to the plurality of participating terminals according to the contribution degrees of the participants corresponding to the participating terminals to obtain participant service allocation information. In the embodiment of the invention, the feature vector information based on the institution private data of the plurality of participating terminals is selectively sampled according to the selection probabilities output by the preset evaluator, the model parameters of the preset business evaluation federal model are updated according to the sampling gradient information corresponding to each participating terminal, value evaluation is performed on the feature vector information through the preset evaluator and the reward value, and service allocation is performed to the plurality of participating terminals according to the contribution degree of the participant corresponding to each participating terminal. This reduces computational complexity, improves the accuracy and efficiency of participant contribution evaluation, and thereby improves the accuracy of service allocation.
Drawings
FIG. 1 is a schematic diagram of an embodiment of a reinforcement learning-based service allocation method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of another embodiment of a reinforcement learning-based service allocation method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an embodiment of a reinforcement learning-based service distribution device according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of another embodiment of a reinforcement learning-based service distribution device according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an embodiment of a reinforcement learning-based service distribution device according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention provide a reinforcement learning-based service distribution method, apparatus, device, and storage medium, which improve the accuracy of service distribution.
The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.
For ease of understanding, a specific flow of an embodiment of the present invention is described below. Referring to FIG. 1, one embodiment of the reinforcement learning-based service allocation method in the embodiments of the present invention includes:
101. Acquiring feature vector information based on the institution private data of a plurality of participating terminals, and performing selection probability prediction on the feature vector information through a preset evaluator to obtain the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal.
It can be understood that the execution subject of the present invention may be a reinforcement learning-based service distribution apparatus, and may also be a terminal or a server, which is not limited herein. The embodiments of the present invention are described by taking a server as the execution subject as an example.
The plurality of participating terminals include, but are not limited to, terminals or servers corresponding to medical institutions, insurance institutions, and financial institutions, for example: a terminal or server corresponding to a unit in a financial institution, a bank, or another finance-related institution or center. The plurality of participating terminals may be terminals or servers corresponding to the same type of institution, for example: all of the participating terminals are terminals or servers corresponding to financial institutions. The plurality of participating terminals may also be terminals or servers corresponding to different types of institutions, for example: there are three participating terminals, one being a terminal or server corresponding to a financial institution, another being a terminal or server corresponding to an insurance institution, and the third being a terminal or server corresponding to a medical institution.
The institution private data is the non-shared, encrypted private data of a participating terminal, such as: the medical private data of a medical institution, or the private data of various financial businesses of a financial institution. The feature vector information may be the gradient information corresponding to the model parameters obtained when each participating terminal performs gradient descent processing on the model according to its institution private data, or it may be the proportion that each participating terminal's institution private data occupies in the preset business evaluation federal model when that model performs a certain business processing on all the institution private data.
The server builds the preset business evaluation federal model in advance according to the institution private data of the plurality of participating terminals. The server receives the feature vector information based on the institution private data sent by each of the plurality of participating terminals and invokes the preset evaluator, which is a deep neural network. The deep neural network includes a proportion selection (roulette-wheel) algorithm: the fitness of each piece of feature vector information is calculated through the proportion selection algorithm, the cumulative probability that each piece of feature vector information is inherited into the next-generation group is calculated according to its fitness, a random number uniformly distributed in the interval [0,1] is generated, feature vector information is selected according to the random number, and the probability values of the pieces of feature vector information selected for each participating terminal are normalized to obtain the selection probability corresponding to each participating terminal. The preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal, where the gradient value is the degree to which the gradient information contributes to model training, or the degree to which each participating terminal's institution private data acts on a certain business processing direction of the preset business evaluation federal model over all institution private data.
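As an illustration of the proportion-selection step described above, the following Python sketch shows how roulette-wheel selection over per-piece fitness values can yield a normalized selection probability for each participating terminal; the participant names, fitness scores, and the final normalization are hypothetical stand-ins for the deep neural network in the preset evaluator.

    import numpy as np

    def roulette_selection_probabilities(fitness_per_participant, rng=None):
        """Illustrative proportion-selection (roulette-wheel) step.
        fitness_per_participant maps a participant id to a 1-D array of fitness
        scores, one per piece of feature vector information; the return value
        maps each participant id to a normalized selection probability."""
        rng = rng or np.random.default_rng(0)
        raw = {}
        for pid, fitness in fitness_per_participant.items():
            probs = fitness / fitness.sum()            # selection probability of each piece
            cum = np.cumsum(probs)                     # cumulative probabilities (the "wheel")
            r = rng.uniform(0.0, 1.0)                  # uniform random number in [0, 1]
            chosen = min(np.searchsorted(cum, r), len(probs) - 1)
            raw[pid] = probs[chosen]                   # probability value of the piece that was hit
        total = sum(raw.values())
        return {pid: v / total for pid, v in raw.items()}  # normalize across participants

    # toy usage with three hypothetical participating terminals
    fitness = {
        "participant_1": np.array([0.2, 0.5, 0.3]),
        "participant_2": np.array([0.9, 0.1]),
        "participant_3": np.array([0.4, 0.4, 0.2]),
    }
    print(roulette_selection_probabilities(fitness))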
102. Sampling the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal, to obtain the sampling gradient information corresponding to each participating terminal.
The server may invoke a deterministic algorithm in the preset sampler to generate a pseudorandom number sequence over the gradient information in the interval [0,1], and randomly sample the sequence according to the selection probability corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal. Alternatively, the server may invoke the preset sampler to divide all the gradient information corresponding to each participating terminal into several pieces of data according to the time order of the gradient information, and extract corresponding data from each piece according to the selection probability corresponding to each participating terminal to obtain the sampling gradient information. The server may also invoke the preset sampler to classify all the gradient information into preset categories (one or more categories) to obtain several categories of gradient information, randomly extract corresponding gradient information from each category according to the selection probability corresponding to each participating terminal, and combine the extracted gradient information of the categories to obtain the sampling gradient information corresponding to each participating terminal.
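A minimal sketch of the second variant described above (chunking one participant's gradient information by time order and extracting from each chunk according to its selection probability); the chunk count, quota rule, and toy gradient values are assumptions for illustration only.

    import numpy as np

    def time_ordered_gradient_sample(gradients, selection_prob, num_chunks=4, rng=None):
        """Hypothetical sketch: split one participant's gradient information into
        time-ordered chunks and extract a quota from each chunk according to that
        participant's selection probability."""
        rng = rng or np.random.default_rng(0)
        chunks = np.array_split(np.asarray(gradients, dtype=float), num_chunks)
        sampled = []
        for chunk in chunks:
            quota = max(1, int(round(selection_prob * len(chunk))))   # per-chunk quota
            idx = rng.choice(len(chunk), size=min(quota, len(chunk)), replace=False)
            sampled.append(chunk[np.sort(idx)])                       # keep the original time order
        return np.concatenate(sampled)                                # sampling gradient information

    toy_gradients = np.arange(20, dtype=float)   # stand-in for one participant's gradient sequence
    print(time_ordered_gradient_sample(toy_gradients, selection_prob=0.3))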
103. Updating the model parameters of the preset business evaluation federal model according to the sampling gradient information corresponding to each participating terminal to obtain an updated federal evaluation model, and calculating a reward value through the updated federal evaluation model, where the reward value is used to indicate the accumulated return value of the feature vector information for the updated federal evaluation model.
The server normalizes the sampling gradient information corresponding to each participating terminal to obtain comprehensive gradient information, and continuously adjusts and updates the model parameters of the preset business evaluation federal model according to the comprehensive gradient information to obtain the updated federal evaluation model. Alternatively, the server may calculate attention over the sampling gradient information corresponding to each participating terminal through a preset attention mechanism to obtain attention gradient information corresponding to each participating terminal, and continuously adjust and update the model parameters of the preset business evaluation federal model through that attention gradient information. This preserves the characteristics of all the sampling gradient information while updating the model parameters with an emphasis (bias) on the more important gradients, thereby obtaining the updated federal evaluation model.
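The following is a hedged sketch of the attention-style aggregation described above; since the text does not specify the attention mechanism, a softmax over gradient norms is used purely as a stand-in, and the parameter update is plain gradient descent.

    import numpy as np

    def attention_weighted_update(params, sampled_grads, temperature=1.0, lr=0.01):
        """Hedged sketch: give each participant's sampled gradient a softmax weight
        (a stand-in for a learned attention mechanism) and update the federal
        model parameters with the weighted combination."""
        grads = np.stack([np.asarray(g, dtype=float) for g in sampled_grads.values()])
        scores = np.linalg.norm(grads, axis=1) / temperature    # toy attention score per participant
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                                # softmax attention weights
        aggregated = (weights[:, None] * grads).sum(axis=0)     # biased ("attended") aggregate gradient
        return params - lr * aggregated                         # gradient-descent parameter update

    params = np.zeros(3)                                        # toy federal-model parameters
    sampled = {"participant_1": [0.2, -0.1, 0.4],
               "participant_2": [0.5, 0.0, -0.3]}
    print(attention_weighted_update(params, sampled))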
The server receives the institution private data from each of the plurality of participating terminals, divides the institution private data into an institution data verification set according to a preset proportion, and, following the same procedure used to obtain the sampling gradient information, obtains the sampling gradient information corresponding to the institution data verification set, thereby obtaining the verification set data of the feature vector information. The server verifies the updated federal evaluation model through the verification set data, calculates the verification loss value of the verification and the moving average loss value over a preset period, and calculates the difference between the verification loss value and the moving average loss value to obtain the reward value. The reward value may be the total of the rewards (which may include a profit signal or an information gain) of the preset business evaluation federal model during training; it includes positive and negative reward values and is used to obtain an optimal strategy and/or an optimal path. The reward value may also be the accumulated return value of the preset business evaluation federal model on the optimal strategy and/or optimal path for a certain business processing of all institution private data, for example: the accumulated return value of the best policy for allocating insurance order amount data items.
104. Performing value evaluation on the feature vector information through the preset evaluator and the reward value, to obtain the contribution degree of the participant corresponding to each participating terminal.
The server acquires a business contribution influence factor and evaluates the business contribution value of the feature vector information through the preset evaluator, the reward value, and the business contribution influence factor, to obtain the participant contribution degree corresponding to each participating terminal. The contribution degree of each participating terminal may be the contribution of its institution private data when training the preset business evaluation federal model, or the contribution of its institution private data when the preset business evaluation federal model performs effective and accurate business processing on all institution private data; for example, in the accurate prediction of financial benefit data (the institution private data of each participating terminal) over a preset time period by the preset business evaluation federal model, it reflects the contribution and accuracy of each participant's institution private data. The business contribution influence factor is an influence factor used in calculating the business contribution, for example: taking the business of training an insurance order amount data prediction model based on federated learning as an illustration, the business contribution influence factors include the accuracy of the insurance orders, the organization of the provided information, the importance of the business type, and the like.
105. Performing service allocation to the plurality of participating terminals according to the contribution degree of the participant corresponding to each participating terminal, to obtain the participant service allocation information.
The server may judge whether the participant contribution degree corresponding to each participating terminal is smaller than a preset target value; if so, the participating terminal is removed and does not take part in the service allocation; if not, service allocation is performed to the plurality of participating terminals according to the participant contribution degrees corresponding to the participating terminals, to obtain the participant service allocation information. The participant service allocation information may be a revenue allocation, a reward allocation, and/or a priority setting based on the contribution of the institution private data, and the like.
In the embodiment of the invention, the feature vector information based on the institution private data of the plurality of participating terminals is selectively sampled according to the selection probabilities output by the preset evaluator, the model parameters of the preset business evaluation federal model are updated according to the sampling gradient information corresponding to each participating terminal, value evaluation is performed on the feature vector information through the preset evaluator and the reward value, and service allocation is performed to the plurality of participating terminals according to the contribution degree of the participant corresponding to each participating terminal. This reduces computational complexity, improves the accuracy and efficiency of participant contribution evaluation, and thereby improves the accuracy of service allocation.
Referring to fig. 2, another embodiment of a reinforcement learning-based service allocation method according to an embodiment of the present invention includes:
201. Acquiring feature vector information based on the institution private data of a plurality of participating terminals, and performing selection probability prediction on the feature vector information through a preset evaluator to obtain the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal.
Specifically, the server respectively sends model gradient calculation instructions to a plurality of participating terminals, so that each participating terminal obtains mechanism private data of the participating terminal according to the model gradient calculation instructions; training a preset participant distribution model through mechanism private data of the participant, and calculating parameter gradients of the trained preset participant distribution model through a preset gradient descent algorithm to obtain feature vector information corresponding to each participant, wherein the mechanism private data comprises at least one of medical private data of a medical institution, financial business private data of a financial institution and insurance private data of an insurance institution; and receiving the characteristic vector information corresponding to each participating end sent by each participating end, and carrying out selection probability calculation on the characteristic vector information corresponding to each participating end through a gradient cost function in a preset evaluator to obtain the selection probability corresponding to each participating end.
For example, taking the institution private data as financial business private data, and taking terminal 1 corresponding to financial institution 1, terminal 2 corresponding to financial institution 2, and terminal 3 corresponding to financial institution 3 as the plurality of participating terminals: the server sends model gradient calculation instructions to terminal 1, terminal 2, and terminal 3 respectively. Terminal 1 extracts its corresponding financial business private data 1 from a database according to the model gradient calculation instruction, inputs the financial business private data 1 into a preset participant allocation model 1, performs business allocation processing (i.e., training) on the financial business private data 1 through the preset participant allocation model 1, and calculates the parameter gradient of the trained preset participant allocation model 1 through a preset gradient descent algorithm to obtain feature vector information 1 corresponding to terminal 1. Terminal 1 sends the feature vector information 1 to the server; after receiving it, the server inputs the feature vector information into the gradient cost function to obtain the selection probability 1 corresponding to terminal 1. The gradient cost function is specifically: w = h_φ(δ), where w represents the selection probability, h_φ(δ) is the gradient cost function, δ represents the feature vector information, and φ represents the trainable parameters. The selection probability 2 corresponding to terminal 2 and the selection probability 3 corresponding to terminal 3 can be obtained in the same way.
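A minimal sketch of the gradient cost function w = h_φ(δ), assuming (for illustration only) that h_φ is a single linear layer followed by a sigmoid and that φ consists of a weight vector and a bias; the feature vector and parameter values are toy placeholders.

    import numpy as np

    def gradient_cost_function(delta, phi_w, phi_b):
        """Sketch of w = h_phi(delta): a single linear layer plus a sigmoid that
        squashes the score into (0, 1); phi_w and phi_b stand in for the
        trainable parameters phi."""
        score = float(np.dot(phi_w, delta) + phi_b)
        return 1.0 / (1.0 + np.exp(-score))                 # selection probability w

    delta_terminal_1 = np.array([0.12, -0.40, 0.33])        # toy feature vector information of terminal 1
    phi_w, phi_b = np.array([0.5, -0.2, 0.1]), 0.0          # toy trainable parameters
    print(gradient_cost_function(delta_terminal_1, phi_w, phi_b))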
202. Calculating the feature vector information according to the selection probability corresponding to each participating terminal through a polynomial (multinomial) distribution-based algorithm in a preset sampler, to obtain the selection vector corresponding to each participating terminal, where the feature vector information includes the model gradient information of each participating terminal.
The server calculates the model gradient information of each participating terminal according to the selection probability corresponding to each participating terminal through a probability formula in the polynomial (multinomial) distribution-based algorithm in the preset sampler, and obtains the selection vector ζ = [ζ_1, ζ_2, ζ_3, ..., ζ_n] corresponding to each participating terminal, where ζ_i ∈ {0, 1} and P(ζ_i = 1) = w_i, P represents a probability value, and w_i represents the selection probability.
203. When the selection vector corresponding to each participating terminal is a preset value, sampling the model gradient information of each participating terminal according to the selection vector corresponding to each participating terminal, to obtain the sampling gradient information corresponding to each participating terminal.
In this embodiment, the preset value is preferably 1. The server judges whether the selection vector corresponding to each participating terminal is the preset value (ζ_i = 1). If so, the server performs random sampling, systematic sampling, or stratified sampling on the model gradient information of each participating terminal according to the selection vector corresponding to each participating terminal, to obtain the sampling gradient information corresponding to each participating terminal. If not, the model gradient information of that participating terminal is not sampled; instead, the procedure for obtaining the selection vector corresponding to each participating terminal is executed cyclically to re-acquire the selection vector until the re-acquired selection vector is the preset value (i.e., ζ_i = 1), and the model gradient information of each participating terminal is then sampled according to the re-acquired selection vector to obtain the sampling gradient information corresponding to each participating terminal.
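A sketch of steps 202 and 203 under the assumption that the polynomial-distribution sampling amounts to an independent Bernoulli draw ζ_i with probability w_i per participating terminal, with a participant's model gradient information kept only when ζ_i equals the preset value 1; the terminal names and gradient values are toy placeholders.

    import numpy as np

    def draw_selection_vector(selection_probs, rng=None):
        """Sketch of step 202 as an independent Bernoulli draw per participating
        terminal: zeta_i = 1 with probability w_i."""
        rng = rng or np.random.default_rng(0)
        return {pid: int(rng.random() < w) for pid, w in selection_probs.items()}

    selection_probs = {"terminal_1": 0.8, "terminal_2": 0.4, "terminal_3": 0.6}
    gradients = {"terminal_1": [0.1, 0.2], "terminal_2": [0.3, -0.1], "terminal_3": [0.0, 0.5]}

    zeta = draw_selection_vector(selection_probs)
    # step 203: keep a participant's model gradient information only when zeta_i == 1
    sampling_gradient_info = {pid: gradients[pid] for pid, z in zeta.items() if z == 1}
    print(zeta, sampling_gradient_info)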
204. Updating the model parameters of the preset business evaluation federal model according to the sampling gradient information corresponding to each participating terminal to obtain an updated federal evaluation model, and calculating a reward value through the updated federal evaluation model, where the reward value is used to indicate the accumulated return value of the feature vector information for the updated federal evaluation model.
The server normalizes the sampling gradient information corresponding to each participating terminal to obtain comprehensive gradient information, and continuously adjusts and updates the model parameters of the preset business evaluation federal model according to the comprehensive gradient information to obtain the updated federal evaluation model. Alternatively, the server may calculate attention over the sampling gradient information corresponding to each participating terminal through a preset attention mechanism to obtain attention gradient information corresponding to each participating terminal, and continuously adjust and update the model parameters of the preset business evaluation federal model through that attention gradient information. This preserves the characteristics of all the sampling gradient information while updating the model parameters with an emphasis (bias) on the more important gradients, thereby obtaining the updated federal evaluation model.
Specifically, the server acquires the verification set data of the feature vector information, and verifies the verification set data through the updated federal evaluation model to obtain a verification result; calculates the verification loss value of the verification result and the moving average loss value over a preset period; and calculates the difference between the verification loss value and the moving average loss value to obtain the reward value.
When the server receives the feature vector information of the institution private data of each of the plurality of participating terminals, it divides the feature vector information of the institution private data into verification set data according to a preset proportion, verifies the verification set data through the updated federal evaluation model to obtain a verification result, and calculates the verification loss value of the verification result through a preset loss value calculation formula, which is specifically: l_v = (1/M) Σ_{k=1}^{M} L(f_θ(x_k^v), y_k^v), where l_v represents the verification loss value, the superscript v indicates that the data belong to the verification set (i.e., the verification result), M represents the number of data items in the verification set data, k represents the k-th data item, L(·) represents the required loss function, which includes a mean square error (MSE) function, a root-mean-square error (RMSE) function, a cross entropy loss function, and the like, f_θ represents the updated federal evaluation model, x represents the input data, i.e., the verification set data, and y represents the label corresponding to the verification set data. The server then calculates the moving average loss value over a preset period through a preset formula, specifically: δ ← ((T - 1)/T)·δ + (1/T)·l_v, where l_v represents the verification loss value, T represents the moving average window length of the preset period, and δ represents the moving average reference of the preset period. The moving average loss value is subtracted from the verification loss value to obtain the reward value.
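The reward computation above can be sketched as follows; the mean squared error loss, the toy stand-in model, and the exponential form of the moving average are assumptions, and the sign of the reward follows the wording above (verification loss minus moving average), although some formulations use the opposite sign.

    import numpy as np

    def reward_from_validation(model, x_val, y_val, moving_avg, window_T):
        """Sketch of the reward computation: evaluate the updated federal
        evaluation model on the verification set, take the difference between the
        verification loss and the moving-average baseline, then refresh the
        baseline. Mean squared error is used as the assumed loss function L."""
        l_v = float(np.mean((model(x_val) - y_val) ** 2))          # verification loss l_v
        reward = l_v - moving_avg                                   # sign convention as stated above
        new_moving_avg = ((window_T - 1) * moving_avg + l_v) / window_T
        return reward, new_moving_avg

    toy_model = lambda x: 0.5 * x                                   # stand-in updated federal evaluation model
    x_val, y_val = np.array([1.0, 2.0, 3.0]), np.array([0.6, 1.1, 1.4])
    print(reward_from_validation(toy_model, x_val, y_val, moving_avg=0.05, window_T=20))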
205. Performing value evaluation on the feature vector information through the preset evaluator and the reward value, to obtain the contribution degree of the participant corresponding to each participating terminal.
Specifically, the server calculates a loss function of the reward value and the selection vector through a preset Monte Carlo strategy gradient algorithm to obtain a loss function of a preset evaluator; training the preset evaluator through the loss function of the preset evaluator until the loss function converges to obtain a target evaluator; and carrying out value evaluation on the feature vector information through a target evaluator to obtain the contribution degree of the participants corresponding to each participant.
The server calculates the loss function of the reward value and the selection vector through a calculation formula in the Monte Carlo policy gradient REINFORCE algorithm to obtain the loss function of the preset evaluator, where the calculation formula is specifically: l_h = -(r/N) Σ_{i=1}^{N} [ s_i·log h_φ(δ_i) + (1 - s_i)·log(1 - h_φ(δ_i)) ], where r represents the reward value, N represents the number of selection vectors, i represents the i-th selection vector, s_i represents the selection vector, δ_i represents the sampling gradient information, and h_φ(δ_i) represents the gradient cost function. The server updates the trainable parameters φ of the preset evaluator through the loss function of the preset evaluator so as to train the preset evaluator and obtain the target evaluator, where the formula for updating the trainable parameters φ is: φ ← φ - β·∇_φ l_h, where β represents the learning rate and l_h represents the loss function of the preset evaluator. The server then performs value evaluation on the feature vector information through the target evaluator, to obtain the contribution degree of the participant corresponding to each participating terminal.
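A sketch of the REINFORCE-style evaluator update for the sigmoid policy assumed earlier; the analytic gradient below is specific to that assumed single-layer form of h_φ and is not taken from the patent, and all numeric values are toy placeholders.

    import numpy as np

    def evaluator_loss_and_update(phi_w, phi_b, deltas, s, reward, beta=0.05):
        """Sketch of the REINFORCE update: l_h = -(r / N) * sum_i [ s_i * log h(delta_i)
        + (1 - s_i) * log(1 - h(delta_i)) ], followed by phi <- phi - beta * grad(l_h),
        for the assumed single-layer sigmoid form of h_phi."""
        deltas = np.asarray(deltas, dtype=float)
        s = np.asarray(s, dtype=float)
        h = 1.0 / (1.0 + np.exp(-(deltas @ phi_w + phi_b)))        # h_phi(delta_i)
        eps = 1e-8
        log_pi = s * np.log(h + eps) + (1 - s) * np.log(1 - h + eps)
        loss = -reward * np.mean(log_pi)                           # loss of the preset evaluator
        grad_scores = -reward * (s - h) / len(s)                   # analytic gradient for the sigmoid policy
        grad_w, grad_b = deltas.T @ grad_scores, grad_scores.sum()
        return loss, phi_w - beta * grad_w, phi_b - beta * grad_b  # updated trainable parameters phi

    phi_w, phi_b = np.array([0.5, -0.2, 0.1]), 0.0
    deltas = [[0.12, -0.40, 0.33], [0.05, 0.22, -0.10]]            # sampled feature vector information
    s = [1, 0]                                                     # selection vector
    print(evaluator_loss_and_update(phi_w, phi_b, deltas, s, reward=0.02))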
206. Performing service allocation to the plurality of participating terminals according to the contribution degree of the participant corresponding to each participating terminal, to obtain the participant service allocation information.
Specifically, the server acquires a contribution degree ratio of the contribution degrees of the participants corresponding to each participant terminal, and judges whether the contribution degree ratio is larger than a preset threshold value; if the contribution ratio is larger than a preset threshold value, a preset allocation strategy is called, and service allocation is carried out on a plurality of participation terminals to obtain participant service allocation information; and if the contribution ratio is smaller than or equal to a preset threshold, carrying out service distribution on the plurality of participation terminals according to the contribution ratio to obtain the service distribution information of the participants.
If the contribution degree of a participant is not smaller than the preset contribution degree (i.e., the participating terminal is not removed), the server calculates the ratio between that participant's contribution degree and the total contribution degree of the remaining participants, to obtain the contribution ratio of the participant contribution degree corresponding to that participating terminal.
For example, taking the service allocation as a reward (consideration) allocation and the participant service allocation information as participant reward allocation information, where the reward is allocated according to the contribution degree of each participating terminal's feature vector information to the updated federal evaluation model: suppose the participant contribution degrees corresponding to the participating terminals are 0.03 (participant 1), 0.24 (participant 2), 0.40 (participant 3), and 0.33 (participant 4), the preset contribution degree is 0.20, the preset threshold is 0.40, and the total amount to be allocated is 1,000,000 yuan. Participant 1 (0.03) is below the preset contribution degree and is removed, yielding participant service allocation information 1 (participant 1 receives a reward of 0 yuan). The contribution ratios of the remaining contribution degrees 0.24 (participant 2), 0.40 (participant 3), and 0.33 (participant 4) are 0.247 (participant 2), 0.412 (participant 3), and 0.340 (participant 4). Only 0.412 is greater than the preset threshold 0.40, so the preset allocation strategy is invoked for participant 3 to obtain participant service allocation information 3; participant 2 and participant 4 are allocated according to their contribution ratios, i.e., 0.247 × 1,000,000 = 247,000 yuan and 0.340 × 1,000,000 = 340,000 yuan, yielding participant service allocation information 2 (participant 2 receives a reward of 247,000 yuan) and participant service allocation information 4 (participant 4 receives a reward of 340,000 yuan).
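The allocation logic worked through in this example can be sketched as follows; the preset allocation strategy for participants whose contribution ratio exceeds the threshold is left as a caller-supplied hook because the text does not specify it, and all names and amounts are illustrative.

    def allocate_rewards(contributions, total_reward, preset_contribution=0.20,
                         preset_threshold=0.40, preset_strategy=None):
        """Sketch of the allocation logic above: participants below the preset
        contribution degree are removed (reward 0), the remaining contributions
        are normalized into contribution ratios, ratios above the preset threshold
        are handed to a caller-supplied preset allocation strategy, and the rest
        are paid in proportion to their ratio."""
        kept = {p: c for p, c in contributions.items() if c >= preset_contribution}
        allocation = {p: 0.0 for p in contributions if p not in kept}   # removed participants
        total_kept = sum(kept.values())
        for p, c in kept.items():
            ratio = c / total_kept                                      # contribution ratio
            if ratio > preset_threshold and preset_strategy is not None:
                allocation[p] = preset_strategy(p, ratio, total_reward)
            else:
                allocation[p] = ratio * total_reward                    # proportional allocation
        return allocation

    contributions = {"participant_1": 0.03, "participant_2": 0.24,
                     "participant_3": 0.40, "participant_4": 0.33}
    print(allocate_rewards(contributions, total_reward=1_000_000))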
Specifically, the server performs service distribution on a plurality of participant terminals according to the contribution degrees of the participants corresponding to the participant terminals, obtains the participant service distribution information, then obtains the abnormal information of the participant service distribution information, updates the participant service distribution information according to the abnormal information, and optimizes the determination strategy of the selection probability corresponding to the participant terminals.
The server encrypts the participant service allocation information and transmits it to an auditing terminal, which decrypts and audits it. If the participant service allocation information contains abnormal information, the abnormal information is fed back to the server. The server matches a corresponding optimization mechanism according to the abnormal information, where the optimization mechanism includes an optimization algorithm, an optimization strategy, an optimized execution process, and an optimized execution script. The abnormal information in the participant service allocation information is corrected (updated) through the optimization mechanism, and the strategy for determining the selection probability corresponding to each participating terminal is optimized through the optimization mechanism, where the determination strategy includes the model selection, model calculation, feature vector information selection, and the like for the selection probability corresponding to each participating terminal. Updating the participant service allocation information according to the abnormal information and optimizing the determination strategy of the selection probability corresponding to each participating terminal improves the convenience, accuracy, and efficiency of the calculations performed by the reinforcement learning-based service allocation method, and thereby further improves the accuracy of service allocation.
In the embodiment of the invention, the feature vector information based on the institution private data of the plurality of participating terminals is selectively sampled according to the selection probabilities output by the preset evaluator, the model parameters of the preset business evaluation federal model are updated according to the sampling gradient information corresponding to each participating terminal, value evaluation is performed on the feature vector information through the preset evaluator and the reward value, and service allocation is performed to the plurality of participating terminals according to the contribution degree of the participant corresponding to each participating terminal. This reduces computational complexity, improves the accuracy and efficiency of participant contribution evaluation, and thereby improves the accuracy of service allocation.
The service allocation method based on reinforcement learning in the embodiment of the present invention is described above, and the service allocation device based on reinforcement learning in the embodiment of the present invention is described below, referring to fig. 3, one embodiment of the service allocation device based on reinforcement learning in the embodiment of the present invention includes:
the prediction module 301 is configured to obtain feature vector information based on private data of a mechanism of multiple participating ends, and predict selection probability of the feature vector information through a preset evaluator, so as to obtain a selection probability corresponding to each participating end, where the preset evaluator is configured to evaluate gradient value of the feature vector information provided by each participating end;
the sampling module 302 is configured to sample the feature vector information through a preset sampler and the selection probability corresponding to each participating end, so as to obtain sampling gradient information corresponding to each participating end;
the updating module 303 is configured to update model parameters of a preset service evaluation federal model according to sampling gradient information corresponding to each participating end, obtain an updated federal evaluation model, and calculate a reward value through the updated federal evaluation model, where the reward value is used to indicate a cumulative value of returns of feature vector information to the updated federal evaluation model;
the evaluation module 304 is configured to perform value evaluation on the feature vector information through the preset evaluator and the reward value, to obtain the participant contribution degree corresponding to each participating end;
and the distribution module 305 is configured to perform service distribution on the multiple participating ends according to the contribution degrees of the participants corresponding to the participating ends, so as to obtain participant service distribution information.
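To make the cooperation of the five modules concrete, the following is a minimal end-to-end sketch in Python with NumPy. The logistic scoring function standing in for the preset evaluator, the Bernoulli draw standing in for the preset sampler, the averaged-gradient update standing in for the federal evaluation model, and all variable shapes are illustrative assumptions rather than the patent's exact models.

    import numpy as np

    rng = np.random.default_rng(0)

    def predict_selection_prob(grads, w):                   # prediction module 301 (assumed logistic form)
        return 1.0 / (1.0 + np.exp(-grads @ w))

    def sample_gradients(grads, probs):                     # sampling module 302
        mask = rng.binomial(1, probs)                       # 0/1 selection vector per participating end
        return grads * mask[:, None], mask

    def update_global_model(theta, sampled_grads, lr=0.1):  # updating module 303
        return theta - lr * sampled_grads.mean(axis=0)

    def reward_value(moving_avg_loss, val_loss):            # reward used by module 303
        return moving_avg_loss - val_loss

    def allocate(contributions):                            # allocation module 305
        total = max(float(np.sum(contributions)), 1e-12)
        return contributions / total

    grads = rng.normal(size=(3, 4))                         # feature vector info (gradients) from 3 participating ends
    probs = predict_selection_prob(grads, w=rng.normal(size=4))
    sampled, mask = sample_gradients(grads, probs)
    theta = update_global_model(np.zeros(4), sampled)
    contributions = probs                                   # evaluation module 304 stand-in: score used as contribution
    print(mask, allocate(contributions))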
The function implementation of each module in the service distribution device based on reinforcement learning corresponds to each step in the service distribution method embodiment based on reinforcement learning, and the function and implementation process thereof are not described in detail herein.
In the embodiment of the invention, the feature vector information based on the mechanism private data of the multiple participating ends is selectively sampled according to the selection probabilities output by the preset evaluator, the model parameters of the preset business evaluation federal model are updated according to the sampling gradient information corresponding to each participating end, the feature vector information is subjected to value evaluation through the preset evaluator and the reward value, and service allocation is performed on the multiple participating ends according to the participant contribution degree corresponding to each participating end. This reduces the computational complexity, improves both the accuracy and the efficiency of the participant contribution evaluation, and further improves the accuracy of service allocation.
Referring to fig. 4, another embodiment of a reinforcement learning-based service allocation apparatus according to an embodiment of the present invention includes:
the prediction module 301 is configured to acquire feature vector information based on the mechanism private data of multiple participating ends, and to perform selection probability prediction on the feature vector information through a preset evaluator to obtain the selection probability corresponding to each participating end, where the preset evaluator is configured to evaluate the gradient value of the feature vector information provided by each participating end;
the sampling module 302 is configured to sample the feature vector information through a preset sampler and the selection probability corresponding to each participating end, to obtain the sampling gradient information corresponding to each participating end;
the sampling module 302 specifically includes:
the computing unit 3021 is configured to calculate the feature vector information through a polynomial distribution-based algorithm in the preset sampler according to the selection probability corresponding to each participating end, to obtain the selection vector corresponding to each participating end, where the feature vector information includes the model gradient information of each participating end;
the sampling unit 3022 is configured to, when the selection vector corresponding to a participating end is a preset value, sample the model gradient information of that participating end according to its selection vector, to obtain the sampling gradient information corresponding to each participating end (a sketch of this sampling step is given after this module list);
the updating module 303 is configured to update the model parameters of a preset service evaluation federal model according to the sampling gradient information corresponding to each participating end to obtain an updated federal evaluation model, and to calculate a reward value through the updated federal evaluation model, where the reward value is used to indicate the accumulated return of the feature vector information to the updated federal evaluation model;
the evaluation module 304 is configured to perform value evaluation on the feature vector information through the preset evaluator and the reward value, to obtain the participant contribution degree corresponding to each participating end;
and the distribution module 305 is configured to perform service distribution on the multiple participating ends according to the contribution degrees of the participants corresponding to the participating ends, so as to obtain participant service distribution information.
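Expanding on the computing unit 3021 and the sampling unit 3022 above, the following Python sketch draws a selection vector from the selection probabilities and keeps a participating end's model gradient only when its entry equals the preset value. Reading the polynomial-based distribution as a per-participant Bernoulli draw (a two-outcome multinomial) is an interpretation, and the preset value of 1 is an assumption.

    import numpy as np

    rng = np.random.default_rng(1)

    def selection_vector(probs):
        # Computing unit 3021 (sketch): one 0/1 draw per participating end from its selection probability.
        return rng.binomial(1, probs)

    def sample_gradient_info(model_grads, sel, preset_value=1):
        # Sampling unit 3022 (sketch): retain a gradient only when the selection vector equals the preset value.
        return {pid: g for (pid, g), s in zip(model_grads.items(), sel) if s == preset_value}

    model_grads = {"end_1": np.array([0.2, -0.1]),
                   "end_2": np.array([0.4, 0.3]),
                   "end_3": np.array([-0.5, 0.1])}
    sel = selection_vector(np.array([0.9, 0.2, 0.7]))
    print(sel, sample_gradient_info(model_grads, sel))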
Optionally, the evaluation module 304 may be further specifically configured to:
performing loss function calculation on the reward value and the selection vectors through a preset Monte Carlo policy gradient algorithm to obtain the loss function of the preset evaluator;
training the preset evaluator through the loss function of the preset evaluator until the loss function converges to obtain a target evaluator;
and carrying out value evaluation on the feature vector information through a target evaluator to obtain the contribution degree of the participants corresponding to each participant.
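The evaluator loss referred to above can be written in a REINFORCE (Monte Carlo policy gradient) form. The sketch below treats each entry of the selection vector as an action sampled from a Bernoulli policy with the evaluator's selection probability and weights the log-likelihood by the reward value; the exact loss expression and any baseline term are not spelled out in the text, so this form is an assumption.

    import numpy as np

    def evaluator_loss(probs, sel, reward):
        # REINFORCE-style loss (assumed form):
        # L = -reward * sum_i [ s_i * log(p_i) + (1 - s_i) * log(1 - p_i) ]
        eps = 1e-8                                     # numerical guard for log(0)
        log_pi = sel * np.log(probs + eps) + (1 - sel) * np.log(1 - probs + eps)
        return -reward * float(np.sum(log_pi))

    probs = np.array([0.8, 0.3, 0.6])                  # evaluator's selection probabilities
    sel = np.array([1, 0, 1])                          # sampled selection vector
    print(evaluator_loss(probs, sel, reward=0.05))     # smaller when high-reward selections are likely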
Optionally, the prediction module 301 may be further specifically configured to:
respectively sending model gradient calculation instructions to a plurality of participating terminals so that each participating terminal obtains mechanism private data of the participating terminal according to the model gradient calculation instructions;
training, by each participating terminal, a preset participant distribution model through its mechanism private data, and calculating the parameter gradient of the trained preset participant distribution model through a preset gradient descent algorithm to obtain the feature vector information corresponding to each participating terminal, where the mechanism private data includes at least one of medical private data of a medical institution, financial business private data of a financial institution, and insurance private data of an insurance institution;
and receiving the feature vector information corresponding to each participating end sent by each participating end, and performing selection probability calculation on the feature vector information corresponding to each participating end through a gradient cost function in the preset evaluator to obtain the selection probability corresponding to each participating end.
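The following sketch illustrates the two sides of this exchange: each participating terminal runs a few gradient-descent steps on a local model over its own private data and reports the resulting parameter gradient as its feature vector information, and the server maps that gradient to a selection probability. The least-squares local model, the logistic form of the gradient cost function, and the randomly generated stand-in data are all assumptions made for the example.

    import numpy as np

    rng = np.random.default_rng(2)

    def participant_gradient(X, y, theta, lr=0.1, steps=5):
        # Participating terminal (sketch): gradient descent on a local least-squares model;
        # the parameter gradient after training is the reported feature vector information.
        for _ in range(steps):
            grad = 2 * X.T @ (X @ theta - y) / len(y)
            theta = theta - lr * grad
        return 2 * X.T @ (X @ theta - y) / len(y)

    def gradient_cost_probability(grad, w):
        # Server (sketch): a logistic "gradient cost function" turning a gradient into a selection probability.
        return float(1.0 / (1.0 + np.exp(-grad @ w)))

    theta0, w = np.zeros(3), rng.normal(size=3)
    for pid in ("medical_end", "financial_end", "insurance_end"):
        X, y = rng.normal(size=(20, 3)), rng.normal(size=20)   # stand-in for mechanism private data
        g = participant_gradient(X, y, theta0.copy())
        print(pid, round(gradient_cost_probability(g, w), 3))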
Optionally, the updating module 303 may be further specifically configured to:
acquiring verification set data of the feature vector information, and verifying the verification set data through the updated federal evaluation model to obtain a verification result;
calculating the verification loss value of the verification result and the moving average loss value over a preset period;
And carrying out difference value calculation on the verification loss value and the moving average loss value to obtain a reward value.
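A minimal sketch of this reward computation follows. The sign convention (moving average minus current validation loss, so that an improvement yields a positive reward), the window length, and the running-history structure are assumptions; the text only states that the two loss values are differenced.

    from collections import deque

    def reward_value(val_loss, loss_history, period=5):
        # Moving average of the last `period` validation losses, minus the current one (assumed sign).
        window = list(loss_history)[-period:]
        moving_avg = sum(window) / len(window) if window else val_loss
        loss_history.append(val_loss)
        return moving_avg - val_loss

    history = deque([0.92, 0.88, 0.85], maxlen=50)   # earlier validation losses
    print(reward_value(0.80, history))               # positive reward: the updated model improved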
Optionally, the allocation module 305 may be further specifically configured to:
acquiring the contribution ratio of the participant contribution degree corresponding to each participating end, and judging whether the contribution ratio is larger than a preset threshold;
if the contribution ratio is larger than a preset threshold value, a preset allocation strategy is called, and service allocation is carried out on a plurality of participation terminals to obtain participant service allocation information;
and if the contribution ratio is smaller than or equal to a preset threshold, carrying out service distribution on the plurality of participation terminals according to the contribution ratio to obtain the service distribution information of the participants.
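The threshold branch above can be sketched as follows. The choice of threshold, the proportional split in the non-threshold branch, and the capped-share rule standing in for the preset allocation strategy are all illustrative assumptions; the patent leaves the content of the preset allocation strategy open.

    def allocate_services(contributions, threshold=0.5, preset_strategy=None):
        # Contribution ratios per participating end; the preset strategy is only invoked above the threshold.
        total = sum(contributions.values()) or 1.0
        ratios = {pid: c / total for pid, c in contributions.items()}
        allocation = {}
        for pid, ratio in ratios.items():
            if ratio > threshold and preset_strategy is not None:
                allocation[pid] = preset_strategy(pid, ratio)   # assumed stand-in for the preset allocation strategy
            else:
                allocation[pid] = ratio                         # proportional allocation by contribution ratio
        return allocation

    print(allocate_services({"end_1": 3.0, "end_2": 1.0, "end_3": 1.0},
                            preset_strategy=lambda pid, ratio: min(ratio, 0.5)))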
Optionally, the service allocation device based on reinforcement learning further includes:
the update optimization module 306 is configured to obtain abnormal information of the participant service allocation information, update the participant service allocation information according to the abnormal information, and optimize a determination policy of a selection probability corresponding to each participant.
The function implementation of each module and each unit in the service distribution device based on reinforcement learning corresponds to each step in the service distribution method embodiment based on reinforcement learning, and the function and implementation process of the function implementation are not described in detail herein.
In the embodiment of the invention, the feature vector information based on the mechanism private data of the multiple participating ends is selectively sampled according to the selection probabilities output by the preset evaluator, the model parameters of the preset business evaluation federal model are updated according to the sampling gradient information corresponding to each participating end, the feature vector information is subjected to value evaluation through the preset evaluator and the reward value, and service allocation is performed on the multiple participating ends according to the participant contribution degree corresponding to each participating end. This reduces the computational complexity, improves both the accuracy and the efficiency of the participant contribution evaluation, and further improves the accuracy of service allocation.
The reinforcement learning-based service allocation apparatus in the embodiment of the present invention is described in detail above from the perspective of modularized functional entities with reference to fig. 3 and fig. 4; the reinforcement learning-based service allocation device in the embodiment of the present invention is described in detail below from the perspective of hardware processing.
Fig. 5 is a schematic structural diagram of a reinforcement learning-based service distribution device according to an embodiment of the present invention. The reinforcement learning-based service distribution device 500 may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPU) 510, a memory 520, and one or more storage media 530 (e.g., one or more mass storage devices) storing application programs 533 or data 532. The memory 520 and the storage medium 530 may be transitory or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown), each of which may include a series of instruction operations on the reinforcement learning-based service distribution device 500. Still further, the processor 510 may be configured to communicate with the storage medium 530 to execute the series of instruction operations in the storage medium 530 on the reinforcement learning-based service distribution device 500.
The reinforcement learning-based service distribution device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like. It will be appreciated by those skilled in the art that the structure of the reinforcement learning-based service distribution device shown in fig. 5 does not constitute a limitation of the reinforcement learning-based service distribution device, which may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
The present invention also provides a computer readable storage medium, which may be a non-volatile computer readable storage medium, and which may also be a volatile computer readable storage medium, having stored therein instructions that, when executed on a computer, cause the computer to perform the steps of a reinforcement learning based service allocation method.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created from the use of blockchain nodes, and the like.
The blockchain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. The service distribution method based on reinforcement learning is characterized by comprising the following steps:
acquiring feature vector information based on mechanism private data of a plurality of participating terminals, and predicting the selection probability of the feature vector information through a preset evaluator to obtain the corresponding selection probability of each participating terminal, wherein the preset evaluator is used for evaluating the gradient value of the feature vector information provided by each participating terminal;
sampling the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain sampling gradient information corresponding to each participating terminal;
updating model parameters of a preset business evaluation federal model according to the sampling gradient information corresponding to each participating terminal to obtain an updated federal evaluation model, and calculating a reward value through the updated federal evaluation model, wherein the reward value is used for indicating the accumulated return value of the feature vector information for the updated federal evaluation model;
Performing value evaluation on the feature vector information through the preset evaluator and the reward value to obtain the contribution degree of the participants corresponding to each participant;
and carrying out service distribution on the plurality of participant terminals according to the contribution degrees of the participants corresponding to the participant terminals to obtain participant service distribution information.
2. The reinforcement learning-based service allocation method according to claim 1, wherein the sampling the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal comprises:
calculating the feature vector information through a polynomial distribution-based algorithm in the preset sampler according to the selection probability corresponding to each participating terminal, to obtain a selection vector corresponding to each participating terminal, wherein the feature vector information comprises model gradient information of each participating terminal;
and when the selection vector corresponding to each participation terminal is a preset value, sampling the model gradient information of each participation terminal according to the selection vector corresponding to each participation terminal to obtain sampling gradient information corresponding to each participation terminal.
3. The reinforcement learning-based service allocation method according to claim 2, wherein the performing value evaluation on the feature vector information by the preset evaluator and the reward value to obtain the contribution degree of the participants corresponding to each participant comprises:
carrying out loss function calculation on the reward value and the selection vector through a preset Monte Carlo policy gradient algorithm to obtain a loss function of the preset evaluator;
training the preset evaluator through the loss function of the preset evaluator until the loss function converges to obtain a target evaluator;
and performing value evaluation on the feature vector information through the target evaluator to obtain the contribution degree of the participants corresponding to each participant.
4. The reinforcement learning-based service distribution method according to claim 1, wherein the acquiring feature vector information based on the mechanism private data of the plurality of participating ends, and performing selection probability prediction on the feature vector information through a preset evaluator to obtain the selection probability corresponding to each participating end, comprises:
respectively sending model gradient calculation instructions to a plurality of participating terminals so that each participating terminal obtains mechanism private data of the participating terminal according to the model gradient calculation instructions;
training a preset participant distribution model through the mechanism private data of the participant, and calculating the parameter gradient of the trained preset participant distribution model through a preset gradient descent algorithm to obtain feature vector information corresponding to each participant, wherein the mechanism private data comprises at least one of medical private data of a medical institution, financial business private data of a financial institution and insurance private data of an insurance institution;
And receiving the feature vector information corresponding to each participation terminal sent by each participation terminal, and carrying out selection probability calculation on the feature vector information corresponding to each participation terminal through a gradient cost function in a preset evaluator to obtain the selection probability corresponding to each participation terminal.
5. The reinforcement learning-based service distribution method according to claim 1, wherein said calculating a reward value through said updated federal evaluation model comprises:
acquiring verification set data of the feature vector information, and verifying the verification set data through the updated federal evaluation model to obtain a verification result;
calculating a verification loss value of the verification result and a moving average loss value of a preset period;
and carrying out difference value calculation on the verification loss value and the moving average loss value to obtain a reward value.
6. The reinforcement learning-based service distribution method according to claim 1, wherein the performing service distribution on the plurality of participating terminals according to the contribution degrees of the participants corresponding to the participating terminals to obtain the participant service distribution information includes:
acquiring the contribution ratio of the participant contribution degree corresponding to each participating terminal, and judging whether the contribution ratio is larger than a preset threshold value;
if the contribution ratio is larger than a preset threshold value, a preset allocation strategy is called, and service allocation is carried out on the plurality of participating ends to obtain first participant service allocation information;
and if the contribution ratio is smaller than or equal to a preset threshold, carrying out service distribution on the plurality of participation terminals according to the contribution ratio to obtain second participant service distribution information.
7. The reinforcement learning-based service distribution method according to any one of claims 1 to 6, wherein after the performing service distribution on the plurality of participating ends according to the contribution degrees of the participants corresponding to the participating ends to obtain the participant service distribution information, the method further comprises:
obtaining abnormal information of the participant service allocation information, updating the participant service allocation information according to the abnormal information, and optimizing a determination strategy of the selection probability corresponding to each participant terminal.
8. A reinforcement learning-based service distribution device, characterized in that the reinforcement learning-based service distribution device comprises:
the prediction module is used for acquiring feature vector information based on the mechanism private data of a plurality of participating terminals, and predicting the selection probability of the feature vector information through a preset evaluator to obtain the corresponding selection probability of each participating terminal, wherein the preset evaluator is used for evaluating the gradient value of the feature vector information provided by each participating terminal;
the sampling module is used for sampling the feature vector information through a preset sampler and the selection probability corresponding to each participating end to obtain sampling gradient information corresponding to each participating end;
the updating module is used for updating the model parameters of the preset business evaluation federal model according to the sampling gradient information corresponding to each participating end to obtain an updated federal evaluation model, and calculating a reward value through the updated federal evaluation model, wherein the reward value is used for indicating the accumulated value of the return of the feature vector information to the updated federal evaluation model;
the evaluation module is used for carrying out value evaluation on the feature vector information through the preset evaluator and the rewarding value to obtain the contribution degree of the participants corresponding to each participant;
and the distribution module is used for carrying out service distribution on the plurality of the participation terminals according to the contribution degrees of the participants corresponding to the participation terminals to obtain the service distribution information of the participants.
9. A reinforcement learning-based service distribution apparatus, characterized in that the reinforcement learning-based service distribution apparatus comprises: a memory and at least one processor, the memory having instructions stored therein;
the at least one processor invoking the instructions in the memory to cause the reinforcement learning-based service distribution apparatus to perform the reinforcement learning-based service distribution method according to any one of claims 1 to 7.
10. A computer readable storage medium having instructions stored thereon, which when executed by a processor, implement the reinforcement learning-based service distribution method according to any one of claims 1 to 7.
CN202011298673.1A 2020-11-19 2020-11-19 Service distribution method, device, equipment and storage medium based on reinforcement learning Active CN112381428B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011298673.1A CN112381428B (en) 2020-11-19 2020-11-19 Service distribution method, device, equipment and storage medium based on reinforcement learning
PCT/CN2021/083817 WO2021208720A1 (en) 2020-11-19 2021-03-30 Method and apparatus for service allocation based on reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011298673.1A CN112381428B (en) 2020-11-19 2020-11-19 Service distribution method, device, equipment and storage medium based on reinforcement learning

Publications (2)

Publication Number Publication Date
CN112381428A CN112381428A (en) 2021-02-19
CN112381428B true CN112381428B (en) 2023-09-19

Family

ID=74585220

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011298673.1A Active CN112381428B (en) 2020-11-19 2020-11-19 Service distribution method, device, equipment and storage medium based on reinforcement learning

Country Status (2)

Country Link
CN (1) CN112381428B (en)
WO (1) WO2021208720A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112381428B (en) * 2020-11-19 2023-09-19 平安科技(深圳)有限公司 Service distribution method, device, equipment and storage medium based on reinforcement learning
CN114021464A (en) * 2021-11-09 2022-02-08 京东科技信息技术有限公司 Data processing method, device and storage medium
CN114819197B (en) * 2022-06-27 2023-07-04 杭州同花顺数据开发有限公司 Federal learning method, system, device and storage medium based on blockchain alliance
CN115296927B (en) * 2022-09-28 2023-01-06 山东省计算中心(国家超级计算济南中心) Block chain-based federal learning credible fusion excitation method and system
CN116451593B (en) * 2023-06-14 2023-11-14 北京邮电大学 Reinforced federal learning dynamic sampling method and equipment based on data quality evaluation
CN117076113A (en) * 2023-08-17 2023-11-17 重庆理工大学 Industrial heterogeneous equipment multi-job scheduling method based on federal learning
CN117009095B (en) * 2023-10-07 2024-01-02 湘江实验室 Privacy data processing model generation method, device, terminal equipment and medium
CN117172338B (en) * 2023-11-02 2024-02-02 数据空间研究院 Contribution evaluation method in longitudinal federal learning scene

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6982557B2 (en) * 2018-08-31 2021-12-17 株式会社日立製作所 Reward function generation method and computer system
CN112381428B (en) * 2020-11-19 2023-09-19 平安科技(深圳)有限公司 Service distribution method, device, equipment and storage medium based on reinforcement learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110490335A (en) * 2019-08-07 2019-11-22 深圳前海微众银行股份有限公司 A kind of method and device calculating participant's contribution rate
CN110910158A (en) * 2019-10-08 2020-03-24 深圳逻辑汇科技有限公司 Federal learning revenue allocation method and system
CN110782042A (en) * 2019-10-29 2020-02-11 深圳前海微众银行股份有限公司 Method, device, equipment and medium for combining horizontal federation and vertical federation
CN111611610A (en) * 2020-04-12 2020-09-01 西安电子科技大学 Federal learning information processing method, system, storage medium, program, and terminal

Also Published As

Publication number Publication date
CN112381428A (en) 2021-02-19
WO2021208720A1 (en) 2021-10-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant