WO2021208720A1 - Method and apparatus for service allocation based on reinforcement learning - Google Patents

Method and apparatus for service allocation based on reinforcement learning Download PDF

Info

Publication number
WO2021208720A1
WO2021208720A1 · PCT/CN2021/083817 · CN2021083817W
Authority
WO
WIPO (PCT)
Prior art keywords
preset
participant
feature vector
participating terminal
value
Prior art date
Application number
PCT/CN2021/083817
Other languages
French (fr)
Chinese (zh)
Inventor
朱星华
王健宗
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021208720A1 publication Critical patent/WO2021208720A1/en

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0631Resource planning, allocation, distributing or scheduling for enterprises or organisations
    • G06Q10/06312Adjustment or analysis of established resource schedule, e.g. resource or task levelling, or dynamic rescheduling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning

Definitions

  • This application relates to the machine learning field of artificial intelligence, and in particular to a reinforcement-learning-based business allocation method, apparatus, device, and storage medium.
  • The inventor realized that, because the quality of the business information provided by different sub-enterprises varies, assessing each sub-enterprise's business contribution solely by the volume of business information it provides is not only unconvincing but also encourages a sub-enterprise, under the goal of maximizing its own interests, to obtain a large business contribution score with large amounts of low-quality business information. This has a strongly negative effect on the accuracy of the overall business allocation, resulting in low allocation accuracy.
  • This application provides a reinforcement-learning-based business allocation method, apparatus, device, and storage medium, which are used to improve the accuracy of business allocation.
  • The first aspect of this application provides a reinforcement-learning-based business allocation method, including:
  • obtaining feature vector information based on the institutional private data of multiple participating terminals, and predicting a selection probability for the feature vector information through a preset evaluator to obtain the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;
  • sampling the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal;
  • updating the model parameters of a preset business evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain an updated federated evaluation model, and calculating a reward value through the updated federated evaluation model, where the reward value indicates the cumulative return of the feature vector information to the updated federated evaluation model;
  • evaluating the value of the feature vector information through the preset evaluator and the reward value to obtain the participant contribution degree corresponding to each participating terminal; and
  • allocating business to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant business allocation information.
  • The second aspect of this application provides a reinforcement-learning-based business allocation device, including a memory, a processor, and a reinforcement-learning-based business allocation program stored in the memory and executable on the processor, where the processor implements the following steps when executing the program:
  • obtaining feature vector information based on the institutional private data of multiple participating terminals, and predicting a selection probability for the feature vector information through a preset evaluator to obtain the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;
  • sampling the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal;
  • updating the model parameters of a preset business evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain an updated federated evaluation model, and calculating a reward value through the updated federated evaluation model, where the reward value indicates the cumulative return of the feature vector information to the updated federated evaluation model;
  • evaluating the value of the feature vector information through the preset evaluator and the reward value to obtain the participant contribution degree corresponding to each participating terminal; and
  • allocating business to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant business allocation information.
  • The third aspect of this application provides a computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to execute the following steps:
  • obtaining feature vector information based on the institutional private data of multiple participating terminals, and predicting a selection probability for the feature vector information through a preset evaluator to obtain the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;
  • sampling the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal;
  • updating the model parameters of a preset business evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain an updated federated evaluation model, and calculating a reward value through the updated federated evaluation model, where the reward value indicates the cumulative return of the feature vector information to the updated federated evaluation model;
  • evaluating the value of the feature vector information through the preset evaluator and the reward value to obtain the participant contribution degree corresponding to each participating terminal; and
  • allocating business to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant business allocation information.
  • The fourth aspect of this application provides a reinforcement-learning-based business allocation apparatus, including:
  • a prediction module, configured to obtain feature vector information based on the institutional private data of multiple participating terminals and to predict a selection probability for the feature vector information through a preset evaluator, obtaining the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;
  • a sampling module, configured to sample the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal, obtaining the sampling gradient information corresponding to each participating terminal;
  • an update module, configured to update the model parameters of a preset business evaluation federated model according to the sampling gradient information corresponding to each participating terminal, obtaining an updated federated evaluation model, and to calculate a reward value through the updated federated evaluation model, where the reward value indicates the cumulative return of the feature vector information to the updated federated evaluation model;
  • an evaluation module, configured to evaluate the value of the feature vector information through the preset evaluator and the reward value, obtaining the participant contribution degree corresponding to each participating terminal; and
  • an allocation module, configured to allocate business to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal, obtaining the participant business allocation information.
  • In the technical solution provided by this application, the feature vector information based on the institutional private data of the multiple participating terminals is selectively adopted according to the selection probability output by the preset evaluator; the model parameters of the preset business evaluation federated model are updated according to the sampling gradient information corresponding to each participating terminal; the value of the feature vector information is evaluated through the preset evaluator and the reward value; and business is allocated to the multiple participating terminals accordingly. This reduces computational complexity, improves the accuracy and efficiency of participant contribution evaluation, and thereby improves the accuracy of business allocation.
  • FIG. 1 is a schematic diagram of an embodiment of the reinforcement-learning-based business allocation method in an embodiment of this application;
  • FIG. 2 is a schematic diagram of another embodiment of the reinforcement-learning-based business allocation method in an embodiment of this application;
  • FIG. 3 is a schematic diagram of an embodiment of the reinforcement-learning-based business allocation apparatus in an embodiment of this application;
  • FIG. 4 is a schematic diagram of another embodiment of the reinforcement-learning-based business allocation apparatus in an embodiment of this application;
  • FIG. 5 is a schematic diagram of an embodiment of the reinforcement-learning-based business allocation device in an embodiment of this application.
  • The embodiments of this application provide a reinforcement-learning-based business allocation method, apparatus, device, and storage medium, which improve the accuracy of business allocation.
  • An embodiment of the reinforcement-learning-based business allocation method in the embodiments of this application includes the following steps.
  • The execution subject of this application may be a reinforcement-learning-based business allocation apparatus, or a terminal or a server; it is not specifically limited here. The embodiments of this application are described with the server as the execution subject.
  • The multiple participating terminals include, but are not limited to, terminals or servers corresponding to medical institutions, insurance institutions, and financial institutions, for example, terminals or servers corresponding to units within financial institutions, banks, or other finance-related institutions or centers.
  • The multiple participating terminals can correspond to institutions of the same type (for example, all to financial institutions) or of different types; for example, with three participating terminals, one may correspond to a financial institution, another to an insurance institution, and the third to a medical institution.
  • Institutional private data is the non-shared, encrypted private data of a participating terminal, for example, the private medical data of medical institutions or the private data of the various financial businesses of financial institutions.
  • The feature vector information can be the gradient information of the model parameters obtained when each participating terminal performs gradient descent on a model using its institutional private data, or the proportion of each participating terminal's institutional private data, within the private data of all institutions, used for processing a certain business.
  • It can be understood that the server constructs the preset business evaluation federated model in advance based on the private data of the multiple participating institutions. The server receives the feature vector information based on institutional private data sent by the multiple participating terminals and calls the preset evaluator.
  • The preset evaluator is a deep neural network that includes a proportional selection algorithm. The fitness of each piece of feature vector information is calculated through the proportional selection algorithm, and from the fitness the cumulative probability of each piece of feature vector information being inherited to the next-generation population is calculated. A uniformly distributed random number in the interval [0, 1] is generated, feature vector information is selected according to the random number, and the probability values of the feature vector information selected for each participating terminal are normalized to obtain the selection probability corresponding to each participating terminal, as sketched below.
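  • As an illustration only, the following Python sketch shows one way such a proportional (roulette-wheel) selection step can work; the fitness scores and the function name are hypothetical, since the text does not fix how fitness is computed.

```python
import numpy as np

def roulette_select(fitness, rng=None):
    """A minimal sketch of proportional (roulette-wheel) selection:
    normalize hypothetical fitness scores, accumulate them, and select
    with a uniformly distributed random number in [0, 1]."""
    rng = rng or np.random.default_rng()
    fitness = np.asarray(fitness, dtype=float)
    probs = fitness / fitness.sum()        # per-item selection probability
    cumulative = np.cumsum(probs)          # cumulative inheritance probability
    r = rng.uniform(0.0, 1.0)              # uniform random number in [0, 1]
    return probs, int(np.searchsorted(cumulative, r))

probs, chosen = roulette_select([0.8, 2.4, 4.0, 3.3])
```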
  • The preset evaluator is used to evaluate the gradient value provided by each participating terminal. The gradient value is the degree to which the gradient information contributes to model training, or the degree to which each participating terminal's institutional private data contributes, within the preset business evaluation federated model, to a certain business processing direction over the private data of all institutions.
  • In one embodiment, the feature vector information is gradient information, and the preset sampler can sample it in several ways, as sketched below. The server can call a deterministic algorithm in the preset sampler to generate a pseudo-random number sequence in [0, 1] for the gradient information and sample the gradient information randomly according to that sequence, obtaining the sampling gradient information corresponding to each participating terminal. The server can instead call the preset sampler to split all gradient information, or each participating terminal's gradient information, into multiple segments according to its time order and extract data from each segment according to the selection probability corresponding to each participating terminal. The server can also call the preset sampler to classify all gradient information into one or more preset categories, randomly extract gradient information from each category according to each participating terminal's selection probability, and combine the extracted gradient information to obtain the sampling gradient information corresponding to each participating terminal.
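  • A minimal Python sketch of the category-based (stratified) variant follows; the random category assignment and the array shapes are assumptions made for the example.

```python
import numpy as np

def stratified_sample(gradients, select_prob, num_categories=4, seed=0):
    """A minimal sketch: classify gradient records into categories, draw
    from each category in proportion to the participant's selection
    probability, and recombine the draws."""
    rng = np.random.default_rng(seed)
    categories = rng.integers(0, num_categories, size=len(gradients))
    sampled = []
    for c in range(num_categories):
        idx = np.flatnonzero(categories == c)
        if len(idx) == 0:
            continue
        k = max(1, int(round(select_prob * len(idx))))
        sampled.append(gradients[rng.choice(idx, size=k, replace=False)])
    return np.concatenate(sampled)

sampled = stratified_sample(np.random.randn(100, 8), select_prob=0.3)
```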
  • In one embodiment, the server normalizes the sampling gradient information corresponding to each participating terminal to obtain comprehensive gradient information, and continuously adjusts and updates the model parameters of the preset business evaluation federated model according to the comprehensive gradient information, thereby obtaining the updated federated evaluation model.
  • The server can also use a preset attention mechanism to perform an attention calculation on the sampling gradient information corresponding to each participating terminal, obtaining attention gradient information corresponding to each participating terminal, and then adjust and update the model parameters of the preset business evaluation federated model according to that attention gradient information. This preserves the characteristics of all the sampling gradient information while weighting the parameter update to a certain degree, yielding the updated federated evaluation model; a sketch follows.
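  • The sketch below shows one plausible form of this aggregation-and-update step; the softmax attention weighting and the learning rate are assumptions, since the text does not specify the attention mechanism.

```python
import numpy as np

def federated_update(params, participant_grads, attention_scores, lr=0.01):
    """A minimal sketch: weight each participant's sampled gradient with
    softmax attention, aggregate into a comprehensive gradient, and take
    one gradient-descent step on the federated model parameters."""
    w = np.exp(attention_scores - np.max(attention_scores))
    w /= w.sum()                                            # attention weights
    comprehensive = sum(wi * g for wi, g in zip(w, participant_grads))
    return params - lr * comprehensive                      # updated parameters
```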
  • In one embodiment, the server receives the institutional private data of the multiple participating terminals, divides it into an institutional data verification set according to a preset ratio, and obtains the sampling gradient information corresponding to the verification set through the sampling process described above, thereby obtaining the verification set data of the feature vector information. The updated federated evaluation model is verified with the verification set data to obtain a verified federated evaluation model; the verification loss value of that model and the moving-average loss value over a preset time period are calculated, and the difference between the verification loss value and the moving-average loss value gives the reward value. The reward value can be the total return of the preset business evaluation federated model during training (which may include a revenue signal or an information gain); it includes positive and negative reward values and is used to obtain the optimal strategy and/or the best path.
  • The reward value can be the cumulative return of the best strategy and/or best path with which the preset business evaluation federated model processes the private data of all institutions for a certain business, for example, the cumulative return of the best strategy for allocating insurance order amount data items.
  • In one embodiment, the server obtains a business contribution influence factor and evaluates the business contribution value of the feature vector information through the preset evaluator, the reward value, and the business contribution influence factor, obtaining the participant contribution degree corresponding to each participating terminal.
  • The participant contribution degree can measure the contribution of a participating institution's private data to the training of the preset business evaluation federated model, or its contribution to the model's effective and accurate processing of the private data of all institutions.
  • The business contribution influence factor is a factor used in calculating the business contribution degree. For example, when the business is training an insurance order amount prediction model based on federated learning, the influence factors include the accuracy of the insurance orders, the information provided, and the importance of the business type.
  • Business is allocated to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal, obtaining the participant business allocation information.
  • The server can judge whether the participant contribution degree corresponding to each participating terminal is less than a preset target value. If so, the participating terminal is removed and does not take part in the business allocation; if not, business is allocated to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal, obtaining the participant business allocation information (see the sketch below). The participant business allocation information may be income distribution, reward distribution, and/or a priority setting based on the contribution of the institutional private data.
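  • A minimal sketch of this judge-and-allocate step, assuming a hypothetical preset target value and a simple proportional split:

```python
def allocate(contributions, total, target=0.05):
    """A minimal sketch: remove participants whose contribution degree is
    below the preset target value, then split `total` among the rest in
    proportion to their contribution degrees."""
    kept = {p: c for p, c in contributions.items() if c >= target}
    if not kept:
        return {p: 0.0 for p in contributions}
    s = sum(kept.values())
    return {p: total * kept[p] / s if p in kept else 0.0 for p in contributions}
```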
  • In the embodiment of this application, the feature vector information based on the institutional private data of the multiple participating terminals is selectively adopted according to the selection probability output by the preset evaluator; the model parameters of the preset business evaluation federated model are updated according to the sampling gradient information corresponding to each participating terminal; the value of the feature vector information is evaluated through the preset evaluator and the reward value; and business is allocated to the multiple participating terminals accordingly. This reduces computational complexity, improves the accuracy and efficiency of participant contribution evaluation, and further improves the accuracy of business allocation.
  • Another embodiment of the reinforcement-learning-based business allocation method in the embodiments of this application includes the following steps.
  • Obtain feature vector information based on the institutional private data of multiple participating terminals, and predict a selection probability for the feature vector information through the preset evaluator to obtain the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal.
  • Specifically, the server sends a model gradient calculation instruction to each of the multiple participating terminals, so that each participating terminal obtains its institutional private data according to the instruction, trains a preset participant allocation model with that data, and calculates the parameter gradient of the trained model through a preset gradient descent algorithm, obtaining the feature vector information corresponding to each participating terminal.
  • The institutional private data includes at least one of the private medical data of medical institutions, the private financial business data of financial institutions, and the private insurance data of insurance institutions. The server receives the feature vector information sent by each participating terminal and calculates a selection probability for it through the gradient value function in the preset evaluator, obtaining the selection probability corresponding to each participating terminal, as sketched below.
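  • A minimal sketch of such a gradient value function; the one-layer sigmoid form and the toy dimensions are assumptions, since the text does not specify the function's architecture.

```python
import numpy as np

def h(gradient_vec, W, b):
    """A minimal sketch of a gradient value function: a hypothetical
    one-layer network that maps a participant's gradient vector to a
    selection probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-(W @ gradient_vec + b)))   # sigmoid output

W, b = np.random.randn(4), 0.0            # toy parameters for a 4-dim gradient
prob = h(np.random.randn(4), W, b)        # selection probability for one terminal
```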
  • For example, suppose the multiple participating terminals are terminal 1 corresponding to financial institution 1, terminal 2 corresponding to financial institution 2, and terminal 3 corresponding to financial institution 3, and the server sends model gradient calculation instructions to terminals 1, 2, and 3.
  • Terminal 1 extracts the corresponding financial business private data 1 from its database according to the model gradient calculation instruction, inputs it into the preset participant allocation model 1, performs business allocation processing (that is, training) on the data through the model, and calculates the parameter gradient of the trained model through the preset gradient descent algorithm, obtaining the feature vector information 1 corresponding to terminal 1. Terminal 1 then sends feature vector information 1 to the server; a sketch of this participant-side step follows.
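  • The sketch below illustrates the participant-side step for a hypothetical linear participant allocation model trained with mean-squared error; the model form, loss, and hyperparameters are assumptions.

```python
import numpy as np

def local_feature_vector(w, X, y, lr=0.1, steps=5):
    """A minimal sketch: train the local allocation model on the
    institutional private data (X, y) by gradient descent, then return
    the final parameter gradient as the feature vector information."""
    def grad(w):
        return 2 * X.T @ (X @ w - y) / len(y)   # mean-squared-error gradient
    for _ in range(steps):
        w = w - lr * grad(w)                    # local training step
    return grad(w)                              # sent to the server

fv = local_feature_vector(np.zeros(8), np.random.randn(50, 8), np.random.randn(50))
```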
  • Through the multinomial distribution algorithm in the preset sampler, a selection vector corresponding to each participating terminal is computed from the feature vector information according to the selection probability corresponding to each participating terminal; the feature vector information includes the model gradient information of each participating terminal.
  • When the selection vector corresponding to a participating terminal equals a preset value (preferably 1), the model gradient information of that participating terminal is sampled randomly, systematically, or in a stratified manner according to the selection vector, obtaining the sampling gradient information corresponding to that participating terminal; otherwise the model gradient information is not sampled, and the above process of obtaining the selection vectors is repeated. A sketch of this selection-vector step follows.
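  • A minimal sketch: each participating terminal's 0/1 selection vector is drawn from its selection probability (a Bernoulli draw, the two-outcome case of a multinomial); the terminal names are illustrative.

```python
import numpy as np

def selection_vectors(select_probs, seed=0):
    """A minimal sketch: draw a 0/1 selection vector per participating
    terminal; model gradients are sampled only where the vector is 1."""
    rng = np.random.default_rng(seed)
    return {p: int(rng.binomial(1, q)) for p, q in select_probs.items()}

s = selection_vectors({"terminal_1": 0.9, "terminal_2": 0.4, "terminal_3": 0.7})
```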
  • According to the sampling gradient information corresponding to each participating terminal, the model parameters of the preset business evaluation federated model are updated to obtain the updated federated evaluation model, and the reward value is calculated through the updated federated evaluation model; the reward value indicates the cumulative return of the feature vector information to the updated federated evaluation model.
  • Specifically, the server normalizes the sampling gradient information corresponding to each participating terminal to obtain comprehensive gradient information, and continuously adjusts and updates the model parameters of the preset business evaluation federated model according to the comprehensive gradient information, thereby obtaining the updated federated evaluation model.
  • The server can also use the preset attention mechanism to perform an attention calculation on the sampling gradient information corresponding to each participating terminal, obtaining the attention gradient information corresponding to each participating terminal, and then adjust and update the model parameters of the preset business evaluation federated model according to that attention gradient information. This preserves the characteristics of all the sampling gradient information while weighting the parameter update to a certain degree, yielding the updated federated evaluation model.
  • The server obtains the verification set data of the feature vector information, verifies the verification set data through the updated federated evaluation model to obtain a verification result, calculates the verification loss value of the verification result and the moving-average loss value of the preset time period, and calculates the difference between the verification loss value and the moving-average loss value to obtain the reward value.
  • Specifically, when the server receives the feature vector information of the institutional private data of the multiple participating terminals, it divides the feature vector information into verification set data according to a preset ratio, verifies the verification set data through the updated federated evaluation model to obtain the verification result, and calculates the verification loss value of the verification result through a preset loss value calculation formula of the form

$$l_v = \frac{1}{M}\sum_{k=1}^{M} \ell\big(f_\theta(x_k^v),\, y_k^v\big)$$

  where $l_v$ indicates the verification loss value, the superscript $v$ indicates that a data item belongs to the verification set (that is, to the verification result), $M$ indicates the number of data items in the verification set data, $k$ indexes the $k$-th data item, $f_\theta$ represents the updated federated evaluation model, $x$ represents the input data (the verification set data), and $y$ represents the corresponding label of the verification set data.
  • The server calculates the moving-average loss value for the preset time period through a preset formula of the form

$$\delta = \frac{1}{T}\sum_{t=1}^{T} l_v^{(t)}$$

  where $l_v$ represents the verification loss value, $T$ represents the moving-average window length of the preset time period, and $\delta$ represents the moving-average benchmark of the preset time period. The verification loss value is subtracted from the moving-average loss value to obtain the reward value, $r = \delta - l_v$, as sketched below.
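  • A minimal sketch of this reward computation, assuming a hypothetical window length:

```python
from collections import deque

class RewardTracker:
    """A minimal sketch: the reward is the moving-average loss benchmark
    minus the current verification loss, so a verification loss below the
    recent average yields a positive reward."""
    def __init__(self, window=10):
        self.losses = deque(maxlen=window)          # window length T

    def reward(self, verification_loss):
        benchmark = (sum(self.losses) / len(self.losses)
                     if self.losses else verification_loss)
        self.losses.append(verification_loss)
        return benchmark - verification_loss        # r = delta - l_v
```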
  • The value of the feature vector information is evaluated through the preset evaluator and the reward value, obtaining the participant contribution degree corresponding to each participating terminal.
  • Specifically, the server calculates the loss function of the preset evaluator from the reward value and the selection vectors through a preset Monte Carlo policy gradient algorithm; trains the preset evaluator through that loss function until the loss function converges, obtaining the target evaluator; and evaluates the value of the feature vector information through the target evaluator, obtaining the participant contribution degree corresponding to each participating terminal.
  • The server calculates the loss function of the preset evaluator from the reward value and the selection vectors through the calculation formula of the Monte Carlo policy gradient (REINFORCE) algorithm, which takes the form

$$l_h = -\, r \cdot \frac{1}{N}\sum_{i=1}^{N}\Big[s_i \log h_\phi(g_i) + (1 - s_i)\log\big(1 - h_\phi(g_i)\big)\Big]$$

  where $r$ represents the reward value, $N$ represents the number of selection vectors, $i$ indexes the $i$-th selection vector, $s_i$ represents the selection vector, $g_i$ represents the sampling gradient information, and $h_\phi(g_i)$ represents the gradient value function.
  • The server updates the trainable parameter $\phi$ of the preset evaluator through the loss function of the preset evaluator, thereby training the preset evaluator and obtaining the target evaluator. The update formula for the trainable parameter takes the form

$$\phi \leftarrow \phi - \alpha \nabla_\phi l_h$$

  where $\alpha$ represents the learning rate and $l_h$ represents the loss function of the preset evaluator; a sketch of one such update follows.
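  • A minimal sketch of one REINFORCE-style update of the evaluator parameters, assuming hypothetical callables `h` (the gradient value function) and `grad_h` (its derivative with respect to the parameters):

```python
import numpy as np

def reinforce_update(phi, grads, selections, r, h, grad_h, alpha=0.01):
    """A minimal sketch: accumulate the policy gradient of the selection
    log-likelihood, scale by the reward, and descend the evaluator loss."""
    dphi = np.zeros_like(phi)
    for g_i, s_i in zip(grads, selections):
        p = h(phi, g_i)                              # selection probability
        # d/dphi [s*log p + (1-s)*log(1-p)] = (s - p)/(p*(1-p)) * dp/dphi
        dphi += (s_i - p) / (p * (1.0 - p)) * grad_h(phi, g_i)
    grad_loss = -r * dphi / len(grads)               # gradient of l_h
    return phi - alpha * grad_loss                   # phi <- phi - alpha * grad
```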
  • The server evaluates the value of the feature vector information through the target evaluator, obtaining the participant contribution degree corresponding to each participating terminal.
  • Business is allocated to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal, obtaining the participant business allocation information.
  • Specifically, the server obtains the contribution percentage value of the participant contribution degree corresponding to each participating terminal and judges whether the contribution percentage value is greater than a preset threshold. If it is, the preset distribution strategy is invoked to allocate business to the multiple participating terminals, obtaining the participant business allocation information; if the contribution percentage value is less than or equal to the preset threshold, business is allocated to the multiple participating terminals according to the contribution percentage value, obtaining the participant business allocation information.
  • The server judges whether the participant contribution degree corresponding to each participating terminal is less than a preset contribution degree. If so, the participant contribution degrees below the preset contribution degree are eliminated, and the ratio of each remaining participating terminal's contribution degree to the sum of the remaining contribution degrees is calculated, giving the contribution percentage value of each participating terminal; if not, the ratio of each participating terminal's contribution degree to the total contribution degree is calculated, giving the contribution percentage value of each participating terminal.
  • For example, take the business allocation to be remuneration allocation, so that the participant business allocation information is the participants' remuneration allocation information. Suppose the participant contribution degrees are 0.03 (participating terminal 1), 0.24 (participating terminal 2), 0.40 (participating terminal 3), and 0.33 (participating terminal 4); the preset contribution degree is 0.20; the preset threshold is 0.40; and the total remuneration to be distributed is 1,000,000. The contribution of 0.03 is then eliminated, giving participant business allocation information 1 (participating terminal 1 is paid 0 yuan), and the remuneration of the remaining terminals is allocated according to 0.24 (participating terminal 2), 0.40 (participating terminal 3), and 0.33 (participating terminal 4). Under the preset distribution strategy, in addition to the amount distributed from the total according to the contribution percentage value, an extra amount is allocated, for example an additional 100,000. A sketch of this computation follows.
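  • A minimal sketch of the worked example above; the bonus rule (an extra amount when a contribution percentage value exceeds the preset threshold) is one plausible reading of the strategy described.

```python
def remuneration(contribs, total=1_000_000, min_contrib=0.20,
                 threshold=0.40, bonus=100_000):
    """A minimal sketch: drop contributions below the preset contribution
    degree, pay the rest proportionally, and add a bonus where the
    contribution percentage value exceeds the preset threshold."""
    kept = {p: c for p, c in contribs.items() if c >= min_contrib}
    shares = {p: c / sum(kept.values()) for p, c in kept.items()}
    pay = {p: 0.0 for p in contribs}
    for p, share in shares.items():
        pay[p] = total * share + (bonus if share > threshold else 0.0)
    return pay

print(remuneration({"t1": 0.03, "t2": 0.24, "t3": 0.40, "t4": 0.33}))
```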
  • After the server allocates business to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal and obtains the participant business allocation information, it also obtains abnormal information of the participant business allocation information, updates the participant business allocation information according to the abnormal information, and optimizes the strategy for determining the selection probability corresponding to each participating terminal.
  • Specifically, the server encrypts the participant business allocation information and sends it to the auditing side, which decrypts and audits it. If abnormal information exists in the participant business allocation information, the abnormal information is fed back to the server, and the server matches a corresponding optimization mechanism to the abnormal information.
  • The optimization mechanism includes optimization algorithms, optimization strategies, optimized execution processes, and optimized execution scripts.
  • The abnormal information in the participant business allocation information is corrected (updated) through the optimization mechanism, and the strategy for determining the selection probability corresponding to each participating terminal is likewise optimized through it. The determination strategy includes the model selection, model calculation, and feature vector information selection used for the selection probability corresponding to each participating terminal. Updating and optimizing this determination strategy improves the convenience, accuracy, and efficiency of the calculations in the execution of the reinforcement-learning-based business allocation method, thereby improving the accuracy of business allocation.
  • In the embodiment of this application, the feature vector information based on the institutional private data of the multiple participating terminals is selectively adopted according to the selection probability output by the preset evaluator; the model parameters of the preset business evaluation federated model are updated according to the sampling gradient information corresponding to each participating terminal; the value of the feature vector information is evaluated through the preset evaluator and the reward value; and business is allocated to the multiple participating terminals accordingly. This reduces computational complexity, improves the accuracy and efficiency of participant contribution evaluation, and further improves the accuracy of business allocation.
  • Referring to FIG. 3, an embodiment of the reinforcement-learning-based business allocation apparatus in the embodiments of this application includes:
  • the prediction module 301, configured to obtain feature vector information based on the institutional private data of multiple participating terminals and to predict a selection probability for the feature vector information through a preset evaluator, obtaining the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;
  • the sampling module 302, configured to sample the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal, obtaining the sampling gradient information corresponding to each participating terminal;
  • the update module 303, configured to update the model parameters of the preset business evaluation federated model according to the sampling gradient information corresponding to each participating terminal, obtaining the updated federated evaluation model, and to calculate the reward value through the updated federated evaluation model, where the reward value indicates the cumulative return of the feature vector information to the updated federated evaluation model;
  • the evaluation module 304, configured to evaluate the value of the feature vector information through the preset evaluator and the reward value, obtaining the participant contribution degree corresponding to each participating terminal;
  • the allocation module 305, configured to allocate business to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal, obtaining the participant business allocation information.
  • Each module in the above reinforcement-learning-based business allocation apparatus corresponds to a step in the above embodiment of the reinforcement-learning-based business allocation method; its functions and implementation are not repeated here.
  • In the embodiment of this application, the feature vector information based on the institutional private data of the multiple participating terminals is selectively adopted according to the selection probability output by the preset evaluator; the model parameters of the preset business evaluation federated model are updated according to the sampling gradient information corresponding to each participating terminal; the value of the feature vector information is evaluated through the preset evaluator and the reward value; and business is allocated to the multiple participating terminals accordingly. This reduces computational complexity, improves the accuracy and efficiency of participant contribution evaluation, and further improves the accuracy of business allocation.
  • Referring to FIG. 4, another embodiment of the reinforcement-learning-based business allocation apparatus in the embodiments of this application includes:
  • the prediction module 301, configured to obtain feature vector information based on the institutional private data of multiple participating terminals and to predict a selection probability for the feature vector information through a preset evaluator, obtaining the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;
  • the sampling module 302, configured to sample the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal, obtaining the sampling gradient information corresponding to each participating terminal;
  • The sampling module 302 specifically includes:
  • the calculation unit 3021, configured to compute, through the multinomial distribution algorithm in the preset sampler, a selection vector corresponding to each participating terminal from the feature vector information according to the selection probability corresponding to each participating terminal, where the feature vector information includes the model gradient information of each participating terminal;
  • the sampling unit 3022, configured to sample the model gradient information of each participating terminal according to its selection vector when that selection vector equals a preset value, obtaining the sampling gradient information corresponding to each participating terminal;
  • the update module 303, configured to update the model parameters of the preset business evaluation federated model according to the sampling gradient information corresponding to each participating terminal, obtaining the updated federated evaluation model, and to calculate the reward value through the updated federated evaluation model, where the reward value indicates the cumulative return of the feature vector information to the updated federated evaluation model;
  • the evaluation module 304, configured to evaluate the value of the feature vector information through the preset evaluator and the reward value, obtaining the participant contribution degree corresponding to each participating terminal;
  • the allocation module 305, configured to allocate business to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal, obtaining the participant business allocation information.
  • Optionally, the evaluation module 304 can also be specifically configured to: calculate the loss function of the preset evaluator from the reward value and the selection vectors through the preset Monte Carlo policy gradient algorithm; train the preset evaluator through that loss function until the loss function converges, obtaining the target evaluator; and evaluate the value of the feature vector information through the target evaluator, obtaining the participant contribution degree corresponding to each participating terminal.
  • Optionally, the prediction module 301 can also be specifically configured to: send a model gradient calculation instruction to each of the multiple participating terminals, so that each participating terminal obtains its institutional private data according to the instruction, trains the preset participant allocation model with that data, and calculates the parameter gradient of the trained model through the preset gradient descent algorithm, obtaining the feature vector information corresponding to each participating terminal, where the institutional private data includes at least one of the private medical data of medical institutions, the private financial business data of financial institutions, and the private insurance data of insurance institutions; and to receive the feature vector information corresponding to each participating terminal and calculate its selection probability through the gradient value function in the preset evaluator, obtaining the selection probability corresponding to each participating terminal.
  • Optionally, the update module 303 can also be specifically configured to: obtain the verification set data of the feature vector information; verify the verification set data through the updated federated evaluation model to obtain the verification result; calculate the verification loss value of the verification result and the moving-average loss value of the preset time period; and calculate the difference between the verification loss value and the moving-average loss value to obtain the reward value.
  • Optionally, the allocation module 305 can also be specifically configured to: obtain the contribution percentage value of the participant contribution degree corresponding to each participating terminal and judge whether it is greater than the preset threshold; if the contribution percentage value is greater than the preset threshold, invoke the preset distribution strategy to allocate business to the multiple participating terminals, obtaining the participant business allocation information; and if the contribution percentage value is less than or equal to the preset threshold, allocate business to the multiple participating terminals according to the contribution percentage value, obtaining the participant business allocation information.
  • Optionally, the reinforcement-learning-based business allocation apparatus further includes:
  • the update optimization module 306, configured to obtain abnormal information of the participant business allocation information, update the participant business allocation information according to the abnormal information, and optimize the strategy for determining the selection probability corresponding to each participating terminal.
  • Each module and unit in the above reinforcement-learning-based business allocation apparatus corresponds to a step in the above embodiment of the reinforcement-learning-based business allocation method; their functions and implementation are not repeated here.
  • In the embodiment of this application, the feature vector information based on the institutional private data of the multiple participating terminals is selectively adopted according to the selection probability output by the preset evaluator; the model parameters of the preset business evaluation federated model are updated according to the sampling gradient information corresponding to each participating terminal; the value of the feature vector information is evaluated through the preset evaluator and the reward value; and business is allocated to the multiple participating terminals accordingly. This reduces computational complexity, improves the accuracy and efficiency of participant contribution evaluation, and further improves the accuracy of business allocation.
  • FIG. 5 is a schematic structural diagram of a reinforcement-learning-based business allocation device provided by an embodiment of this application. The reinforcement-learning-based business allocation device 500 may differ considerably depending on configuration or performance, and may include one or more processors (central processing units, CPU) 510 (for example, one or more processors), a memory 520, and one or more storage media 530 (for example, one or more mass storage devices) storing application programs 533 or data 532.
  • The memory 520 and the storage medium 530 may be short-term storage or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the business allocation device 500. The processor 510 may be configured to communicate with the storage medium 530 and execute the series of instruction operations in the storage medium 530 on the business allocation device 500.
  • The reinforcement-learning-based business allocation device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, for example, Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on.
  • A person skilled in the art can understand that the structure shown in FIG. 5 does not constitute a limitation on the reinforcement-learning-based business allocation device, which may include more or fewer components than shown in the figure, combine certain components, or arrange the components differently.
  • The computer-readable storage medium may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. Instructions are stored in the computer-readable storage medium, and when the instructions are run on a computer, the computer executes the steps of the reinforcement-learning-based business allocation method.
  • The computer-readable storage medium may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function, and so on, and the data storage area may store data created according to the use of blockchain nodes, and so on.
  • The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks associated by cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block. The blockchain can include the underlying blockchain platform, the platform product service layer, and the application service layer.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part contributing to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods in the various embodiments of this application.
  • The aforementioned storage media include media that can store program code, such as USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, and optical disks.

Landscapes

  • Business, Economics & Management (AREA)
  • Engineering & Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Economics (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Physics & Mathematics (AREA)
  • Strategic Management (AREA)
  • Physics & Mathematics (AREA)
  • Tourism & Hospitality (AREA)
  • Data Mining & Analysis (AREA)
  • Operations Research (AREA)
  • Marketing (AREA)
  • General Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Educational Administration (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Development Economics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A method and an apparatus for service allocation based on reinforcement learning, a device, and a storage medium, relating to the technical field of artificial intelligence, and used to improve the accuracy of service allocation. The method for service allocation based on reinforcement learning comprises: performing selection probability prediction on feature vector information that is based on the private institutional data of a plurality of participants to obtain a selection probability; by means of the selection probability, sampling the feature vector information to obtain sample gradient information; on the basis of the sample gradient information, updating model parameters of a pre-set federated service evaluation model to obtain an updated federated evaluation model, and by means of the updated federated evaluation model, calculating a reward value; by means of a pre-set evaluator and the reward value, performing value evaluation on the feature vector information to obtain participant contributions; and on the basis of the participant contributions, performing service allocation to the plurality of participants to obtain participant service allocation information. In addition, the present method further relates to blockchain technology, and the private institutional data can be stored in a blockchain.

Description

Reinforcement-learning-based business allocation method, apparatus, device, and storage medium

This application claims priority to the Chinese patent application filed with the Chinese Patent Office on November 19, 2020, with application number 202011298673.1 and the invention title "Reinforcement-learning-based business allocation method, apparatus, device and storage medium", the entire contents of which are incorporated into this application by reference.

Technical field

This application relates to the machine learning field of artificial intelligence, and in particular to a reinforcement-learning-based business allocation method, apparatus, device, and storage medium.

Background

With the development of Internet technology and business, the management, control, and invocation of business information has become a key focus of enterprises and institutions; for example, business is allocated to each sub-enterprise according to the business information provided by all the sub-enterprises belonging to an enterprise.

At present, in order to better allocate business among sub-enterprises according to the business information provided by all the sub-enterprises of an enterprise, the business contribution of each sub-enterprise is generally assessed according to the data volume of the business information it provides, and business is allocated to the sub-enterprises according to the assessed business contribution.

The inventor realized that, because the quality of the business information provided by different sub-enterprises varies, assessing each sub-enterprise's business contribution solely by the volume of business information it provides is not only unconvincing but also encourages a sub-enterprise, under the goal of maximizing its own interests, to obtain a large business contribution score with large amounts of low-quality business information. This has a strongly negative effect on the accuracy of the overall business allocation, resulting in low allocation accuracy.

Summary of the invention

This application provides a reinforcement-learning-based business allocation method, apparatus, device, and storage medium, which are used to improve the accuracy of business allocation.
The first aspect of this application provides a reinforcement-learning-based business allocation method, including:

obtaining feature vector information based on the institutional private data of multiple participating terminals, and predicting a selection probability for the feature vector information through a preset evaluator to obtain the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;

sampling the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal;

updating the model parameters of a preset business evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain an updated federated evaluation model, and calculating a reward value through the updated federated evaluation model, where the reward value indicates the cumulative return of the feature vector information to the updated federated evaluation model;

evaluating the value of the feature vector information through the preset evaluator and the reward value to obtain the participant contribution degree corresponding to each participating terminal; and

allocating business to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant business allocation information.
The second aspect of this application provides a reinforcement-learning-based business allocation device, including a memory, a processor, and a reinforcement-learning-based business allocation program stored in the memory and executable on the processor, where the processor implements the following steps when executing the program:

obtaining feature vector information based on the institutional private data of multiple participating terminals, and predicting a selection probability for the feature vector information through a preset evaluator to obtain the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;

sampling the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal;

updating the model parameters of a preset business evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain an updated federated evaluation model, and calculating a reward value through the updated federated evaluation model, where the reward value indicates the cumulative return of the feature vector information to the updated federated evaluation model;

evaluating the value of the feature vector information through the preset evaluator and the reward value to obtain the participant contribution degree corresponding to each participating terminal; and

allocating business to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant business allocation information.
A third aspect of the present application provides a computer-readable storage medium storing computer instructions that, when run on a computer, cause the computer to perform the following steps:

obtaining feature vector information based on institutional private data of multiple participating terminals, and performing selection probability prediction on the feature vector information through a preset evaluator to obtain the selection probability corresponding to each participating terminal, the preset evaluator being used to evaluate the gradient value of the feature vector information provided by each participating terminal;

sampling the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain sampling gradient information corresponding to each participating terminal;

updating model parameters of a preset service evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain an updated federated evaluation model, and calculating a reward value through the updated federated evaluation model, the reward value being used to indicate the cumulative return of the feature vector information to the updated federated evaluation model;

performing value evaluation on the feature vector information through the preset evaluator and the reward value to obtain a participant contribution degree corresponding to each participating terminal;

performing service allocation on the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant service allocation information.
A fourth aspect of the present application provides a reinforcement-learning-based service allocation apparatus, including:

a prediction module, configured to obtain feature vector information based on institutional private data of multiple participating terminals and perform selection probability prediction on the feature vector information through a preset evaluator to obtain the selection probability corresponding to each participating terminal, the preset evaluator being used to evaluate the gradient value of the feature vector information provided by each participating terminal;

a sampling module, configured to sample the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain sampling gradient information corresponding to each participating terminal;

an updating module, configured to update model parameters of a preset service evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain an updated federated evaluation model, and to calculate a reward value through the updated federated evaluation model, the reward value being used to indicate the cumulative return of the feature vector information to the updated federated evaluation model;

an evaluation module, configured to perform value evaluation on the feature vector information through the preset evaluator and the reward value to obtain a participant contribution degree corresponding to each participating terminal;

an allocation module, configured to perform service allocation on the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant service allocation information.
In the technical solutions provided by the present application, the feature vector information based on the institutional private data of the multiple participating terminals is selectively adopted according to the selection probabilities output by the preset evaluator; the model parameters of the preset service evaluation federated model are updated according to the sampling gradient information corresponding to each participating terminal; the value of the feature vector information is evaluated through the preset evaluator and the reward value; and services are allocated to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal. This reduces computational complexity, improves both the accuracy and the efficiency of participant contribution evaluation, and thereby improves the accuracy of service allocation.
Description of the Drawings

FIG. 1 is a schematic diagram of an embodiment of the reinforcement-learning-based service allocation method in an embodiment of the present application;

FIG. 2 is a schematic diagram of another embodiment of the reinforcement-learning-based service allocation method in an embodiment of the present application;

FIG. 3 is a schematic diagram of an embodiment of the reinforcement-learning-based service allocation apparatus in an embodiment of the present application;

FIG. 4 is a schematic diagram of another embodiment of the reinforcement-learning-based service allocation apparatus in an embodiment of the present application;

FIG. 5 is a schematic diagram of an embodiment of the reinforcement-learning-based service allocation device in an embodiment of the present application.
Detailed Description

The embodiments of the present application provide a reinforcement-learning-based service allocation method, apparatus, device, and storage medium, which improve the accuracy of service allocation.

For ease of understanding, the specific procedure of the embodiments of the present application is described below. Referring to FIG. 1, an embodiment of the reinforcement-learning-based service allocation method in the embodiments of the present application includes:
101. Obtain feature vector information based on institutional private data of multiple participating terminals, and perform selection probability prediction on the feature vector information through a preset evaluator to obtain the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal.

It can be understood that the execution subject of the present application may be a reinforcement-learning-based service allocation apparatus, or a terminal or a server, which is not specifically limited here. The embodiments of the present application are described with a server as the execution subject.

Here, the multiple participating terminals include, but are not limited to, terminals or servers corresponding to medical institutions, insurance institutions, and financial institutions, for example, terminals or servers corresponding to business units within financial institutions, banks, or other finance-related institutions or centers. The multiple participating terminals may correspond to institutions of the same type, for example, all terminals or servers of financial institutions; they may also correspond to institutions of different types, for example, with three participating terminals, one corresponding to a financial institution, another to an insurance institution, and the remaining one to a medical institution.

Institutional private data is the non-shared, encrypted private data of a participating terminal, for example, the private medical data of a medical institution or the private data of the various financial services of a financial institution. The feature vector information may be the gradient information of the model parameters obtained when a participating terminal performs gradient descent on a model using its institutional private data, or it may be the proportion that the participating terminal's institutional private data occupies when the preset service evaluation federated model processes a certain service over the private data of all institutions.

The server constructs the preset service evaluation federated model in advance based on the institutional private data of the multiple participating terminals. The server receives the feature vector information based on institutional private data sent by each of the multiple participating terminals and invokes the preset evaluator; the preset evaluator is a deep neural network that includes a fitness-proportional selection algorithm. Through the fitness-proportional selection algorithm, the server calculates the fitness of each piece of feature vector information; calculates, from these fitness values, the cumulative probability of each piece of feature vector information being inherited into the next-generation population; generates a uniformly distributed random number in the interval [0, 1]; selects feature vector information according to that random number; and normalizes the probability values of the multiple pieces of feature vector information selected for each participating terminal, obtaining the selection probability corresponding to each participating terminal. The preset evaluator is used to evaluate the gradient value provided by each participant, where the gradient value is the degree to which the gradient information contributes to model training, or the degree to which each participant's institutional private data contributes to a given service-processing direction of the preset service evaluation federated model over the private data of all institutions.
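As a rough illustration of the fitness-proportional selection step described above, the following Python sketch (an assumption for illustration only; the fitness function, the number of draws, and the normalization scheme are not fixed by the application) derives normalized selection probabilities from per-item fitness scores through roulette-wheel draws:

```python
import numpy as np

def roulette_selection_probs(fitness, num_draws=100, rng=None):
    """Fitness-proportional (roulette-wheel) selection sketch.

    fitness: 1-D array of non-negative fitness scores, one per piece of
    feature vector information. Returns the normalized frequency with which
    each piece is selected, used as its selection probability.
    """
    rng = rng or np.random.default_rng(0)
    p = fitness / fitness.sum()              # per-piece probability
    cum = np.cumsum(p)                       # cumulative probability bands
    counts = np.zeros_like(p)
    for _ in range(num_draws):
        u = rng.uniform(0.0, 1.0)            # uniform random number in [0, 1]
        idx = min(int(np.searchsorted(cum, u)), p.size - 1)
        counts[idx] += 1
    return counts / counts.sum()             # normalized selection probabilities
```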
102. Sample the feature vector information through the preset sampler and the selection probability corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal.

Here, the feature vector information is gradient information. The server may invoke a deterministic algorithm in the preset sampler to generate a pseudo-random number sequence in [0, 1] for the gradient information and randomly sample the sequence according to the selection probability corresponding to each participating terminal, obtaining the sampling gradient information corresponding to each participating terminal. The server may instead invoke the preset sampler to split all of the gradient information, or the gradient information corresponding to each participating terminal, into multiple shares in chronological order and extract the corresponding data from each share according to each participating terminal's selection probability, obtaining the sampling gradient information corresponding to each participating terminal. The server may also invoke the preset sampler to classify all of the gradient information into one or more preset categories, obtaining gradient information for multiple categories; randomly extract the corresponding gradient information from each category according to each participating terminal's selection probability; and combine the extracted gradient information across categories to obtain the sampling gradient information corresponding to each participating terminal.
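The chronological or category-based variant, drawing from each group in proportion to a participant's selection probability, could look like the following sketch (the group labels, array shapes, and rounding rule are illustrative assumptions):

```python
import numpy as np

def stratified_gradient_sample(gradients, groups, select_prob, rng=None):
    """Draw a fraction `select_prob` of gradient records from each group.

    gradients: array of shape (n, d), one gradient record per row.
    groups:    length-n array of group labels (e.g. time buckets or categories).
    """
    rng = rng or np.random.default_rng(0)
    picked = []
    for g in np.unique(groups):
        idx = np.flatnonzero(groups == g)
        k = max(1, int(round(select_prob * idx.size)))  # at least one per group
        picked.append(rng.choice(idx, size=k, replace=False))
    picked = np.concatenate(picked)
    return gradients[picked]                            # sampled gradient information
```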
103. Update the model parameters of the preset service evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain the updated federated evaluation model, and calculate the reward value through the updated federated evaluation model, where the reward value is used to indicate the cumulative return of the feature vector information to the updated federated evaluation model.

The server normalizes the sampling gradient information corresponding to each participating terminal to obtain comprehensive gradient information, and continuously adjusts and updates the model parameters of the preset service evaluation federated model according to the comprehensive gradient information, thereby obtaining the updated federated evaluation model. Alternatively, the server may apply a preset attention mechanism to the sampling gradient information corresponding to each participating terminal to obtain attention gradient information for each participating terminal, and use that attention gradient information to continuously adjust and update the model parameters of the preset service evaluation federated model; this preserves the characteristics of all the sampling gradient information while allowing the update to be weighted toward the more important gradients, yielding the updated federated evaluation model.
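A minimal sketch of the aggregate-then-update path, assuming the per-participant sampled gradients have been flattened to a common shape (the uniform default weights and the learning rate are assumptions; attention scores could be passed in as `weights`):

```python
import numpy as np

def federated_update(params, sampled_grads, lr=0.01, weights=None):
    """Aggregate per-participant sampled gradients and apply one update.

    sampled_grads: array of shape (num_participants, d), one aggregated
    gradient per participating terminal. `weights` may carry attention
    scores; if omitted, participants are weighted uniformly.
    """
    n = sampled_grads.shape[0]
    w = np.full(n, 1.0 / n) if weights is None else weights / weights.sum()
    combined = (w[:, None] * sampled_grads).sum(axis=0)  # comprehensive gradient
    return params - lr * combined                        # updated model parameters
```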
When receiving the institutional private data of the multiple participating terminals, the server divides the institutional private data into an institutional-data verification set according to a preset ratio and, following the procedure for obtaining sampling gradient information described above, obtains the sampling gradient information corresponding to the institutional-data verification set, thereby obtaining the verification set data of the feature vector information. The server verifies the updated federated evaluation model through the verification set data to obtain a verified federated evaluation model, calculates the verification loss value of the verified federated evaluation model and the moving-average loss value over a preset time period, and computes the difference between the verification loss value and the moving-average loss value to obtain the reward value. The reward value may be the total return (which may include a revenue signal or an information gain) of the preset service evaluation federated model during training; it includes positive and negative reward values and is used to obtain an optimal policy and/or an optimal path. For example, the reward value may be the cumulative return of the optimal policy and/or optimal path for the preset service evaluation federated model to perform a certain service over the private data of all institutions, such as the cumulative return of the optimal policy for allocating insurance-order-amount data items.
104. Perform value evaluation on the feature vector information through the preset evaluator and the reward value to obtain the participant contribution degree corresponding to each participating terminal.

The server obtains service contribution influence factors and evaluates the service contribution value of the feature vector information through the preset evaluator, the reward value, and the service contribution influence factors, obtaining the participant contribution degree corresponding to each participating terminal. The participant contribution degree may be the contribution of each participating terminal's institutional private data to the training of the preset service evaluation federated model, or its contribution to the effectiveness and accuracy with which the preset service evaluation federated model processes the private data of all institutions. For example, when the preset service evaluation federated model makes accurate predictions of financial revenue data (that is, the institutional private data of the participating terminals) over a preset period, the participant contribution degree reflects the role each participant's institutional private data plays and its contribution to that accuracy. A service contribution influence factor is a factor that affects the calculation of the service contribution degree; taking as an example the service of training a federated-learning-based prediction model for insurance-order-amount data, the service contribution influence factors include the accuracy of the insurance orders, the organization of the information provided, and the importance of the service type.
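The application does not fix how the evaluator output, the reward value, and the influence factors are combined; purely as an assumed illustration, one plausible reading is a weighted normalization of the evaluator's value scores:

```python
import numpy as np

def contribution_degrees(value_scores, influence_factors):
    """Hypothetical combination of evaluator value scores and influence factors.

    value_scores:      per-participant value scores from the trained evaluator.
    influence_factors: per-participant service contribution influence factors.
    """
    raw = np.asarray(value_scores) * np.asarray(influence_factors)
    return raw / raw.sum()  # contribution degrees summing to 1
```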
105. Allocate services to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain the participant service allocation information.

The server may judge whether the participant contribution degree corresponding to each participating terminal is less than a preset target value. If so, that participating terminal is excluded and does not take part in the service allocation; if not, services are allocated to the multiple participating terminals according to their participant contribution degrees, obtaining the participant service allocation information. The participant service allocation information may be a revenue allocation, a reward allocation, and/or a priority setting based on the contribution degree of the institutional private data.

In this embodiment of the present application, the feature vector information based on the institutional private data of the multiple participating terminals is selectively adopted according to the selection probabilities output by the preset evaluator; the model parameters of the preset service evaluation federated model are updated according to the sampling gradient information corresponding to each participating terminal; the value of the feature vector information is evaluated through the preset evaluator and the reward value; and services are allocated to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal. This reduces computational complexity, improves both the accuracy and the efficiency of participant contribution evaluation, and thereby improves the accuracy of service allocation.
Referring to FIG. 2, another embodiment of the reinforcement-learning-based service allocation method in the embodiments of the present application includes:

201. Obtain feature vector information based on institutional private data of multiple participating terminals, and perform selection probability prediction on the feature vector information through a preset evaluator to obtain the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal.

Specifically, the server sends a model gradient calculation instruction to each of the multiple participating terminals, so that each participating terminal obtains its institutional private data according to the instruction. Each participating terminal trains a preset participant allocation model with its institutional private data and calculates the parameter gradients of the trained preset participant allocation model through a preset gradient descent algorithm, obtaining the feature vector information corresponding to that participating terminal; the institutional private data includes at least one of the private medical data of a medical institution, the private financial-service data of a financial institution, and the private insurance data of an insurance institution. The server receives the feature vector information corresponding to each participating terminal and performs a selection probability calculation on it through the gradient value function in the preset evaluator, obtaining the selection probability corresponding to each participating terminal.
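A minimal sketch of the participant-side step, local training followed by extraction of the parameter gradients that serve as feature vector information, might look as follows (the model, the MSE loss, and the tensor shapes are illustrative assumptions):

```python
import torch
from torch import nn

def local_feature_vector(private_x, private_y, model):
    """Participant side: one gradient-descent pass on institutional private
    data; the resulting flattened parameter gradient is the feature vector
    information sent to the server."""
    loss_fn = nn.MSELoss()
    model.zero_grad()
    loss = loss_fn(model(private_x), private_y)
    loss.backward()                                        # populate parameter gradients
    grads = [p.grad.detach().flatten() for p in model.parameters()]
    return torch.cat(grads)                                # flattened gradient delta
```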
For example, taking institutional private data that is private financial-service data: the multiple participating terminals are terminal 1 corresponding to financial institution 1, terminal 2 corresponding to financial institution 2, and terminal 3 corresponding to financial institution 3. The server sends a model gradient calculation instruction to each of terminal 1, terminal 2, and terminal 3. Terminal 1 extracts the corresponding private financial-service data 1 from its database according to the instruction, inputs private financial-service data 1 into preset participant allocation model 1, performs service allocation processing (i.e., training) on the data through preset participant allocation model 1, and calculates the parameter gradients of the trained model through the preset gradient descent algorithm, obtaining feature vector information 1 corresponding to terminal 1. Terminal 1 sends feature vector information 1 to the server; after receiving it, the server inputs the feature vector information into the gradient value function and computes selection probability 1 corresponding to terminal 1. The gradient value function is $w = h_\phi(\delta)$, where $w$ denotes the selection probability, $h_\phi(\delta)$ is the gradient value function, $\delta$ denotes the feature vector information, and $\phi$ denotes the trainable parameters. Selection probability 2 corresponding to terminal 2 and selection probability 3 corresponding to terminal 3 are obtained in the same way.
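The form of the gradient value function $h_\phi$ is not further specified; a common choice, shown here as an assumption, is a small multilayer perceptron with a sigmoid output so that $w$ falls in (0, 1):

```python
import torch
from torch import nn

def make_evaluator(grad_dim, hidden=64):
    """Sketch of the preset evaluator h_phi: maps a flattened gradient
    (feature vector information) to a selection probability in (0, 1)."""
    return nn.Sequential(
        nn.Linear(grad_dim, hidden),
        nn.ReLU(),
        nn.Linear(hidden, 1),
        nn.Sigmoid(),
    )

# Usage sketch: one selection probability per participating terminal.
# evaluator = make_evaluator(grad_dim=delta.numel())
# w = evaluator(delta.unsqueeze(0)).squeeze()  # w = h_phi(delta)
```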
202. Through a multinomial-distribution-based algorithm in the preset sampler, calculate the feature vector information according to the selection probability corresponding to each participating terminal to obtain the selection vector corresponding to each participating terminal, where the feature vector information includes the model gradient information of each participating terminal.

Through the probability formula of the multinomial-distribution-based algorithm in the preset sampler, the server calculates, for the model gradient information of each participating terminal and according to its selection probability, the selection vector $\zeta = [\zeta_1, \zeta_2, \zeta_3, \ldots, \zeta_n]$, where $\zeta_i \in \{0, 1\}$ and $P(\zeta_i = 1) = w$, with $P$ denoting the probability value and $w$ the selection probability.
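Drawing each component of the selection vector as an independent Bernoulli variable with success probability $w$ can be sketched as follows (the use of torch and the vector length are assumptions):

```python
import torch

def draw_selection_vector(w, n):
    """Selection vector: zeta_i in {0, 1} with P(zeta_i = 1) = w.

    w: selection probability of one participating terminal.
    n: number of components of the selection vector.
    """
    return torch.bernoulli(torch.full((n,), float(w)))  # e.g. tensor([1., 0., 1., ...])
```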
203. When the selection vector corresponding to a participating terminal takes the preset value, sample the model gradient information of that participating terminal according to its selection vector to obtain the sampling gradient information corresponding to each participating terminal.

In this embodiment, the preset value is preferably 1. The server judges whether the selection vector corresponding to each participating terminal takes the preset value (that is, $\zeta_i = 1$). If so, it performs random, systematic, or stratified sampling of that participating terminal's model gradient information according to its selection vector, obtaining the sampling gradient information corresponding to each participating terminal. If not, the model gradient information of that participating terminal is not sampled; instead, the steps for obtaining the selection vector corresponding to each participating terminal are executed in a loop until the newly obtained selection vector takes the preset value ($\zeta_i = 1$), after which the model gradient information of each participating terminal is sampled according to the re-obtained selection vectors, obtaining the sampling gradient information corresponding to each participating terminal.
204. Update the model parameters of the preset service evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain the updated federated evaluation model, and calculate the reward value through the updated federated evaluation model, where the reward value is used to indicate the cumulative return of the feature vector information to the updated federated evaluation model.

The server normalizes the sampling gradient information corresponding to each participating terminal to obtain comprehensive gradient information, and continuously adjusts and updates the model parameters of the preset service evaluation federated model according to the comprehensive gradient information, thereby obtaining the updated federated evaluation model. Alternatively, the server may apply a preset attention mechanism to the sampling gradient information corresponding to each participating terminal to obtain attention gradient information for each participating terminal, and use that attention gradient information to continuously adjust and update the model parameters of the preset service evaluation federated model, obtaining the updated federated evaluation model; this preserves the characteristics of all the sampling gradient information while weighting the update appropriately.

Specifically, the server obtains the verification set data of the feature vector information, verifies the verification set data through the updated federated model to obtain a verification result, calculates the verification loss value of the verification result and the moving-average loss value over a preset time period, and computes the difference between the verification loss value and the moving-average loss value to obtain the reward value.
When receiving the feature vector information of the institutional private data of the multiple participating terminals, the server divides the feature vector information of the institutional private data into verification set data according to a preset ratio, verifies the verification set data through the updated federated model to obtain the verification result, and calculates the verification loss value of the verification result through a preset loss-value calculation formula, which takes the form

$$ l_v = \frac{1}{M} \sum_{k=1}^{M} \mathcal{L}\big(f_\theta(x_k^v),\, y_k^v\big), $$

where $l_v$ denotes the verification loss value; the superscript $v$ indicates that the data belong to the verification set, i.e., the verification result; $M$ denotes the number of data items in the verification set data; $k$ indexes the $k$-th data item; $\mathcal{L}$ denotes the required loss function, which may be the mean square error (MSE) function, the root-mean-square error (RMSE) function, the cross-entropy loss function, or the like; $f_\theta$ denotes the updated federated evaluation model; $x$ denotes the input data, i.e., the verification set data; and $y$ denotes the corresponding label of the verification set data. The server then calculates the moving-average loss value over the preset time period through a preset formula, which takes the form

$$ \Delta = \frac{1}{T} \sum_{t' = t - T}^{t - 1} l_v^{(t')}, $$

where $T$ denotes the moving-average window length of the preset time period and $\Delta$ denotes the moving-average benchmark of the preset time period. The reward value is obtained by subtracting the moving-average loss value from the verification loss value, i.e., $r = l_v - \Delta$.
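A small sketch of this reward computation, under the assumption that a history of per-round verification loss values is kept:

```python
import numpy as np

def reward_value(loss_history, T):
    """Reward r = l_v - Delta: current verification loss minus the
    moving-average benchmark over the last T rounds (positive and
    negative rewards both occur)."""
    l_v = loss_history[-1]                    # current verification loss
    delta = np.mean(loss_history[-T - 1:-1])  # moving average over window T
    return l_v - delta
```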
205. Perform value evaluation on the feature vector information through the preset evaluator and the reward value to obtain the participant contribution degree corresponding to each participating terminal.

Specifically, the server performs a loss-function calculation on the reward value and the selection vectors through a preset Monte Carlo policy gradient algorithm to obtain the loss function of the preset evaluator; trains the preset evaluator through this loss function until the loss function converges, obtaining a target evaluator; and performs value evaluation on the feature vector information through the target evaluator, obtaining the participant contribution degree corresponding to each participating terminal.
The server performs the loss-function calculation on the reward value and the selection vectors through the calculation formula of the Monte Carlo policy gradient (REINFORCE) algorithm to obtain the loss function of the preset evaluator; the formula takes the form

$$ l_h(\phi) = -\frac{r}{N} \sum_{i=1}^{N} \Big[ s_i \log h_\phi(\delta_i) + (1 - s_i) \log\big(1 - h_\phi(\delta_i)\big) \Big], $$

where $r$ denotes the reward value, $N$ denotes the number of selection vectors, $i$ indexes the $i$-th selection vector, $s_i$ denotes the selection vector, $\delta_i$ denotes the sampling gradient information, and $h_\phi(\delta_i)$ denotes the gradient value function. The server updates the trainable parameters $\phi$ of the preset evaluator through the loss function of the preset evaluator, thereby training the preset evaluator and obtaining the target evaluator; the update formula for the trainable parameters $\phi$ is

$$ \phi \leftarrow \phi - \beta \nabla_\phi l_h, $$

where $\beta$ denotes the learning rate and $l_h$ denotes the loss function of the preset evaluator. Through the target evaluator, the server performs value evaluation on the feature vector information and obtains the participant contribution degree corresponding to each participating terminal.
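A compact sketch of one REINFORCE update of the evaluator (the choice of optimizer, the epsilon clamp for numerical safety, and the tensor shapes are assumptions):

```python
import torch

def reinforce_step(evaluator, optimizer, deltas, s, r, eps=1e-8):
    """One REINFORCE update of the evaluator's trainable parameters phi.

    deltas: (N, d) sampling gradient information.
    s:      (N,) selection vector entries in {0, 1}.
    r:      scalar reward value.
    """
    w = evaluator(deltas).squeeze(-1).clamp(eps, 1 - eps)
    log_pi = s * torch.log(w) + (1 - s) * torch.log(1 - w)
    loss = -r * log_pi.mean()  # l_h(phi), minimized by the optimizer
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()           # with plain SGD this is phi <- phi - beta * grad(l_h)
    return loss.item()
```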
206. Allocate services to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain the participant service allocation information.

Specifically, the server obtains the contribution share of the participant contribution degree corresponding to each participating terminal and judges whether the contribution share is greater than a preset threshold. If the contribution share is greater than the preset threshold, a preset allocation policy is invoked to allocate services to the multiple participating terminals, obtaining the participant service allocation information; if the contribution share is less than or equal to the preset threshold, services are allocated to the multiple participating terminals according to the contribution share, obtaining the participant service allocation information.

Here, the server judges whether the participant contribution degree corresponding to each participating terminal is less than a preset contribution degree. If so, the participant contribution degrees below the preset contribution degree are removed, and the ratio of each remaining participating terminal's contribution degree to the sum of the remaining participant contribution degrees is calculated, yielding the contribution share of each participating terminal; if not, the ratio of each participating terminal's contribution degree to the total participant contribution degree is calculated, yielding the contribution share of each participating terminal.

For example, take the service allocation to be a payment allocation, with the participant service allocation information being participant payment allocation information; the payment allocation corresponds to the contribution of each participating terminal's feature vector information to the update of the updated federated evaluation model. Suppose the participant contribution degrees are 0.03 (participating terminal 1), 0.24 (participating terminal 2), 0.40 (participating terminal 3), and 0.33 (participating terminal 4); the preset contribution degree is 0.20; the preset threshold is 0.40; and the total payment amount is 1,000,000. The value 0.03 is removed, giving participant service allocation information 1 (participating terminal 1 receives a payment of 0). The contribution shares corresponding to 0.24 (terminal 2), 0.40 (terminal 3), and 0.33 (terminal 4) are then 0.247, 0.412, and 0.340, respectively. Only 0.412 is greater than the preset threshold of 0.40, so the preset allocation policy is invoked for participating terminal 3 (1,000,000 × 0.412 + 100,000 = 512,000), giving participant service allocation information 3 (participating terminal 3 receives a payment of 512,000); this preset allocation policy allocates an extra 100,000 in addition to the amount corresponding to the contribution share of the total payment. Participating terminal 2 and participating terminal 4 are then allocated according to their contribution shares (terminal 2: 0.247 × 1,000,000 = 247,000; terminal 4: 0.340 × 1,000,000 = 340,000), giving participant service allocation information 2 (participating terminal 2 receives a payment of 247,000) and participant service allocation information 4 (participating terminal 4 receives a payment of 340,000).
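The allocation arithmetic of this example can be reproduced with a short sketch (function and variable names are illustrative; the bonus and thresholds mirror the example above):

```python
def allocate_payments(contribs, total=1_000_000, preset_contrib=0.20,
                      threshold=0.40, bonus=100_000):
    """Threshold-and-share payment allocation from the worked example."""
    kept = {k: v for k, v in contribs.items() if v >= preset_contrib}
    s = sum(kept.values())
    shares = {k: v / s for k, v in kept.items()}  # contribution shares
    payments = {k: 0.0 for k in contribs}         # removed terminals receive 0
    for k, share in shares.items():
        payments[k] = share * total + (bonus if share > threshold else 0.0)
    return payments

# allocate_payments({"t1": 0.03, "t2": 0.24, "t3": 0.40, "t4": 0.33})
# -> t1: 0, t2: ~247,000, t3: ~512,000 (share 0.412 plus bonus), t4: ~340,000
```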
Specifically, after allocating services to the multiple participating terminals according to their participant contribution degrees and obtaining the participant service allocation information, the server also obtains anomaly information about the participant service allocation information, updates the participant service allocation information according to the anomaly information, and optimizes the strategy for determining the selection probability corresponding to each participating terminal.

The server encrypts the participant service allocation information and sends it to an audit terminal, which decrypts and audits it. If the participant service allocation information contains anomaly information, the anomaly information is fed back to the server, and the server matches a corresponding optimization mechanism to it; the optimization mechanism includes an optimization algorithm, an optimization strategy, an optimization execution process, and an optimization execution script. Through the optimization mechanism, the server corrects (updates) the anomaly information in the participant service allocation information and optimizes the strategy for determining the selection probability corresponding to each participating terminal, where the determination strategy includes the model selection, the model calculation, and the selection of feature vector information for each participating terminal's selection probability. Updating the participant service allocation information according to the anomaly information and optimizing the selection-probability determination strategy improves the computational convenience, accuracy, and efficiency of the execution of the reinforcement-learning-based service allocation method, thereby improving the accuracy of service allocation.

In this embodiment of the present application, the feature vector information based on the institutional private data of the multiple participating terminals is selectively adopted according to the selection probabilities output by the preset evaluator; the model parameters of the preset service evaluation federated model are updated according to the sampling gradient information corresponding to each participating terminal; the value of the feature vector information is evaluated through the preset evaluator and the reward value; and services are allocated to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal. This reduces computational complexity, improves both the accuracy and the efficiency of participant contribution evaluation, and thereby improves the accuracy of service allocation.
The reinforcement-learning-based service allocation method in the embodiments of the present application has been described above; the reinforcement-learning-based service allocation apparatus in the embodiments of the present application is described below. Referring to FIG. 3, an embodiment of the reinforcement-learning-based service allocation apparatus in the embodiments of the present application includes:

a prediction module 301, configured to obtain feature vector information based on the institutional private data of multiple participating terminals and perform selection probability prediction on the feature vector information through a preset evaluator to obtain the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;

a sampling module 302, configured to sample the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal;

an updating module 303, configured to update the model parameters of the preset service evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain the updated federated evaluation model, and to calculate the reward value through the updated federated evaluation model, where the reward value is used to indicate the cumulative return of the feature vector information to the updated federated evaluation model;

an evaluation module 304, configured to perform value evaluation on the feature vector information through the preset evaluator and the reward value to obtain the participant contribution degree corresponding to each participating terminal;

an allocation module 305, configured to allocate services to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain the participant service allocation information.

The function implementation of each module in the above reinforcement-learning-based service allocation apparatus corresponds to the steps in the embodiments of the reinforcement-learning-based service allocation method described above, and their functions and implementation processes are not repeated here.

In this embodiment of the present application, the feature vector information based on the institutional private data of the multiple participating terminals is selectively adopted according to the selection probabilities output by the preset evaluator; the model parameters of the preset service evaluation federated model are updated according to the sampling gradient information corresponding to each participating terminal; the value of the feature vector information is evaluated through the preset evaluator and the reward value; and services are allocated to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal. This reduces computational complexity, improves both the accuracy and the efficiency of participant contribution evaluation, and thereby improves the accuracy of service allocation.
Referring to FIG. 4, another embodiment of the reinforcement-learning-based service allocation apparatus in the embodiments of the present application includes:

a prediction module 301, configured to obtain feature vector information based on the institutional private data of multiple participating terminals and perform selection probability prediction on the feature vector information through a preset evaluator to obtain the selection probability corresponding to each participating terminal, where the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;

a sampling module 302, configured to sample the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal;

where the sampling module 302 specifically includes:

a calculation unit 3021, configured to calculate the feature vector information through a multinomial-distribution-based algorithm in the preset sampler and according to the selection probability corresponding to each participating terminal, obtaining the selection vector corresponding to each participating terminal, where the feature vector information includes the model gradient information of each participating terminal;

a sampling unit 3022, configured to, when the selection vector corresponding to a participating terminal takes the preset value, sample the model gradient information of that participating terminal according to its selection vector to obtain the sampling gradient information corresponding to each participating terminal;

an updating module 303, configured to update the model parameters of the preset service evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain the updated federated evaluation model, and to calculate the reward value through the updated federated evaluation model, where the reward value is used to indicate the cumulative return of the feature vector information to the updated federated evaluation model;

an evaluation module 304, configured to perform value evaluation on the feature vector information through the preset evaluator and the reward value to obtain the participant contribution degree corresponding to each participating terminal;

an allocation module 305, configured to allocate services to the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain the participant service allocation information.
Optionally, the evaluation module 304 may be further specifically configured to:

perform a loss-function calculation on the reward value and the selection vectors through a preset Monte Carlo policy gradient algorithm to obtain the loss function of the preset evaluator;

train the preset evaluator through the loss function of the preset evaluator until the loss function converges, obtaining the target evaluator;

perform value evaluation on the feature vector information through the target evaluator to obtain the participant contribution degree corresponding to each participating terminal.

Optionally, the prediction module 301 may be further specifically configured to:

send a model gradient calculation instruction to each of the multiple participating terminals, so that each participating terminal obtains its institutional private data according to the model gradient calculation instruction;

train the preset participant allocation model through the institutional private data of the participating terminal, and calculate the parameter gradients of the trained preset participant allocation model through the preset gradient descent algorithm to obtain the feature vector information corresponding to each participating terminal, where the institutional private data includes at least one of the private medical data of a medical institution, the private financial-service data of a financial institution, and the private insurance data of an insurance institution;

receive the feature vector information corresponding to each participating terminal sent by each participating terminal, and perform a selection probability calculation on the feature vector information corresponding to each participating terminal through the gradient value function in the preset evaluator to obtain the selection probability corresponding to each participating terminal.
Optionally, the updating module 303 may be further specifically configured to:

obtain the verification set data of the feature vector information, and verify the verification set data through the updated federated model to obtain the verification result;

calculate the verification loss value of the verification result and the moving-average loss value over the preset time period;

compute the difference between the verification loss value and the moving-average loss value to obtain the reward value.

Optionally, the allocation module 305 may be further specifically configured to:

obtain the contribution share of the participant contribution degree corresponding to each participating terminal, and judge whether the contribution share is greater than the preset threshold;

if the contribution share is greater than the preset threshold, invoke the preset allocation policy to allocate services to the multiple participating terminals, obtaining the participant service allocation information;

if the contribution share is less than or equal to the preset threshold, allocate services to the multiple participating terminals according to the contribution share, obtaining the participant service allocation information.
可选的,基于强化学习的业务分配装置,还包括:Optionally, a service distribution device based on reinforcement learning also includes:
更新优化模块306,用于获取参与者业务分配信息的异常信息,根据异常信息对参与者业务分配信息进行更新,并对各参与端对应的选择概率的确定策略进行优化。The update optimization module 306 is used to obtain abnormal information of the participant's business allocation information, update the participant's business allocation information according to the abnormal information, and optimize the determination strategy of the selection probability corresponding to each participant.
上述基于强化学习的业务分配装置中各模块和各单元的功能实现与上述基于强化学习的业务分配方法实施例中各步骤相对应,其功能和实现过程在此处不再一一赘述。The function realization of each module and each unit in the above-mentioned reinforcement learning-based service distribution device corresponds to each step in the above-mentioned embodiment of the above-mentioned service distribution method based on reinforcement learning, and the functions and realization processes thereof will not be repeated here.
In the embodiments of the present application, the feature vector information derived from the institutional private data of multiple participating terminals is selectively adopted according to the selection probabilities output by the preset evaluator; the model parameters of the preset service evaluation federated model are updated according to the sampling gradient information corresponding to each participating terminal; the feature vector information is value-evaluated through the preset evaluator and the reward value; and service allocation is performed on the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal. This reduces computational complexity, improves the accuracy and efficiency of participant contribution evaluation, and thereby improves the accuracy of service allocation.
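The selective-sampling and evaluator-training steps summarized here (detailed in claims 2 and 3 below) can be sketched as follows. Modeling each participating terminal's selection as an independent Bernoulli draw from its selection probability is a simplification of the multinomial-distribution-based sampling named in the claims, and the REINFORCE-style loss is one common form of the Monte Carlo policy gradient; both are assumptions about one plausible realization.

import torch

def sample_selection_vector(selection_probs: torch.Tensor) -> torch.Tensor:
    """Preset-sampler step: draw a 0/1 selection vector; a terminal's
    model gradient is adopted only when its entry equals 1."""
    return torch.bernoulli(selection_probs)

def evaluator_loss(selection_probs: torch.Tensor,
                   selection: torch.Tensor,
                   reward: float) -> torch.Tensor:
    """REINFORCE-style (Monte Carlo policy gradient) loss: scale the
    log-likelihood of the sampled selection vector by the reward, so
    selections that improved the federated model become more likely."""
    eps = 1e-8
    log_likelihood = (selection * torch.log(selection_probs + eps)
                      + (1 - selection)
                      * torch.log(1 - selection_probs + eps)).sum()
    return -reward * log_likelihood

Training the preset evaluator by minimizing this loss until convergence would yield the target evaluator used for the contribution evaluation.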
Figures 3 and 4 above describe the reinforcement-learning-based service allocation apparatus in the embodiments of the present application in detail from the perspective of modular functional entities; the reinforcement-learning-based service allocation device in the embodiments of the present application is described in detail below from the perspective of hardware processing.
FIG. 5 is a schematic structural diagram of a reinforcement-learning-based service allocation device provided by an embodiment of the present application. The reinforcement-learning-based service allocation device 500 may vary considerably in configuration or performance, and may include one or more processors (central processing units, CPU) 510 (for example, one or more processors), a memory 520, and one or more storage media 530 (for example, one or more mass storage devices) storing application programs 533 or data 532. The memory 520 and the storage medium 530 may provide transient or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the reinforcement-learning-based service allocation device 500. Furthermore, the processor 510 may be configured to communicate with the storage medium 530 and to execute the series of instruction operations in the storage medium 530 on the reinforcement-learning-based service allocation device 500.
The reinforcement-learning-based service allocation device 500 may further include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, or FreeBSD. Those skilled in the art will understand that the device structure shown in FIG. 5 does not constitute a limitation on the reinforcement-learning-based service allocation device, which may include more or fewer components than shown, a combination of certain components, or a different arrangement of components.
The present application further provides a computer-readable storage medium, which may be non-volatile or volatile. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to perform the steps of the reinforcement-learning-based service allocation method.
Furthermore, the computer-readable storage medium may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function, and the like, and the data storage area may store data created from the use of blockchain nodes, and the like.
The blockchain referred to in this application is a novel application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association through cryptographic methods, each data block containing a batch of network transaction information used to verify the validity (anti-counterfeiting) of the information and to generate the next block. A blockchain may comprise an underlying blockchain platform, a platform product service layer, and an application service layer.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods in the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or any other medium capable of storing program code.
The above embodiments are only used to illustrate the technical solutions of the present application and not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

  1. A reinforcement-learning-based service allocation method, wherein the reinforcement-learning-based service allocation method comprises:
    obtaining feature vector information based on institutional private data of multiple participating terminals, and performing selection probability prediction on the feature vector information through a preset evaluator to obtain a selection probability corresponding to each participating terminal, wherein the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;
    sampling the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain sampling gradient information corresponding to each participating terminal;
    updating model parameters of a preset service evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain an updated federated evaluation model, and calculating a reward value through the updated federated evaluation model, wherein the reward value is used to indicate the cumulative return value of the feature vector information to the updated federated evaluation model;
    performing value evaluation on the feature vector information through the preset evaluator and the reward value to obtain a participant contribution degree corresponding to each participating terminal; and
    performing service allocation on the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant service allocation information.
  2. The reinforcement-learning-based service allocation method according to claim 1, wherein the sampling the feature vector information through the preset sampler and the selection probability corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal comprises:
    calculating the feature vector information through a multinomial-distribution-based algorithm in the preset sampler, according to the selection probability corresponding to each participating terminal, to obtain a selection vector corresponding to each participating terminal, wherein the feature vector information includes model gradient information of each participating terminal; and
    when the selection vector corresponding to each participating terminal is a preset value, sampling the model gradient information of each participating terminal according to the selection vector corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal.
  3. The reinforcement-learning-based service allocation method according to claim 2, wherein the performing value evaluation on the feature vector information through the preset evaluator and the reward value to obtain the participant contribution degree corresponding to each participating terminal comprises:
    performing loss function calculation on the reward value and the selection vector through a preset Monte Carlo policy gradient algorithm to obtain a loss function of the preset evaluator;
    training the preset evaluator through the loss function of the preset evaluator until the loss function converges, to obtain a target evaluator; and
    performing value evaluation on the feature vector information through the target evaluator to obtain the participant contribution degree corresponding to each participating terminal.
  4. The reinforcement-learning-based service allocation method according to claim 1, wherein the obtaining feature vector information based on institutional private data of multiple participating terminals, and performing selection probability prediction on the feature vector information through a preset evaluator to obtain a selection probability corresponding to each participating terminal comprises:
    sending model gradient calculation instructions to the multiple participating terminals respectively, so that each participating terminal obtains institutional private data of the participating terminal according to the model gradient calculation instruction;
    training a preset participant allocation model through the institutional private data of the participating terminals, and calculating, through a preset gradient descent algorithm, parameter gradients of the trained preset participant allocation model to obtain the feature vector information corresponding to each participating terminal, wherein the institutional private data includes at least one of medical private data of medical institutions, financial business private data of financial institutions, and insurance private data of insurance institutions; and
    receiving the feature vector information corresponding to each participating terminal sent by each participating terminal, and performing selection probability calculation on the feature vector information corresponding to each participating terminal through a gradient value function in the preset evaluator to obtain the selection probability corresponding to each participating terminal.
  5. The reinforcement-learning-based service allocation method according to claim 1, wherein the calculating a reward value through the updated federated evaluation model comprises:
    obtaining validation set data of the feature vector information, and validating the validation set data through the updated federated evaluation model to obtain a validation result;
    calculating a validation loss value of the validation result, and a moving average loss value over a preset period; and
    performing difference calculation on the validation loss value and the moving average loss value to obtain the reward value.
  6. The reinforcement-learning-based service allocation method according to claim 1, wherein the performing service allocation on the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant service allocation information comprises:
    obtaining a contribution proportion of the participant contribution degree corresponding to each participating terminal, and determining whether the contribution proportion is greater than a preset threshold;
    if the contribution proportion is greater than the preset threshold, invoking a preset allocation policy to perform service allocation on the multiple participating terminals, to obtain first participant service allocation information; and
    if the contribution proportion is less than or equal to the preset threshold, performing service allocation on the multiple participating terminals according to the contribution proportion, to obtain second participant service allocation information.
  7. The reinforcement-learning-based service allocation method according to any one of claims 1-6, wherein after the performing service allocation on the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant service allocation information, the method further comprises:
    obtaining anomaly information of the participant service allocation information, updating the participant service allocation information according to the anomaly information, and optimizing a policy for determining the selection probability corresponding to each participating terminal.
  8. A reinforcement-learning-based service allocation device, wherein the reinforcement-learning-based service allocation device comprises: a memory, a processor, and a reinforcement-learning-based service allocation program stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the reinforcement-learning-based service allocation program:
    obtaining feature vector information based on institutional private data of multiple participating terminals, and performing selection probability prediction on the feature vector information through a preset evaluator to obtain a selection probability corresponding to each participating terminal, wherein the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;
    sampling the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain sampling gradient information corresponding to each participating terminal;
    updating model parameters of a preset service evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain an updated federated evaluation model, and calculating a reward value through the updated federated evaluation model, wherein the reward value is used to indicate the cumulative return value of the feature vector information to the updated federated evaluation model;
    performing value evaluation on the feature vector information through the preset evaluator and the reward value to obtain a participant contribution degree corresponding to each participating terminal; and
    performing service allocation on the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant service allocation information.
  9. The reinforcement-learning-based service allocation device according to claim 8, wherein when the processor executes the reinforcement-learning-based service allocation program to implement the sampling of the feature vector information through the preset sampler and the selection probability corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal, the following steps are included:
    calculating the feature vector information through a multinomial-distribution-based algorithm in the preset sampler, according to the selection probability corresponding to each participating terminal, to obtain a selection vector corresponding to each participating terminal, wherein the feature vector information includes model gradient information of each participating terminal; and
    when the selection vector corresponding to each participating terminal is a preset value, sampling the model gradient information of each participating terminal according to the selection vector corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal.
  10. The reinforcement-learning-based service allocation device according to claim 9, wherein when the processor executes the reinforcement-learning-based service allocation program to implement the performing of value evaluation on the feature vector information through the preset evaluator and the reward value to obtain the participant contribution degree corresponding to each participating terminal, the following steps are included:
    performing loss function calculation on the reward value and the selection vector through a preset Monte Carlo policy gradient algorithm to obtain a loss function of the preset evaluator;
    training the preset evaluator through the loss function of the preset evaluator until the loss function converges, to obtain a target evaluator; and
    performing value evaluation on the feature vector information through the target evaluator to obtain the participant contribution degree corresponding to each participating terminal.
  11. The reinforcement-learning-based service allocation device according to claim 8, wherein when the processor executes the reinforcement-learning-based service allocation program to implement the obtaining of feature vector information based on institutional private data of multiple participating terminals, and the performing of selection probability prediction on the feature vector information through the preset evaluator to obtain the selection probability corresponding to each participating terminal, the following steps are included:
    sending model gradient calculation instructions to the multiple participating terminals respectively, so that each participating terminal obtains institutional private data of the participating terminal according to the model gradient calculation instruction;
    training a preset participant allocation model through the institutional private data of the participating terminals, and calculating, through a preset gradient descent algorithm, parameter gradients of the trained preset participant allocation model to obtain the feature vector information corresponding to each participating terminal, wherein the institutional private data includes at least one of medical private data of medical institutions, financial business private data of financial institutions, and insurance private data of insurance institutions; and
    receiving the feature vector information corresponding to each participating terminal sent by each participating terminal, and performing selection probability calculation on the feature vector information corresponding to each participating terminal through a gradient value function in the preset evaluator to obtain the selection probability corresponding to each participating terminal.
  12. The reinforcement-learning-based service allocation device according to claim 8, wherein when the processor executes the reinforcement-learning-based service allocation program to implement the calculating of the reward value through the updated federated evaluation model, the following steps are included:
    obtaining validation set data of the feature vector information, and validating the validation set data through the updated federated evaluation model to obtain a validation result;
    calculating a validation loss value of the validation result, and a moving average loss value over a preset period; and
    performing difference calculation on the validation loss value and the moving average loss value to obtain the reward value.
  13. The reinforcement-learning-based service allocation device according to claim 8, wherein when the processor executes the reinforcement-learning-based service allocation program to implement the performing of service allocation on the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant service allocation information, the following steps are included:
    obtaining a contribution proportion of the participant contribution degree corresponding to each participating terminal, and determining whether the contribution proportion is greater than a preset threshold;
    if the contribution proportion is greater than the preset threshold, invoking a preset allocation policy to perform service allocation on the multiple participating terminals, to obtain first participant service allocation information; and
    if the contribution proportion is less than or equal to the preset threshold, performing service allocation on the multiple participating terminals according to the contribution proportion, to obtain second participant service allocation information.
  14. The reinforcement-learning-based service allocation device according to any one of claims 8-13, wherein after the processor executes the reinforcement-learning-based service allocation program to implement the performing of service allocation on the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant service allocation information, the following step is included:
    obtaining anomaly information of the participant service allocation information, updating the participant service allocation information according to the anomaly information, and optimizing a policy for determining the selection probability corresponding to each participating terminal.
  15. A computer-readable storage medium storing computer instructions, wherein, when the computer instructions are run on a computer, the computer is caused to perform the following steps:
    obtaining feature vector information based on institutional private data of multiple participating terminals, and performing selection probability prediction on the feature vector information through a preset evaluator to obtain a selection probability corresponding to each participating terminal, wherein the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;
    sampling the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain sampling gradient information corresponding to each participating terminal;
    updating model parameters of a preset service evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain an updated federated evaluation model, and calculating a reward value through the updated federated evaluation model, wherein the reward value is used to indicate the cumulative return value of the feature vector information to the updated federated evaluation model;
    performing value evaluation on the feature vector information through the preset evaluator and the reward value to obtain a participant contribution degree corresponding to each participating terminal; and
    performing service allocation on the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant service allocation information.
  16. The computer-readable storage medium according to claim 15, wherein when the computer instructions are executed to implement the sampling of the feature vector information through the preset sampler and the selection probability corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal, the following steps are included:
    calculating the feature vector information through a multinomial-distribution-based algorithm in the preset sampler, according to the selection probability corresponding to each participating terminal, to obtain a selection vector corresponding to each participating terminal, wherein the feature vector information includes model gradient information of each participating terminal; and
    when the selection vector corresponding to each participating terminal is a preset value, sampling the model gradient information of each participating terminal according to the selection vector corresponding to each participating terminal to obtain the sampling gradient information corresponding to each participating terminal.
  17. The computer-readable storage medium according to claim 16, wherein when the computer instructions are executed to implement the performing of value evaluation on the feature vector information through the preset evaluator and the reward value to obtain the participant contribution degree corresponding to each participating terminal, the following steps are included:
    performing loss function calculation on the reward value and the selection vector through a preset Monte Carlo policy gradient algorithm to obtain a loss function of the preset evaluator;
    training the preset evaluator through the loss function of the preset evaluator until the loss function converges, to obtain a target evaluator; and
    performing value evaluation on the feature vector information through the target evaluator to obtain the participant contribution degree corresponding to each participating terminal.
  18. The computer-readable storage medium according to claim 15, wherein when the computer instructions are executed to implement the obtaining of feature vector information based on institutional private data of multiple participating terminals, and the performing of selection probability prediction on the feature vector information through the preset evaluator to obtain the selection probability corresponding to each participating terminal, the following steps are included:
    sending model gradient calculation instructions to the multiple participating terminals respectively, so that each participating terminal obtains institutional private data of the participating terminal according to the model gradient calculation instruction;
    training a preset participant allocation model through the institutional private data of the participating terminals, and calculating, through a preset gradient descent algorithm, parameter gradients of the trained preset participant allocation model to obtain the feature vector information corresponding to each participating terminal, wherein the institutional private data includes at least one of medical private data of medical institutions, financial business private data of financial institutions, and insurance private data of insurance institutions; and
    receiving the feature vector information corresponding to each participating terminal sent by each participating terminal, and performing selection probability calculation on the feature vector information corresponding to each participating terminal through a gradient value function in the preset evaluator to obtain the selection probability corresponding to each participating terminal.
  19. The computer-readable storage medium according to claim 15, wherein when the computer instructions are executed to implement the calculating of the reward value through the updated federated evaluation model, the following steps are included:
    obtaining validation set data of the feature vector information, and validating the validation set data through the updated federated evaluation model to obtain a validation result;
    calculating a validation loss value of the validation result, and a moving average loss value over a preset period; and
    performing difference calculation on the validation loss value and the moving average loss value to obtain the reward value.
  20. A reinforcement-learning-based service allocation apparatus, wherein the reinforcement-learning-based service allocation apparatus comprises:
    a prediction module, configured to obtain feature vector information based on institutional private data of multiple participating terminals, and perform selection probability prediction on the feature vector information through a preset evaluator to obtain a selection probability corresponding to each participating terminal, wherein the preset evaluator is used to evaluate the gradient value of the feature vector information provided by each participating terminal;
    a sampling module, configured to sample the feature vector information through a preset sampler and the selection probability corresponding to each participating terminal to obtain sampling gradient information corresponding to each participating terminal;
    an update module, configured to update model parameters of a preset service evaluation federated model according to the sampling gradient information corresponding to each participating terminal to obtain an updated federated evaluation model, and calculate a reward value through the updated federated evaluation model, wherein the reward value is used to indicate the cumulative return value of the feature vector information to the updated federated evaluation model;
    an evaluation module, configured to perform value evaluation on the feature vector information through the preset evaluator and the reward value to obtain a participant contribution degree corresponding to each participating terminal; and
    an allocation module, configured to perform service allocation on the multiple participating terminals according to the participant contribution degree corresponding to each participating terminal to obtain participant service allocation information.
PCT/CN2021/083817 2020-11-19 2021-03-30 Method and apparatus for service allocation based on reinforcement learning WO2021208720A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011298673.1A CN112381428B (en) 2020-11-19 2020-11-19 Service distribution method, device, equipment and storage medium based on reinforcement learning
CN202011298673.1 2020-11-19

Publications (1)

Publication Number Publication Date
WO2021208720A1 true WO2021208720A1 (en) 2021-10-21

Family

ID=74585220

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/083817 WO2021208720A1 (en) 2020-11-19 2021-03-30 Method and apparatus for service allocation based on reinforcement learning

Country Status (2)

Country Link
CN (1) CN112381428B (en)
WO (1) WO2021208720A1 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114021464A (en) * 2021-11-09 2022-02-08 京东科技信息技术有限公司 Data processing method, device and storage medium
CN114819197A (en) * 2022-06-27 2022-07-29 杭州同花顺数据开发有限公司 Block chain alliance-based federal learning method, system, device and storage medium
CN115296927A (en) * 2022-09-28 2022-11-04 山东省计算中心(国家超级计算济南中心) Block chain-based federal learning credible fusion excitation method and system
CN116451593A (en) * 2023-06-14 2023-07-18 北京邮电大学 Reinforced federal learning dynamic sampling method and equipment based on data quality evaluation
CN117009095A (en) * 2023-10-07 2023-11-07 湘江实验室 Privacy data processing model generation method, device, terminal equipment and medium
CN117076113A (en) * 2023-08-17 2023-11-17 重庆理工大学 Industrial heterogeneous equipment multi-job scheduling method based on federal learning
CN117172338A (en) * 2023-11-02 2023-12-05 数据空间研究院 Contribution evaluation method in longitudinal federal learning scene
CN117952185A (en) * 2024-03-15 2024-04-30 中国科学技术大学 Financial field large model training method and system based on multidimensional data evaluation

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112381428B (en) * 2020-11-19 2023-09-19 平安科技(深圳)有限公司 Service distribution method, device, equipment and storage medium based on reinforcement learning
CN117575291B (en) * 2024-01-15 2024-05-10 湖南科技大学 Federal learning data collaborative management method based on edge parameter entropy

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110490335A (en) * 2019-08-07 2019-11-22 深圳前海微众银行股份有限公司 A kind of method and device calculating participant's contribution rate
US20200074236A1 (en) * 2018-08-31 2020-03-05 Hitachi, Ltd. Reward function generation method and computer system
CN110910158A (en) * 2019-10-08 2020-03-24 深圳逻辑汇科技有限公司 Federal learning revenue allocation method and system
CN112381428A (en) * 2020-11-19 2021-02-19 平安科技(深圳)有限公司 Business allocation method, device, equipment and storage medium based on reinforcement learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110782042B (en) * 2019-10-29 2022-02-11 深圳前海微众银行股份有限公司 Method, device, equipment and medium for combining horizontal federation and vertical federation
CN111611610B (en) * 2020-04-12 2023-05-30 西安电子科技大学 Federal learning information processing method, system, storage medium, program, and terminal

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200074236A1 (en) * 2018-08-31 2020-03-05 Hitachi, Ltd. Reward function generation method and computer system
CN110490335A (en) * 2019-08-07 2019-11-22 深圳前海微众银行股份有限公司 A kind of method and device calculating participant's contribution rate
CN110910158A (en) * 2019-10-08 2020-03-24 深圳逻辑汇科技有限公司 Federal learning revenue allocation method and system
CN112381428A (en) * 2020-11-19 2021-02-19 平安科技(深圳)有限公司 Business allocation method, device, equipment and storage medium based on reinforcement learning

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114021464A (en) * 2021-11-09 2022-02-08 京东科技信息技术有限公司 Data processing method, device and storage medium
CN114819197A (en) * 2022-06-27 2022-07-29 杭州同花顺数据开发有限公司 Block chain alliance-based federal learning method, system, device and storage medium
CN114819197B (en) * 2022-06-27 2023-07-04 杭州同花顺数据开发有限公司 Federal learning method, system, device and storage medium based on blockchain alliance
CN115296927A (en) * 2022-09-28 2022-11-04 山东省计算中心(国家超级计算济南中心) Block chain-based federal learning credible fusion excitation method and system
CN116451593A (en) * 2023-06-14 2023-07-18 北京邮电大学 Reinforced federal learning dynamic sampling method and equipment based on data quality evaluation
CN116451593B (en) * 2023-06-14 2023-11-14 北京邮电大学 Reinforced federal learning dynamic sampling method and equipment based on data quality evaluation
CN117076113A (en) * 2023-08-17 2023-11-17 重庆理工大学 Industrial heterogeneous equipment multi-job scheduling method based on federal learning
CN117009095A (en) * 2023-10-07 2023-11-07 湘江实验室 Privacy data processing model generation method, device, terminal equipment and medium
CN117009095B (en) * 2023-10-07 2024-01-02 湘江实验室 Privacy data processing model generation method, device, terminal equipment and medium
CN117172338A (en) * 2023-11-02 2023-12-05 数据空间研究院 Contribution evaluation method in longitudinal federal learning scene
CN117172338B (en) * 2023-11-02 2024-02-02 数据空间研究院 Contribution evaluation method in longitudinal federal learning scene
CN117952185A (en) * 2024-03-15 2024-04-30 中国科学技术大学 Financial field large model training method and system based on multidimensional data evaluation

Also Published As

Publication number Publication date
CN112381428B (en) 2023-09-19
CN112381428A (en) 2021-02-19

Similar Documents

Publication Publication Date Title
WO2021208720A1 (en) Method and apparatus for service allocation based on reinforcement learning
Liu et al. Gtg-shapley: Efficient and accurate participant contribution evaluation in federated learning
CN112465626B (en) Combined risk assessment method based on client classification aggregation and related equipment
CN108364195B (en) User retention probability prediction method and device, prediction server and storage medium
CN114297722B (en) Privacy protection asynchronous federal sharing method and system based on block chain
US7366680B1 (en) Project management system and method for assessing relationships between current and historical projects
CN112799708B (en) Method and system for jointly updating business model
CN111105240A (en) Resource-sensitive combined financial fraud detection model training method and detection method
Byanjankar Predicting credit risk in Peer-to-Peer lending with survival analysis
Eddy et al. Credit scoring models: Techniques and issues
Byanjankar et al. Data‐driven optimization of peer‐to‐peer lending portfolios based on the expected value framework
CN115049011A (en) Method and device for determining contribution degree of training member model of federal learning
US20200034831A1 (en) Combining explicit and implicit feedback in self-learning fraud detection systems
US20140344020A1 (en) Competitor pricing strategy determination
Aggarwal et al. A Structural Analysis of Bitcoin Cash's Emergency Difficulty Adjustment Algorithm
CN112308623A (en) High-quality client loss prediction method and device based on supervised learning and storage medium
Dorner et al. Incentivizing honesty among competitors in collaborative learning and optimization
CN116596659A (en) Enterprise intelligent credit approval method, system and medium based on big data wind control
Zhao et al. Addressing budget allocation and revenue allocation in data market environments using an adaptive sampling algorithm
CN111382909A (en) Rejection inference method based on survival analysis model expansion bad sample and related equipment
CN115713389A (en) Financial product recommendation method and device
US20140344021A1 (en) Reactive competitor price determination using a competitor response model
US20140344022A1 (en) Competitor response model based pricing tool
CN114820160A (en) Loan interest rate estimation method, device, equipment and readable storage medium
US20110295767A1 (en) Inverse solution for structured finance

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21788006

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21788006

Country of ref document: EP

Kind code of ref document: A1