CN117311991A - Model training method, task allocation method, device, equipment, medium and system - Google Patents

Model training method, task allocation method, device, equipment, medium and system

Info

Publication number
CN117311991A
CN117311991A
Authority
CN
China
Prior art keywords
task
time slot
target
server
accelerator
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311597780.8A
Other languages
Chinese (zh)
Other versions
CN117311991B (en)
Inventor
杨乐
王彦伟
鲁璐
王江为
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Metabrain Intelligent Technology Co Ltd
Original Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Metabrain Intelligent Technology Co Ltd filed Critical Suzhou Metabrain Intelligent Technology Co Ltd
Priority to CN202311597780.8A priority Critical patent/CN117311991B/en
Publication of CN117311991A publication Critical patent/CN117311991A/en
Application granted granted Critical
Publication of CN117311991B publication Critical patent/CN117311991B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46Multiprogramming arrangements
    • G06F9/50Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a model training method, a task allocation method, a device, equipment, a medium and a system in the technical field of computers. The invention constructs a discrete sample set that guarantees sample independence and performs reinforcement learning training with it, which prevents the training process from settling into a local optimum. The task allocation model finally obtained by training outputs an optimal allocation strategy under which a task producer obtains higher cost-effectiveness at the minimum total resource consumption and a task executor obtains the maximum benefit at the maximum task amount, so that the resource consumption and the task amount borne by the task execution end are balanced.

Description

Model training method, task allocation method, device, equipment, medium and system
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a model training method, a task allocation method, a device, equipment, a medium, and a system.
Background
At present, when edge computing tasks are allocated, the allocation strategy is generally determined by considering only the computing resources of the task execution end, such as processor performance and memory size. An allocation strategy determined in this way can complete the tasks, but it consumes more resources in the system, the tasks borne by different task execution ends cannot reach the maximum benefit, and the cost-effectiveness of the allocation strategy is low.
Therefore, how to balance the resource consumption against the task amount borne by the task execution end when formulating the task allocation policy is a problem that needs to be solved by those skilled in the art.
Disclosure of Invention
In view of the above, the present invention aims to provide a model training method, a task allocation method, a device, equipment, a medium and a system that formulate a task allocation strategy by balancing the resource consumption against the task amount borne by the task execution end. The specific scheme is as follows:
in a first aspect, the present invention provides a model training method, including:
acquiring task information generated by a plurality of edge devices in an edge computing system in each time slot;
determining the allocation strategy of the task information of each time slot according to a preset reward function or in a random mode; the allocation policy is for: each task generated by the plurality of edge devices in a single time slot is distributed to a server in the edge computing system, an accelerator in the edge computing system or a local edge device; the preset reward function aims at minimizing the total resource consumption of the edge computing system in a single time slot, maximizing the task amount processed by the server and maximizing the task amount processed by the accelerator;
Determining punishment information corresponding to the allocation strategy of each time slot;
the task information, the allocation strategy and the punishment information of the same target time slot and the task information of the next time slot of the target time slot are constructed into sample data, and the sample data are filled into a discrete sample set;
performing reinforcement learning training by using the discrete sample set to obtain a task allocation model, wherein the task allocation model is used for: and determining an optimal allocation strategy for task information generated by the plurality of edge devices in a single time slot.
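As an illustration of how such sample data could be stored, the sketch below packs one transition (task information S_t, allocation strategy A_t, punishment (reward-and-penalty) information of slot t, task information S_{t+1}) into a bounded discrete sample set; the class and field names are assumptions made only for illustration.

```python
import random
from collections import deque
from dataclasses import dataclass
from typing import Any

@dataclass
class Transition:
    """One sample: the task information, allocation strategy and reward-and-penalty
    information of time slot t, plus the task information of time slot t+1."""
    state: Any        # S_t
    action: Any       # A_t
    reward: float     # penalty value P_t or reward value H_t of slot t
    next_state: Any   # S_{t+1}

class DiscreteSampleSet:
    """Bounded buffer of mutually independent samples (illustrative sketch)."""
    def __init__(self, capacity: int = 10000):
        self.buffer = deque(maxlen=capacity)

    def push(self, sample: Transition) -> None:
        self.buffer.append(sample)

    def sample(self, batch_size: int) -> list:
        return random.sample(list(self.buffer), batch_size)
```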
Optionally, the determining punishment information corresponding to the allocation policy of each time slot includes:
determining an execution end of each task under each time slot according to an allocation strategy of each time slot, and calculating the resource consumption of each task at the corresponding execution end; the execution end is the server, the accelerator or local edge equipment;
if the resource consumption of any task exceeds the preset execution condition, calculating the punishment value of the time slot;
if the resource consumption of any task does not exceed the preset execution condition, calculating the rewarding value of the time slot;
and taking the punishment value or the rewarding value as punishment information of the corresponding task.
Optionally, the calculating the resource consumption of each task at the corresponding execution end includes:
according to the data processing amount of each task and the terminal characteristic information of the corresponding execution terminal, calculating the time delay, energy consumption and cost of the corresponding task at the corresponding execution terminal;
the resource consumption is determined comprehensively based on time delay, energy consumption and cost.
Optionally, the calculating the time delay, the energy consumption and the cost of the corresponding task at the corresponding execution end according to the data processing capacity of each task and the end characteristic information of the corresponding execution end includes:
for each task, if the execution end of the current task is the local edge device, calculating the local time delay according to the data processing amount of the current task and the processor performance data of the local edge device; calculating the local energy consumption according to the data processing amount of the current task, the processor performance data of the local edge device, the processor process parameters of the local edge device and the processor resources the local edge device must consume per bit processed; and setting the cost spent by the current task at the local edge device to zero.

Or, for each task, if the execution end of the current task is the server, calculating the server time delay according to the data processing amount of the current task, the rate at which the server receives the current task, the processor resources the server must consume per bit processed and the processor resources consumed by the current task at the server; calculating the server energy consumption according to the data processing amount of the current task, the rate at which the server receives the current task and the upload rate of the current task; and calculating the cost consumed by the current task at the server according to the data processing amount of the current task, the processor resources the server must consume per bit processed and the unit price of the resources.

Or, for each task, if the execution end of the current task is the accelerator, calculating the accelerator time delay according to the data processing amount of the current task, the rate at which the accelerator receives the current task, the processor resources the accelerator must consume per bit processed and the processor resources consumed by the current task at the accelerator; calculating the accelerator energy consumption according to the data processing amount of the current task, the rate at which the accelerator receives the current task and the upload rate of the current task; and calculating the cost consumed by the current task at the accelerator according to the data processing amount of the current task, the processor resources the accelerator must consume per bit processed and the unit price of the resources.
Optionally, the resource consumption of the task exceeding the preset execution condition includes: the actual latency of a task exceeds the maximum allowable latency of the task, the actual cost of the task exceeds the maximum cost budget of the task, and/or the processor resources that the task needs to consume exceed the free processor resources of the executing end executing the task.
Optionally, for each time slot, calculating the task amount processed by the server and the task amount processed by the accelerator in the current time slot; calculating the total resource consumption of each task in the current time slot; and obtaining a punishment value or a rewarding value under the current time slot according to the task quantity processed by the server under the current time slot, the task quantity processed by the accelerator and the total resource consumption.
Optionally, the penalty value of a single time slot is calculated according to a first formula. The first formula is: P_t = -exp(x_1·G_t - x_2·E_t), where P_t is the penalty value of time slot t; G_t is the server gain corresponding to the task amount processed by the server in time slot t plus the accelerator gain corresponding to the task amount processed by the accelerator; E_t is the total resource consumption of the tasks in time slot t; x_1 is the weight value corresponding to the gain sum; x_2 is the weight value corresponding to the total resource consumption; and exp denotes the exponential function with the constant e as its base.

Correspondingly, the reward value of a single time slot is calculated according to a second formula. The second formula is: H_t = exp(x_1·G_t - x_2·E_t), where H_t is the reward value of time slot t, and G_t, E_t, x_1, x_2 and exp are defined as in the first formula.
Optionally, the performing reinforcement learning training by using the discrete sample set to obtain a task allocation model includes:
and performing reinforcement learning training on the Q function in the Q network to be trained by using the discrete sample set so as to enable the Q function in the Q network to be trained to obtain an optimal Q value on the premise of following the constraint condition and the target of the preset rewarding function, thereby obtaining the task allocation model.
Optionally, the performing reinforcement learning training on the Q function in the Q network to be trained by using the discrete sample set, so as to obtain an optimal Q value of the Q function in the Q network to be trained on the premise of following the constraint condition and the target of the preset reward function, to obtain the task allocation model, where the reinforcement learning training includes:
Determining a sample extraction mode, and extracting a target sample group from the discrete sample set according to the sample extraction mode;
inputting task information and allocation strategies of the same time slot in the same sample data in the target sample group into the Q network to be trained, so that a Q function in the Q network to be trained outputs a training result according to constraint conditions and targets of the preset reward function;
inputting punishment information of the last time slot and task information of the next time slot in the same sample data in the target sample group into a target Q network, so that the target Q network outputs a target result according to constraint conditions and targets of the preset reward function;
calculating a loss according to the training result and the target result;
updating network parameters of the Q network to be trained by using the loss, and executing a sample extraction mode determination and subsequent steps to iteratively train the Q network to be trained;
and when the loss meets the expectation, determining the current Q network to be trained as the task allocation model.
Optionally, determining the allocation policy of the task information of each time slot according to a preset reward function includes:
and inputting the task information of each time slot into the current Q network to be trained so that the current Q network to be trained outputs the allocation strategy of each time slot.
Optionally, the determining a sample extraction mode, and extracting the target sample group from the discrete sample set according to the sample extraction mode includes:
generating a target random number by utilizing a random function;
if the target random number is larger than a preset threshold value, extracting the target sample group from the discrete sample set in a random sampling mode; otherwise, the target sample group is extracted from the discrete sample set in a selective sampling mode.
Optionally, the extracting the target sample group from the discrete sample set in a selective sampling manner includes:
and selecting sample data with the value of the punishment information larger than a fixed value from the discrete sample set as the target sample set.
Optionally, before the generating the target random number by using the random function, the method further includes:
detecting whether the current iteration number meets the adjustment requirement of the preset threshold value;
and if so, adjusting the preset threshold according to a preset strategy.
Optionally, the method further comprises:
detecting whether the current iteration times reach a network parameter synchronization condition;
and if the network parameter synchronization condition is met, synchronizing the network parameters of the current Q network to be trained to the target Q network.
Optionally, the calculating the loss according to the training result and the target result includes:
calculating the loss according to a target formula; the target formula is: L = (1/N)·Σ_{i=1}^{N} (Q1_i - Q2_i)^2, where L denotes the loss, Q1_i denotes the target result for the i-th sample, Q2_i denotes the training result for the i-th sample, N is the number of sample data in the target sample group, and i = 1, 2, 3, …, N.
Optionally, the method further comprises:
and inputting task information generated by the plurality of edge devices in a single time slot into the task allocation model, so that the task allocation model determines the execution end of each task and the processor resources occupied by each task at the corresponding execution end in the current time slot according to the constraint condition and the target of the preset rewarding function, and an optimal allocation strategy conforming to the target of the preset rewarding function is obtained.
In a second aspect, the present invention provides a task allocation method, including:
receiving task information generated by a plurality of edge devices in an edge computing system in a single time slot;
inputting the task information into a task distribution model so that the task distribution model outputs an optimal distribution strategy which accords with the target of a preset rewarding function; the task allocation model is obtained by training according to any one of the methods;
And distributing the task information to a server in the edge computing system, an accelerator in the edge computing system or local edge equipment according to the optimal distribution strategy.
In a third aspect, the present invention provides a model training apparatus comprising:
the acquisition module is used for acquiring task information generated by a plurality of edge devices in the edge computing system in each time slot;
the first determining module is used for determining the allocation strategy of the task information of each time slot according to a preset rewarding function or in a random mode; the allocation policy is for: each task generated by the plurality of edge devices in a single time slot is distributed to a server in the edge computing system, an accelerator in the edge computing system or a local edge device; the preset reward function aims at minimizing the total resource consumption of the edge computing system in a single time slot, maximizing the task amount processed by the server and maximizing the task amount processed by the accelerator;
the second determining module is used for determining punishment information corresponding to the allocation strategy of each time slot;
the construction module is used for constructing the task information, the allocation strategy and the punishment information of the same target time slot and the task information of the next time slot of the target time slot into sample data, and filling the sample data into a discrete sample set;
The training module is used for performing reinforcement learning training by utilizing the discrete sample set to obtain a task allocation model, and the task allocation model is used for: and determining an optimal allocation strategy for task information generated by the plurality of edge devices in a single time slot.
In a fourth aspect, the present invention provides a task allocation device, including:
the receiving module is used for receiving task information generated by a plurality of edge devices in the edge computing system in a single time slot;
the processing module is used for inputting the task information into a task distribution model so that the task distribution model outputs an optimal distribution strategy which accords with the target of a preset rewarding function; the task allocation model is obtained by training according to any one of the methods;
and the allocation module is used for allocating the task information to a server in the edge computing system, an accelerator in the edge computing system or local edge equipment according to the optimal allocation strategy.
In a fifth aspect, the present invention provides an electronic device, comprising:
a memory for storing a computer program;
and the processor is used for executing the computer program to realize the model training method and the task allocation method disclosed by the prior art.
In a sixth aspect, the present invention provides a readable storage medium storing a computer program, where the computer program when executed by a processor implements the foregoing disclosed model training method and task allocation method.
In a seventh aspect, the present invention provides a system comprising: the system comprises a control center, a server, an accelerator and a plurality of edge devices;
the plurality of edge devices are used for generating task information in a single time slot;
the control center is used for inputting the task information into a task distribution model so that the task distribution model outputs an optimal distribution strategy which accords with a target of a preset reward function; distributing the task information to the server, the accelerator or the local edge equipment according to the optimal distribution strategy; the task allocation model is trained and obtained according to any one of the above methods.
According to the above scheme, the beneficial effects of the invention are as follows. An allocation strategy is determined for the tasks of a single time slot (time period) with the objectives of minimizing the total resource consumption of the edge computing system in that slot, maximizing the task amount processed by the server and maximizing the task amount processed by the accelerator; or the allocation strategy for each task in a single time slot is determined in a random manner. The reward-and-penalty information corresponding to the allocation strategy of each time slot is then determined (this information describes how good or bad the corresponding allocation strategy is). The task information, allocation strategy and reward-and-penalty information of the same target time slot, together with the task information of the time slot following the target time slot, are constructed into sample data and filled into a discrete sample set, so that the sample data in the discrete sample set are mutually independent and include both sample data that meet the objective of the preset reward function and random sample data that do not, which guarantees both the convergence rate of training and sample independence. Finally, reinforcement learning training is performed with the discrete sample set, which prevents the training process from settling into a local optimum, and the task allocation model finally obtained can determine, for the task information generated by a plurality of edge devices in a single time slot, an optimal allocation strategy that minimizes the total resource consumption, maximizes the task amount processed by the server and maximizes the task amount processed by the accelerator (i.e., meets the objective of the preset reward function). Under this optimal allocation strategy the task producer obtains higher cost-effectiveness at the minimum total resource consumption and the task executor obtains the maximum benefit at the maximum task amount, so that the resource consumption and the task amount borne by the task execution end are balanced.
Correspondingly, the model training and task allocation device, equipment, medium and system provided by the invention also have the technical effects.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings required for the description of the embodiments or the prior art are briefly introduced below. It is apparent that the following drawings show only embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flow chart of a model training method disclosed by the invention;
FIG. 2 is a schematic diagram of an edge computing system according to the present disclosure;
FIG. 3 is a flow chart of a task allocation method of the present disclosure;
FIG. 4 is a schematic diagram of a model training process according to the present disclosure;
fig. 5 is a schematic diagram of a Q network structure according to the present disclosure;
FIG. 6 is a schematic diagram of a model training apparatus according to the present disclosure;
FIG. 7 is a schematic diagram of a task assigning apparatus according to the present disclosure;
FIG. 8 is a diagram of a server according to the present invention;
fig. 9 is a diagram of a terminal structure according to the present invention;
Fig. 10 is a schematic diagram of a system according to the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
At present, when edge computing tasks are allocated, the allocation strategy is generally determined by considering only the computing resources of the task execution end, such as processor performance and memory size. An allocation strategy determined in this way can complete the tasks, but it consumes more resources in the system, the tasks borne by different task execution ends cannot reach the maximum benefit, and the cost-effectiveness of the allocation strategy is low. The invention therefore provides a model training and task allocation scheme that balances the resource consumption against the task amount borne by the task execution end to formulate an optimal task allocation strategy, according to which each task generated by a plurality of edge devices in a single time slot is allocated to a server in the edge computing system, an accelerator in the edge computing system or the local edge device, so that the task producer obtains higher cost-effectiveness at the minimum total resource consumption and the task executor obtains the maximum benefit at the maximum task amount. The minimum total resource consumption means that less computing resources are consumed and less money is paid.
Referring to fig. 1, the embodiment of the invention discloses a model training method, which comprises the following steps:
s101, task information generated by a plurality of edge devices in an edge computing system in each time slot is obtained.
In this embodiment, the edge computing system includes: a server, an accelerator, and a plurality of edge devices. These edge devices generate corresponding computation tasks in each time slot (time period); a single edge device can generate one or more tasks in a single time slot, and may of course also generate no task in a given time slot.
Referring to fig. 2, an edge computing system includes a server, an FPGA (Field-Programmable Gate Array) accelerator, and a plurality of mobile terminals; the tasks generated by each mobile terminal are arranged in a task queue. The mobile terminals are the edge devices. The server and the FPGA accelerator have powerful computing capabilities and can provide high-intensity computing services for the different mobile terminals within their range. A mobile terminal can choose to compute its own tasks locally, offload them to a server containing a CPU (Central Processing Unit) for processing, or offload them to the FPGA accelerator for processing. Within a single time slot, the mobile terminal sends an offloading request to the CPU server through the base station, or the CPU server further offloads the task to the FPGA accelerator for processing.
S102, determining the allocation strategy of the task information of each time slot according to a preset reward function or in a random mode.
The tasks generated by each edge device may be executed at the local edge device, at the server or at the accelerator, and a single task is indivisible. Accordingly, the allocation policy is used to allocate each task generated by the plurality of edge devices in a single time slot to a server in the edge computing system, an accelerator in the edge computing system or the local edge device.
The preset reward function provided in this embodiment takes as its objectives that, in a single time slot, the total resource consumption of the edge computing system is minimum, the task amount processed by the server is maximum and the task amount processed by the accelerator is maximum, so that the task producer (i.e., the edge device) obtains higher cost-effectiveness at the minimum total resource consumption and the task executor (i.e., the server or the accelerator) obtains the maximum benefit at the maximum task amount.
Specifically, in each time slot it can be randomly decided whether to determine the allocation policy according to the preset reward function or in a random manner: a random number is generated and compared with a preset value; when the random number is not larger than the preset value, the allocation policy is determined according to the preset reward function, and when the random number is larger than the preset value, the allocation policy is determined in a random manner.
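A minimal sketch of this random selection is given below; the function names, the callables and the default preset value are assumptions made only for illustration.

```python
import random

def choose_allocation_policy(task_info, reward_policy, random_policy, preset_value=0.5):
    """Decide how the allocation strategy of one time slot is determined:
    if the generated random number is not larger than the preset value, the
    strategy follows the preset reward function (reward_policy); otherwise it
    is chosen in a random manner (random_policy). Illustrative sketch."""
    if random.random() <= preset_value:
        return reward_policy(task_info)
    return random_policy(task_info)
```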
S103, determining punishment information corresponding to the allocation strategy of each time slot.
The reward-and-penalty information describes how good or bad the corresponding allocation strategy is: an allocation strategy that is closer to the objective of the preset reward function is better than one that is farther from it. The objective of the preset reward function is that, in a single time slot, the total resource consumption of the edge computing system is minimum, the task amount processed by the server is maximum and the task amount processed by the accelerator is maximum.
In one embodiment, determining punishment information corresponding to an allocation policy for each time slot includes: determining an execution end of each task under each time slot according to an allocation strategy of each time slot, and calculating the resource consumption of each task at the corresponding execution end; the execution end is a server, an accelerator or local edge equipment; if the resource consumption of any task exceeds the preset execution condition, calculating the punishment value of the time slot; if the resource consumption of any task does not exceed the preset execution condition, calculating the rewarding value of the time slot; and taking the punishment value or the rewarding value as punishment information of the corresponding task. Wherein, the resource consumption of the task exceeds the preset execution condition comprises: the actual latency of a task exceeds the maximum allowable latency of the task, the actual cost of the task exceeds the maximum cost budget of the task, and/or the processor resources that the task needs to consume exceed the free processor resources of the executing end executing the task.
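The condition check described above can be sketched as follows; the per-task attribute names and the helper callables are assumptions, and the slot-level value itself is given by the formulas further below.

```python
def reward_or_penalty_for_slot(tasks, allocation, consumption_fn, free_fn, slot_value):
    """Return the reward-and-penalty information of one time slot: the slot
    value is returned as a penalty (negative) if any task violates a preset
    execution condition, otherwise as a reward (positive). Illustrative sketch:
    consumption_fn(task, target) yields delay/fee/processor demand of a task at
    its execution end, free_fn(target) yields the free processor resources there."""
    for task, target in zip(tasks, allocation):
        cost = consumption_fn(task, target)
        if (cost.delay > task.max_delay              # actual latency exceeds maximum allowed latency
                or cost.fee > task.max_budget        # actual cost exceeds maximum cost budget
                or cost.cpu_demand > free_fn(target)):  # demanded resources exceed free resources
            return -slot_value                       # penalty value of the time slot
    return slot_value                                # reward value of the time slot
```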
In order to make the allocation strategies used to construct sample data closer to the objective of the preset reward function, the task information of a single time slot can be processed directly by the Q network currently being trained, so that it outputs a corresponding allocation strategy; such an allocation strategy is equivalent to one determined according to the preset reward function and can meet its objective. In one embodiment, determining the allocation policy of the task information of each time slot according to the preset reward function includes: inputting the task information of each time slot into the current Q network to be trained, so that the current Q network to be trained outputs the allocation strategy of each time slot.
It should be noted that an allocation policy determined in a random manner may violate the preset execution conditions, in which case the penalty value is calculated accordingly, whereas an allocation policy determined according to the preset reward function generally does not violate them, in which case the reward value is calculated accordingly, because the constraints are followed when the allocation policy is determined according to the preset reward function. The constraint conditions C1 to C8 of the preset reward function are as follows:

C1: when a task d is assigned to the server, the accelerator or the local edge device, its actual latency does not exceed the maximum allowed latency of the task.

C2: when a task d is assigned to the server, the accelerator or the local edge device, the actual cost paid does not exceed the maximum cost budget of the task.

C3: each task d is assigned to exactly one of the server, the accelerator and the local edge device; an indicator variable marks whether the task is executed at the local edge device, offloaded to the server for execution, or offloaded to the accelerator for execution.

C4: each indicator variable takes the value 0 or 1.

C5: when a task d is allocated to the accelerator, the processor resources of the accelerator that it consumes do not exceed the maximum resources of the accelerator itself.

C6: when a task d is allocated to the server, the processor resources of the server that it consumes do not exceed the maximum resources of the server itself.

C7: when a task d is allocated to the server, the processor resources of the server that it consumes are greater than 0.

C8: when a task d is allocated to the accelerator, the processor resources of the accelerator that it consumes are greater than 0.
In one embodiment, the calculation of the penalty value or reward value includes: calculating, for each time slot, the task amount processed by the server and the task amount processed by the accelerator in the current time slot; calculating the total resource consumption of the tasks in the current time slot; and obtaining the penalty value or reward value of the current time slot from the task amount processed by the server, the task amount processed by the accelerator and the total resource consumption in the current time slot.
In one embodiment, the penalty value of a single time slot is calculated according to a first formula. The first formula is: P_t = -exp(x_1·G_t - x_2·E_t), where P_t is the penalty value of time slot t; G_t is the server gain corresponding to the task amount processed by the server in time slot t plus the accelerator gain corresponding to the task amount processed by the accelerator; E_t is the total resource consumption of the tasks in time slot t; x_1 is the weight value corresponding to the gain sum; x_2 is the weight value corresponding to the total resource consumption; and exp denotes the exponential function with the constant e as its base.

Correspondingly, the reward value of a single time slot is calculated according to a second formula. The second formula is: H_t = exp(x_1·G_t - x_2·E_t), where H_t is the reward value of time slot t, and the remaining symbols are defined as in the first formula.

It can be seen that the penalty value and the reward value are calculated in the same way, except that the penalty takes a negative value and the reward takes a positive value. Correspondingly, the preset reward function can be written as R_t = H_t = -P_t.
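Under the reconstruction of the first and second formulas given above (which is itself an assumption about the exact functional form), the slot-level values can be computed as in this sketch:

```python
import math

def slot_reward(gain_sum, resource_total, x1, x2):
    """H_t: reward value of a time slot, computed from the gain sum G_t
    (server gain plus accelerator gain), the total resource consumption E_t
    and the two weight values x1 and x2 (sketch of the assumed second formula)."""
    return math.exp(x1 * gain_sum - x2 * resource_total)

def slot_penalty(gain_sum, resource_total, x1, x2):
    """P_t = -H_t: penalty value of a time slot (assumed first formula)."""
    return -slot_reward(gain_sum, resource_total, x1, x2)
```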
in one embodiment, calculating the resource consumption of each task at the corresponding execution end includes: according to the data processing amount of each task and the terminal characteristic information of the corresponding execution terminal, calculating the time delay, energy consumption and cost of the corresponding task at the corresponding execution terminal; the resource consumption is determined comprehensively based on time delay, energy consumption and cost. Therefore, the resource consumption is a value comprehensively determined based on the time delay, the energy consumption and the cost, and in order to collect the data of the time delay, the energy consumption and the cost in three different dimensions, the normalization processing can be performed on the time delay, the energy consumption and the cost respectively, and the normalization processing is performed by adopting the log mode normalization method in the embodiment.
S104, constructing task information, allocation strategy and punishment information of the same target time slot and task information of the next time slot of the target time slot into sample data, and filling the sample data into a discrete sample set.
In one example, the task information of time slot t is S_t, the allocation strategy of time slot t is A_t, the reward-and-penalty information of time slot t is the penalty value P_t (equal to -R_t), the task information of time slot t+1 is S_{t+1}, the allocation strategy of time slot t+1 is A_{t+1}, and the reward-and-penalty information of time slot t+1 is the reward value H_{t+1} (equal to R_{t+1}). One sample data then includes S_t, A_t, P_t and S_{t+1}; accordingly, another sample data may include S_{t+1}, A_{t+1}, H_{t+1} and S_{t+2}. The task information (state) S_t of time slot t can be expressed as a matrix with one row per task: if D tasks are generated in time slot t, the first row holds the data size, the maximum allowable delay and the maximum cost budget of task 1, the second row holds the data size, the maximum allowable delay and the maximum cost budget of task 2, the third row holds those of task 3, and so on.

The allocation strategy (action) A_t of time slot t can likewise be expressed as a matrix with one row per task.

In A_t, the first three elements of each row correspond to the local edge device, the server and the accelerator respectively; the element whose value is 1 indicates that the task of that row is executed by the local edge device, the server or the accelerator, so exactly one of the first three elements of each row is 1 and the other two are 0. Correspondingly, one of the last two elements of each row takes the value 0: for example, the fourth element of the first row represents the processor resources consumed when task 1 of time slot t is executed by the server, and the fifth element represents the accelerator resources consumed when task 1 of time slot t is executed by the accelerator.
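Following the row layout just described, the state and action matrices of a single time slot could be assembled as below; NumPy and the attribute and field names are used purely for illustration.

```python
import numpy as np

def build_state(tasks):
    """S_t: one row per task with [data size, maximum allowable delay, maximum cost budget]."""
    return np.array([[t.data_size, t.max_delay, t.max_budget] for t in tasks], dtype=float)

def build_action(decisions):
    """A_t: one row per task with the one-hot flags [local, server, accelerator]
    followed by the server processor resources and the accelerator resources
    allocated to that task (zero where the corresponding end is not used)."""
    rows = []
    for d in decisions:
        one_hot = [int(d.target == "local"),
                   int(d.target == "server"),
                   int(d.target == "accelerator")]
        rows.append(one_hot + [d.server_cpu, d.fpga_resource])
    return np.array(rows, dtype=float)
```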
S105, performing reinforcement learning training by using the discrete sample set to obtain a task allocation model.
The task allocation model is used to determine an optimal allocation policy for the task information generated by the plurality of edge devices in a single time slot. In one embodiment, the task information generated by the plurality of edge devices in a single time slot is input into the task allocation model, so that the task allocation model determines, according to the constraint conditions and the objective of the preset reward function, the execution end of every task in the current time slot and the processor resources occupied by every task at its corresponding execution end, thereby obtaining an optimal allocation strategy that meets the objective of the preset reward function. That is, the optimal allocation strategy determines the execution end of every task in a time slot and the processor resources each task occupies at its execution end; the total resource consumption of the tasks in the time slot is minimized while the task amount borne by the server and the task amount borne by the accelerator are maximized, so that each edge device obtains higher cost-effectiveness at the minimum total resource consumption, the server and the accelerator obtain the maximum benefit at the maximum task amount, and a relative balance between them is achieved.
In one embodiment, calculating the time delay, energy consumption and cost of each task at its corresponding execution end according to the data processing amount of each task and the end characteristic information of the corresponding execution end (such as processor performance data and processor process parameters) includes the following three cases.

For each task, if the execution end of the current task is the local edge device: the local time delay is calculated from the data processing amount of the current task and the processor performance data of the local edge device; the local energy consumption is calculated from the data processing amount of the current task, the processor performance data of the local edge device, the processor process parameters of the local edge device and the processor resources the local edge device must consume per bit processed; and the cost spent by the current task at the local edge device is set to zero.

For each task, if the execution end of the current task is the server: the server time delay is calculated from the data processing amount of the current task, the rate at which the server receives the current task, the processor resources the server must consume per bit processed and the processor resources consumed by the current task at the server; the server energy consumption is calculated from the data processing amount of the current task, the rate at which the server receives the current task and the upload rate of the current task; and the cost consumed by the current task at the server is calculated from the data processing amount of the current task, the processor resources the server must consume per bit processed and the unit price of the resources.

For each task, if the execution end of the current task is the accelerator: the accelerator time delay is calculated from the data processing amount of the current task, the rate at which the accelerator receives the current task, the processor resources the accelerator must consume per bit processed and the processor resources consumed by the current task at the accelerator; the accelerator energy consumption is calculated from the data processing amount of the current task, the rate at which the accelerator receives the current task and the upload rate of the current task; and the cost consumed by the current task at the accelerator is calculated from the data processing amount of the current task, the processor resources the accelerator must consume per bit processed and the unit price of the resources. The process can refer to formulas (2) to (11) below.
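A minimal sketch of these three cases, using the symbols of the reconstructed formulas (2) to (11); all parameter names are assumptions, and the normalization and weighting into a single overhead value is omitted here.

```python
def local_cost(data_bits, f_local, kappa, c_cpu):
    """Local execution: delay = D*c/f, energy = kappa*D*c*f^2, fee = 0 (sketch)."""
    delay = data_bits * c_cpu / f_local
    energy = kappa * data_bits * c_cpu * f_local ** 2
    return delay, energy, 0.0

def server_cost(data_bits, rate, p_upload, f_server, c_cpu, price):
    """CPU server: upload delay plus processing delay, upload energy, purchased-resource fee."""
    delay = data_bits / rate + data_bits * c_cpu / f_server
    energy = p_upload * data_bits / rate
    fee = price * data_bits * c_cpu
    return delay, energy, fee

def accelerator_cost(data_bits, rate, p_upload, f_fpga, c_fpga, price):
    """FPGA accelerator: same structure as the server case, with FPGA resources per bit."""
    delay = data_bits / rate + data_bits * c_fpga / f_fpga
    energy = p_upload * data_bits / rate
    fee = price * data_bits * c_fpga
    return delay, energy, fee
```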
In one embodiment, performing reinforcement learning training with the discrete sample set to obtain the task allocation model includes: performing reinforcement learning training on the Q function in the Q network to be trained with the discrete sample set, so that the Q function in the Q network to be trained obtains the optimal Q value while following the constraint conditions and the objective of the preset reward function, thereby obtaining the task allocation model. In other words, the aim of training the Q function in the Q network to be trained with the discrete sample set is that its optimal Q value follows the constraint conditions and the objective of the preset reward function, so that the Q network to be trained finally becomes the task allocation model.
In one embodiment, the performing reinforcement learning training on the Q function in the Q network to be trained by using the discrete sample set, so as to obtain an optimal Q value of the Q function in the Q network to be trained on the premise of following the constraint condition and the target of the preset reward function, to obtain a task allocation model, including: determining a sample extraction mode, and extracting a target sample group from a discrete sample set according to the sample extraction mode; inputting task information and allocation strategies of the same time slot in the same sample data in the target sample group into a Q network to be trained, so that a Q function in the Q network to be trained outputs a training result according to constraint conditions and targets of a preset reward function; inputting punishment information of the last time slot and task information of the next time slot in the same sample data in the target sample group into a target Q network, so that the target Q network outputs a target result according to constraint conditions and targets of a preset reward function; calculating loss according to the training result and the target result; updating network parameters of the Q network to be trained by using the loss, and executing the steps of determining a sample extraction mode and the follow-up steps so as to carry out iterative training on the Q network to be trained; and when the loss meets the expectation, determining the current Q network to be trained as a task allocation model.
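A compact sketch of one training iteration of the loop described above, written with PyTorch-style objects; the network interfaces, the discount factor and the way the target result is formed from the reward and the next-slot state follow the standard deep-Q-learning pattern and are assumptions rather than the exact claimed procedure.

```python
import torch
import torch.nn.functional as F

def train_step(q_net, target_q_net, optimizer, sample_set, batch_size, gamma=0.99):
    """One iteration: extract a target sample group, compute the training result
    with the Q network to be trained, compute the target result with the target
    Q network, form the loss and update the network parameters (sketch)."""
    batch = sample_set.sample(batch_size)                       # target sample group
    states = torch.stack([b.state for b in batch])              # task info of slot t
    actions = torch.stack([b.action for b in batch])            # allocation strategy of slot t
    rewards = torch.tensor([b.reward for b in batch])           # reward-and-penalty info of slot t
    next_states = torch.stack([b.next_state for b in batch])    # task info of slot t+1

    q_training = q_net(states, actions)                         # training result
    with torch.no_grad():                                       # target result from the target Q network
        q_target = rewards + gamma * target_q_net(next_states).max(dim=1).values

    loss = F.mse_loss(q_training, q_target)                     # the target formula (mean squared error)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After a fixed number of iterations the parameters of q_net can be copied into target_q_net, matching the network parameter synchronization condition mentioned below.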
In one embodiment, determining a sample extraction method and extracting a target sample group from a discrete sample set according to the sample extraction method includes: generating a target random number by utilizing a random function; if the target random number is greater than a preset threshold value, extracting a target sample group from the discrete sample set in a random sampling mode; otherwise, the target sample group is extracted from the discrete sample set in a selective sampling manner.
In one embodiment, extracting a set of target samples from a set of discrete samples in a selective sampling manner includes: sample data with a value of punishment information larger than a fixed value is selected from the discrete sample set to be used as a target sample set so as to extract high-quality sample data.
In one embodiment, before generating the target random number with the random function, the method further includes: detecting whether the current iteration number meets the adjustment requirement of the preset threshold; and if it does, adjusting the preset threshold according to a preset strategy, so that sample data whose reward-and-penalty value is larger can be extracted with a larger probability next time.
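The extraction procedure of the last three paragraphs might look like the sketch below; the threshold schedule, the fixed value and the direction in which the threshold is adjusted are assumptions.

```python
import random

def extract_target_sample_group(samples, batch_size, threshold, fixed_value):
    """Draw the target sample group from the discrete sample set: random
    sampling when the target random number exceeds the preset threshold,
    otherwise selective sampling of samples whose reward-and-penalty value
    exceeds a fixed value (illustrative sketch; samples is a list)."""
    if random.random() > threshold:
        return random.sample(samples, batch_size)
    high_value = [s for s in samples if s.reward > fixed_value]
    return random.sample(high_value, min(batch_size, len(high_value)))

def maybe_raise_threshold(threshold, iteration, adjust_every=100, step=0.05):
    """Periodically raise the preset threshold so that selective sampling of
    high-value samples becomes more likely in later iterations (assumed schedule)."""
    return min(1.0, threshold + step) if iteration % adjust_every == 0 else threshold
```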
In one embodiment, the method further comprises: detecting whether the current iteration times reach a network parameter synchronization condition; and if the network parameter synchronization condition is met, synchronizing the network parameters of the current Q network to be trained to the target Q network.
In one embodiment, calculating the loss from the training results and the target results includes: calculating the loss according to a target formula. The target formula is: L = (1/N)·Σ_{i=1}^{N} (Q1_i - Q2_i)^2, where L denotes the loss, Q1_i denotes the target result for the i-th sample, Q2_i denotes the training result for the i-th sample, N is the number of sample data in the target sample group, and i = 1, 2, 3, …, N.
It can be seen that, in this embodiment, the allocation strategy for the tasks of a single time slot is determined with the objectives that the total resource consumption of the edge computing system in that time slot (time period) is minimum, the task amount processed by the server is maximum and the task amount processed by the accelerator is maximum, or is determined in a random manner; the reward-and-penalty information corresponding to the allocation strategy of each time slot is then determined, a discrete sample set whose sample data are mutually independent is constructed, and reinforcement learning training is finally performed with the discrete sample set, which prevents the training process from settling into a local optimum. The final task allocation model can determine, for the task information generated by a plurality of edge devices in a single time slot, an optimal allocation strategy whose objectives are the minimum total resource consumption, the maximum task amount processed by the server and the maximum task amount processed by the accelerator (i.e., the objective of the preset reward function). Under this optimal allocation strategy the task producer obtains higher cost-effectiveness at the minimum total resource consumption and the task executor obtains the maximum benefit at the maximum task amount, so that the resource consumption and the task amount borne by the task execution end are balanced.
The following describes a task allocation method provided in the embodiments of the present invention, and the task allocation method described below and other embodiments described herein may be referred to with each other.
Referring to fig. 3, the embodiment of the invention discloses a task allocation method, which comprises the following steps:
s301, receiving task information generated by a plurality of edge devices in an edge computing system in a single time slot.
S302, inputting the task information into a task distribution model so that the task distribution model outputs an optimal distribution strategy which meets the target of a preset rewarding function.
The task allocation model in the embodiment is obtained by training according to the model training method provided by the invention.
S303, distributing the task information to a server in the edge computing system, an accelerator in the edge computing system or local edge equipment according to the optimal distribution strategy.
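Steps S301 to S303 can be wired together as in the following sketch; the model's predict interface and the decision fields are assumptions.

```python
def allocate_slot(task_infos, allocation_model):
    """Task allocation for one time slot: feed the received task information
    into the trained task allocation model (S302) and group every task by the
    execution end chosen by the optimal allocation strategy (S303). Sketch."""
    strategy = allocation_model.predict(task_infos)
    local, to_server, to_accelerator = [], [], []
    for task, decision in zip(task_infos, strategy):
        if decision.target == "server":
            to_server.append((task, decision.server_cpu))
        elif decision.target == "accelerator":
            to_accelerator.append((task, decision.fpga_resource))
        else:
            local.append(task)
    return local, to_server, to_accelerator
```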
For more specific working procedures in this embodiment, reference may be made to the corresponding contents disclosed in other embodiments, and detailed description thereof will be omitted herein.
It can be seen that performing task allocation according to this embodiment enables each edge device to obtain higher cost-effectiveness at the minimum total resource consumption and enables the server and the accelerator to obtain the maximum benefit at the maximum task amount, thereby balancing the resource consumption against the task amount borne by the task execution end.
Referring to the edge computing system shown in fig. 2, assume the system contains D users, each user corresponding to one mobile terminal, so that there are D mobile terminals in total; the set of users is denoted U = {1, 2, …, d, …, D}. Time is uniformly discretized into T time slots, T = {1, 2, …, t, …, T}, each time slot having a duration Δt. Each mobile terminal randomly generates a computation task in every time slot, so the set of tasks generated by all mobile terminals in time slot t is denoted W^t = {w_1^t, …, w_d^t, …, w_D^t}, where the attribute information of the task generated by mobile terminal d in time slot t is w_d^t = (D_d^t, T_d^max, B_d^max, c^acc, c^cpu). Within the same time slot, user d, mobile terminal d and task d correspond to one another.

Here D_d^t denotes the size of the task data that needs to be processed, in bits; T_d^max denotes the maximum time delay acceptable for processing the task, in seconds; B_d^max denotes the maximum budget of the task for purchasing computing resources when the task is offloaded to the CPU server; c^acc (the processor resource the accelerator must consume per bit processed) is the FPGA computing resource needed to process a unit of task data, in floating-point operations of the FPGA accelerator card per bit; and c^cpu (the processor resource that must be consumed per bit processed) denotes the CPU computing resource needed to process a unit of task data, in CPU cycles per bit. In this scenario, the duration Δt of every time slot is not lower than the maximum delay tolerated by any user, which ensures that the user tasks within each slot can be handled.

In addition, the scenario contains 1 CPU server and 1 FPGA accelerator, which provide high-intensity, parallel computation; the maximum computing resource the CPU server can provide is F^cpu, and the maximum computing resource the FPGA accelerator can provide is F^acc.
An offloading decision variable x_d^t is introduced to represent the offloading decision (i.e., the allocation policy) for the task generated by terminal d in the t-th time slot, and it satisfies x_d^t ∈ {0, 1, 2} (1), where x_d^t = 0 indicates that the computation is performed locally, x_d^t = 1 indicates offloading to the CPU server for processing, and x_d^t = 2 indicates offloading to the FPGA accelerator for processing.
(1) Local computation.

In the t-th time slot, when task w_d^t is computed locally, the computing capability of the local CPU is f_d^loc (i.e., the processor performance data of the local edge device), which represents the frequency of the CPU, i.e., the number of CPU cycles per second. The delay of computing task w_d^t locally is: T_d^{loc,t} = D_d^t · c^cpu / f_d^loc (2).

The energy consumption overhead incurred when performing the computation locally is expressed as: E_d^{loc,t} = κ · (f_d^loc)^2 · D_d^t · c^cpu (3), where κ, the processor process parameter (effective energy factor) of the local edge device, depends on the chip fabrication process of the local CPU.

Task w_d^t incurs no cost when computed locally, so the total overhead generated by the task, Φ_d^{loc,t}, is expressed as: Φ_d^{loc,t} = α_d·Norm(E_d^{loc,t}) + β_d·Norm(T_d^{loc,t}) + γ_d·Norm(C_d^{loc,t}), with C_d^{loc,t} = 0 (4). In this formula a logarithmic normalization method Norm(·) is adopted to normalize the three quantities of different dimensions (energy consumption, time delay and cost) and to map the results into [0, 1], so that the different indicators of edge task offloading are evaluated more comprehensively and accurately. Here α_d, β_d and γ_d respectively represent the degree to which mobile terminal d emphasizes the three indicators of energy consumption, time delay and cost.
(2) Calculated at the CPU server.
The mobile terminal offloads the calculation task to the CPU server through the wireless channel, and then the channel transmission rate is expressed asI.e. the rate at which the server receives the task, the channel transmission rate is mainly related to the radio channel gain, the upload power of the mobile terminal device, the distance of the mobile terminal device from the edge service, etc. The energy consumption generated by the mobile terminal uploading the calculation task to the CPU server is expressed as follows: />(5),/>Is the uploading power of the mobile terminal (i.e. the uploading rate of the current task).
When the mobile terminal offloads the computing task to the CPU server for computation, the resulting delay mainly consists of two parts: the task upload delay and the task processing delay. The total delay of CPU-based edge computation is expressed as follows: $T_i^{c} = \frac{d_i^t}{r_i^t} + \frac{d_i^t c_c}{f_i^{c}}$ (6), where $f_i^{c}$ denotes the computing resource allocated by the CPU server to task $i$ (i.e., the processor resources consumed by the current task at the server). The maximum computing resource that the CPU server can provide is $F_c^{max}$.
The unit price charged by the CPU server per unit of computing resource per unit time is $\rho_c$; the cost overhead paid by mobile terminal $i$ for computation at the CPU server is expressed as follows: $C_i^{c} = \rho_c\, d_i^t c_c$ (7).
Then the total overhead of the task of mobile terminal $i$ when offloaded to the CPU server for computation is $U_i^{c}$, expressed as follows: $U_i^{c} = \lambda_i^E \log\!\big(1 + E_i^{c}\big) + \lambda_i^T \log\!\big(1 + T_i^{c}\big) + \lambda_i^C \log\!\big(1 + C_i^{c}\big)$ (8).
(3) Calculation is performed at the FPGA accelerator.
When the mobile terminal chooses to offload the computing task to the FPGA accelerator, the main flow is that the mobile terminal uploads the computing task to the CPU server, and the CPU server then transfers the task to the FPGA accelerator for processing. If the FPGA accelerator is deployed very close to the CPU server, the latency and energy consumption overhead incurred by data transfer between the CPU server and the FPGA accelerator are negligible. Accordingly, when the mobile terminal offloads the task to the FPGA accelerator, the energy consumption overhead generated is $E_i^{f} = p_i \frac{d_i^t}{r_i^t}$, the same as for uploading to the CPU server. The resulting delay mainly consists of two parts:
the task upload delay and the processing delay of the task on the FPGA accelerator. The total delay overhead is expressed as follows: $T_i^{f} = \frac{d_i^t}{r_i^t} + \frac{d_i^t c_f}{f_i^{f}}$ (9), where $f_i^{f}$ is the computing resource allocated by the FPGA accelerator to the user.
The unit price charged by the FPGA accelerator per unit of computing resource per unit time is $\rho_f$; the cost overhead paid by mobile terminal $i$ at the FPGA accelerator is expressed as follows: $C_i^{f} = \rho_f\, d_i^t c_f$ (10).
Similar to offloading at the CPU server, the total overhead generated by mobile terminal $i$ is then $U_i^{f}$, expressed as follows: $U_i^{f} = \lambda_i^E \log\!\big(1 + E_i^{f}\big) + \lambda_i^T \log\!\big(1 + T_i^{f}\big) + \lambda_i^C \log\!\big(1 + C_i^{f}\big)$ (11). The maximum computing resource that the FPGA accelerator can provide is $F_f^{max}$.
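In one illustrative, non-limiting example, the offloading overhead of equations (5)-(11) for either the CPU server or the FPGA accelerator may be sketched in Python as follows; the function name offload_overhead and its parameters are assumptions introduced only for illustration.

import math

def offload_overhead(data_bits, resource_per_bit, rate_bps, upload_power,
                     allocated_resource, unit_price,
                     w_energy, w_delay, w_cost):
    # Sketch of the CPU-server / FPGA-accelerator offloading overhead.
    upload_delay = data_bits / rate_bps                      # task upload delay
    work = data_bits * resource_per_bit                      # cycles (CPU) or FLOPs (FPGA)
    delay = upload_delay + work / allocated_resource         # eq. (6) / eq. (9)
    energy = upload_power * upload_delay                     # eq. (5): upload energy
    cost = unit_price * work                                 # eq. (7) / eq. (10)
    return (w_energy * math.log(1 + energy)                  # eq. (8) / eq. (11)
            + w_delay * math.log(1 + delay)
            + w_cost * math.log(1 + cost))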
With reference to the above parameter definitions and calculation formulas, this embodiment constructs the following solving target: $\min \sum_{t}\sum_{i} U_i^t$ (12), where $U_i^t$ is $U_i^{loc}$, $U_i^{c}$ or $U_i^{f}$ according to the offloading decision $x_i^t$, so that the overhead on the terminal side is minimal; equation (12) is required to follow constraints C1-C8.
On this basis, the following solving target is constructed simultaneously: $\max \sum_{t}\sum_{i} \big(C_i^{c} + C_i^{f}\big)$ (13), where the cost terms are counted only for tasks actually offloaded to the corresponding device, so that the benefit of the service provider (server and accelerator) is maximized; equation (13) is to follow constraints C2-C8. By solving the above two targets, the agent can make an immediate decision for the current computing task according to the current state, realizing the minimum user cost on the one hand and the maximum service-provider benefit on the other.
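In one illustrative, non-limiting example, the user-cost and provider-benefit terms of objectives (12) and (13) for a single time slot may be accumulated from the offloading decisions as sketched below in Python; the dictionary keys and the function name slot_objectives are assumptions introduced only for illustration.

def slot_objectives(tasks):
    # tasks: list of dicts, one per mobile terminal, with keys
    #   'decision' (0 = local, 1 = CPU server, 2 = FPGA accelerator),
    #   'overhead' (the U value of the chosen execution end), and
    #   'payment'  (the cost paid to the provider, 0 when computed locally).
    user_cost = sum(t['overhead'] for t in tasks)              # objective (12): minimize
    provider_benefit = sum(t['payment'] for t in tasks
                           if t['decision'] in (1, 2))         # objective (13): maximize
    return user_cost, provider_benefit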
The task offloading and resource allocation problem is constructed as a Markov decision process, and the task offloading problem of each time slot is abstracted into a dynamic solving problem: in each time slot, the mobile terminals generate tasks, task offloading decisions and resource allocation decisions are made, and the task processing computation is carried out; at the same time, the tasks of all terminals in each time slot can be computed and completed, and parallel task computation is supported. In the Markov decision process, the agent continuously interacts with the environment and takes corresponding actions that bring it to the next state, while the environment gives the agent a certain reward according to the action taken and the resulting state. This decision process corresponds to the interaction process between the mobile terminal and the CPU server in the present invention.
First, define the task state space $S = \{s_1, s_2, \dots, s_T\}$, where $T$ is the number of time slots. The state $s_t$ mainly comprises, for time slot $t$, the size of the task generated by each mobile terminal, the maximum delay constraint acceptable to the user, and the maximum computing-resource cost the user is willing to pay; in this scenario, one mobile terminal generates one task in a single time slot $t$. Define the policy action space $A = \{a_1, a_2, \dots, a_T\}$, where $a_t$ mainly comprises the offloading policy of each task in time slot $t$ and the corresponding resource allocation policy. Specifically, $s_t = \{d_i^t, \tau_i^t, b_i^t\}_i$ and $a_t = \{x_i^t, f_i^{c}, f_i^{f}\}_i$, taken over the mobile terminals $i$.
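In one illustrative, non-limiting example, the per-slot state $s_t$ and action $a_t$ may be represented as simple Python data structures as sketched below; the class and field names are assumptions introduced only for illustration.

from dataclasses import dataclass
from typing import List

@dataclass
class TaskState:
    data_bits: List[float]     # task size generated by each mobile terminal
    max_delay_s: List[float]   # maximum delay acceptable to each user
    max_budget: List[float]    # maximum computing-resource cost each user will pay

@dataclass
class PolicyAction:
    offload: List[int]         # 0 = local, 1 = CPU server, 2 = FPGA accelerator
    cpu_alloc: List[float]     # CPU-server resource allocated to each task
    fpga_alloc: List[float]    # FPGA resource allocated to each task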
In this example, R is used as an evaluation index to guide network learning in the DQN (Deep Q Network), where R represents the immediate reward value. The method designs a reward function based on the joint utility of the mobile terminal and the CPU server, so that the utilities of the mobile terminal and the service provider are evaluated comprehensively; at the same time, a logarithmic normalization method is adopted to normalize the two quantities of different dimensions, namely the mobile-terminal utility and the CPU-server utility, and map the result into $[0, 1]$, ensuring the accuracy and reliability of the evaluation result. Weights $x_1$ and $x_2$ are set to represent the weight values of the mobile terminal and the service provider respectively; by adjusting the weights, different degrees of bias of the utility toward the two parties can be realized. Through continuous training of the DQN, the immediate reward is maximized.
In addition, the invention also introduces a penalty factor: if the randomly selected offloading policy and the corresponding resource allocation policy do not satisfy the constraint conditions on delay, payment cost, computing resources and so on, the environment gives the agent a certain punishment, whose value is taken as $P_t$. The reward function in this example is therefore piecewise: $R_t = H_t$ when all constraints are satisfied, and $R_t = P_t$ otherwise (14), where $H_t$ is the reward value and $P_t$ the penalty value of time slot $t$ computed from the joint utility.
Referring to fig. 4, the present invention introduces reward and penalty factors based on the joint utility of the mobile terminal and the CPU server, and improves the DQN algorithm to make task offloading decisions. In task state $s_t$, the agent selects a policy action $a_t$ and thereby enters the next task state $s_{t+1}$, at the same time obtaining an immediate reward $H_t$ or penalty $P_t$; an experience sample $(s_t, a_t, R_t, s_{t+1})$ is thus obtained and played back into the sample pool. Specifically, the system generates a random number at each step and compares it with $\epsilon$: when the random number is less than or equal to $\epsilon$, the current Q network is used to determine the policy action $a_t$ corresponding to task state $s_t$; when the random number is greater than $\epsilon$, a policy action $a_t$ is selected at random from the policy action space. The penalty value or reward value is then calculated according to the above reward function.
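In one illustrative, non-limiting example, the action-selection rule just described (use the Q network when the random draw is at or below $\epsilon$, otherwise explore at random, following the convention of this embodiment) may be sketched in Python as follows; q_network is assumed to be a callable returning the scalar Q value of a state-action pair, and action_space a list of candidate actions, both introduced only for illustration.

import random

def select_action(q_network, state, action_space, epsilon):
    # Use the current Q network when the random number is <= epsilon,
    # otherwise pick a policy action at random from the action space.
    if random.random() <= epsilon:
        q_values = [q_network(state, a) for a in action_space]
        return action_space[q_values.index(max(q_values))]
    return random.choice(action_space)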
After the data in the sample pool has accumulated to a certain amount, samples are extracted from the pool for training using the combined method of "immediate reward/penalty + random extraction". The samples are taken as the input of the Q network and of the target Q network, and the loss between the output results of the two networks is calculated to update the Q network parameters $\theta$. Every step training iterations, the latest Q network parameters $\theta$ are copied to the target Q network. In each time slot, the agent selects a policy action $a_t$ from the policy action space $A$ based on the current task state $s_t$, enters the next task state $s_{t+1}$, and simultaneously updates the Q value. The core Q-value update function is: $Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\big[R_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t)\big]$ (15). This formula expresses that the Q value of state $s_t$ at time $t$ is determined by two parts: one part is the reward $R_t$ obtained after selecting policy action $a_t$, and the other part is the Q value of the next moment after the policy action has been selected. Here $a$ ranges over all possible policies; $\alpha$ is the learning rate, which represents the contribution of the current task state and policy action to the newly estimated Q value and takes a value in $[0, 1]$; $\gamma$ is the discount factor, which represents the importance of future rewards and takes a value in $[0, 1]$. The loss function of the DQN is defined as follows:
$L(\theta) = \mathbb{E}\Big[\big(R_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-) - Q(s_t, a_t; \theta)\big)^2\Big]$ (16).
where $\theta$ represents the parameters of the Q network and $\theta^-$ the parameters of the target Q network. Through this loss function, the predicted Q value is expected to approach the target Q value continuously, so that the Q network keeps being updated.
In the combined method of "immediate reward/penalty + random extraction", when sampling from the sample pool, the samples with the highest reward or penalty values are selected with a higher probability, while purely random sampling is performed with a smaller probability. This raises the probability of extracting important experiences and speeds up convergence; at the same time, the lower-probability random sampling reduces the correlation between samples and avoids falling into a locally optimal solution. Specifically, the system generates a random number and compares it with $l$: when the random number is less than or equal to $l$, the N samples with the highest reward or penalty values are selected; when the random number is greater than $l$, N samples are selected at random.
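In one illustrative, non-limiting example, this combined sampling rule may be sketched in Python as follows; the sample-pool layout (tuples whose third element is the reward or penalty value) and the function name sample_batch are assumptions introduced only for illustration.

import random

def sample_batch(pool, batch_size, l_threshold):
    # Prioritized draw (largest |reward/penalty|) when the random number is
    # <= l_threshold, uniformly random draw otherwise.
    if random.random() <= l_threshold:
        ranked = sorted(pool, key=lambda exp: abs(exp[2]), reverse=True)
        return ranked[:batch_size]
    return random.sample(pool, batch_size)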
In one example, a specific training process is defined as:
Input: the task state set $S$, the task set of the mobile terminals, the FPGA accelerator resource $F_f^{max}$, the CPU server resource $F_c^{max}$, the greedy threshold $\epsilon$, the learning rate $\alpha$ and the discount factor $\gamma$.
Output: the converged Q network $Q(s, a; \theta)$.
First, initialize the Q network parameters $\theta$ to complete the initialization of $Q(s, a; \theta)$, and copy the same parameters $\theta$ to $\theta^-$ to initialize the target Q network $Q(s, a; \theta^-)$; the Q network in the initial state is thus consistent with the target Q network. Initialize the sample pool $D$, set the sample extraction quantity $N$, set the iteration step number step for parameter synchronization, and initialize the current Q-network update iteration count $n$.
The specific steps are as follows. For $t = 1, 2, \dots, T$ in the for loop, generate a random number $\mu$ in $[0, 1]$. If $\mu \le \epsilon$, use the current network $Q(s, a; \theta)$ to output the optimal action $a_t$, and the agent executes this optimal action. If $\mu \le \epsilon$ does not hold, randomly select an action $a_t$ from the action space $A$, and the agent executes that action. If the delay constraint, the payment-cost constraint and/or the computing-resource constraints of the CPU server and the FPGA accelerator do not hold, the agent obtains the corresponding penalty value $P_t$ according to the reward function, the environment state becomes $s_{t+1}$, and the agent stores $(s_t, a_t, P_t, s_{t+1})$ into the sample pool $D$. If all the constraint conditions are satisfied, the agent obtains the corresponding reward value $H_t$ according to the reward function of formula (14), the environment state becomes $s_{t+1}$, and the agent stores $(s_t, a_t, H_t, s_{t+1})$ into the sample pool $D$. When the sample pool $D$ is full, the agent extracts $N$ samples from $D$ using the combined method of immediate reward/penalty and random extraction, i.e., $(s_j, a_j, R_j, s_{j+1})$, $j = 1, \dots, N$. For each sample, the target Q network is used to compute the target value $y_j = R_j + \gamma \max_{a'} Q(s_{j+1}, a'; \theta^-)$ and the current Q network is used to compute $Q(s_j, a_j; \theta)$; the loss function $L(\theta)$ is then calculated according to formula (16) and the target loss is minimized, thereby updating the current Q network parameters $\theta$. Thereafter $n \leftarrow n + 1$; if the new $n$ is an integer multiple of step, the current Q network parameters $\theta$ are copied to the target Q network $\theta^-$. The above is repeated, namely: the agent continues to interact with the environment, generating samples $(s_t, a_t, R_t, s_{t+1})$ and placing them into the sample pool, discarding the samples in the pool that have already been learned, and then continually extracting batches of samples from the pool to update the Q network until convergence. At the end of training, the converged Q network $Q(s, a; \theta)$ is output as the task allocation model.
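In one illustrative, non-limiting example, the per-batch update of formulas (15)-(16) and the periodic parameter synchronization may be sketched with PyTorch as below. The sketch assumes, only for illustration, the common DQN variant in which the network outputs one Q value per discretized candidate action; the (state, action)-input structure of fig. 5 can be handled analogously by enumerating candidate actions. The function names and tensor layout are likewise assumptions.

import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma):
    # One training step: TD target from the target network (formula 15),
    # squared-error loss against the current network (formula 16).
    states, actions, rewards, next_states = batch        # tensors of one mini-batch
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = target_net(next_states).max(dim=1).values
        q_target = rewards + gamma * q_next
    loss = F.mse_loss(q_pred, q_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def sync_target(q_net, target_net):
    # Copy the current Q-network parameters to the target Q network.
    target_net.load_state_dict(q_net.state_dict())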
It should be noted that in the DQN the Q network is trained with a neural-network method, and the structure of the initial Q network to be trained is shown in fig. 5. Classical Q learning obtains the optimal action with the maximum Q value by continuously and synchronously updating a Q-value table; however, because the dimension of the action space is large and the computing-resource allocation policy is a continuous variable, the complexity of updating a Q-value table grows exponentially. The neural-network method in the DQN can therefore be adopted to approximately fit the Q function: for the input state and action the corresponding Q value $Q(s, a; \theta)$ is computed, compared with the target Q value, and the loss function is calculated so as to continuously update the parameters $\theta$ until the parameters converge and become stable. The converged Q function makes the trained Q network contain the Q values of different actions in different states, so that when a task offloading decision is made, inputting the state data into the Q network automatically outputs the action with the largest Q value, which is the optimal offloading strategy.
As shown in fig. 5, the network structure of the Q network to be trained comprises: input layer 1 with D neurons, hidden layer 2 with J neurons, hidden layer 3 with K neurons, and output layer 4. The input data of the input layer is the pair of state and action $(s_t, a_t)$. The first neuron of layer 1 is connected to each of the J neurons of the second layer, and $w_{11}^{(1)}, w_{12}^{(1)}, \dots, w_{1J}^{(1)}$ denote the parameters connecting the first neuron to the J neurons of the second layer. Similarly, the second neuron of the first layer is connected to each of the J neurons of the second layer, and $w_{21}^{(1)}, w_{22}^{(1)}, \dots, w_{2J}^{(1)}$ denote the parameters connecting the second neuron of the first layer to the J neurons of the second layer.
Based on the input data of the first layer, the output result of the first layer is computed as $z^{(1)} = W^{(1)}(s_t, a_t) + b^{(1)}$ and, after the nonlinear transformation of the activation function, serves as the input of the second layer; the specific calculation formula is: $h^{(1)} = \sigma_1\!\big(z^{(1)}\big)$.
The output result of the second layer is computed as $z^{(2)} = W^{(2)} h^{(1)} + b^{(2)}$ and, after the nonlinear transformation of the activation function, serves as the input of the third layer; the specific calculation formula is: $h^{(2)} = \sigma_2\!\big(z^{(2)}\big)$.
The output result of the third layer is computed as $z^{(3)} = W^{(3)} h^{(2)} + b^{(3)}$ and, after the nonlinear transformation of the activation function, serves as the input of the fourth layer; the specific calculation formula is: $h^{(3)} = \sigma_3\!\big(z^{(3)}\big)$.
Based on the input data $h^{(3)}$ of the fourth layer, after the nonlinear transformation of the activation function $\sigma_4$, the Q value computed by the Q network is $Q(s_t, a_t; \theta)$ (i.e., the output training result), representing the Q value of selecting action $a_t$ in state $s_t$.
Here $b^{(1)}$ denotes the biases between the first layer and the second layer; similarly, $b^{(2)}$ denotes the biases between the second layer and the third layer. $\sigma_1$, $\sigma_2$, $\sigma_3$ and so on denote the activation functions that apply the nonlinear transformation to the computed results $z^{(1)}$, $z^{(2)}$, $z^{(3)}$.
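In one illustrative, non-limiting example, the layered structure of fig. 5, with the state-action pair as input and the scalar Q value as output, may be sketched with PyTorch as below; the hidden-layer sizes, the ReLU activations and the class name QNetwork are assumptions introduced only for illustration.

import torch.nn as nn

class QNetwork(nn.Module):
    # Small fully connected network following fig. 5:
    # (state, action) features in, scalar Q(s, a; theta) out.
    def __init__(self, input_dim, hidden1=64, hidden2=32):
        super().__init__()
        self.layer1 = nn.Linear(input_dim, hidden1)   # input layer -> hidden layer 2
        self.layer2 = nn.Linear(hidden1, hidden2)     # hidden layer 2 -> hidden layer 3
        self.layer3 = nn.Linear(hidden2, 1)           # hidden layer 3 -> output layer
        self.act = nn.ReLU()

    def forward(self, state_action):
        h1 = self.act(self.layer1(state_action))      # first nonlinear transformation
        h2 = self.act(self.layer2(h1))                # second nonlinear transformation
        return self.layer3(h2)                        # Q value of the state-action pair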
For the scenario with multiple mobile-terminal tasks, a single CPU server and a single FPGA accelerator, this embodiment introduces a new type of CPU server carrying an FPGA, combines task computation with the traditional CPU server, and provides a new mode in which the CPU and the FPGA cooperatively process edge computing tasks. Based on the delay, energy-consumption and cost indexes, reward and penalty factors of the joint utility are designed, and normalization of the different dimensions is applied, which enhances the comparability between different indexes and realizes a comprehensive and accurate assessment of the mobile-terminal utility and the service-provider utility. Accordingly, the agent can be trained comprehensively. An FPGA is a programmable special-purpose processor with inherent advantages in intensive task processing and highly concurrent processing, providing lower processing delay and power consumption. Therefore, in computation-intensive or delay-sensitive edge computing scenarios, the efficient computing capability of the FPGA can be introduced to cooperate with the CPU in jointly handling task offloading in edge computing, improving the quality of service and effectively reducing the computation delay and energy cost.
During the training of the agent, a method based on the combination of immediate reward/penalty and random extraction is designed to extract training samples from the sample pool, which guarantees both the convergence speed of training and the independence of the samples, and avoids the dilemma of being trapped in a locally optimal solution; finally, through continuous training and learning, the agent achieves Q-network convergence and outputs the optimal offloading strategy and resource allocation strategy for the current state.
Compared with a single CPU, the novel mode of cooperatively processing the edge calculation task by the CPU and the FPGA reduces the workload of the CPU, and meanwhile, the FPGA has the advantages of high performance, low delay, low power consumption and the like, thereby effectively reducing the processing delay and the energy consumption expenditure of the system and realizing the edge task calculation based on mixed heterogeneous resources.
A model training apparatus provided in the embodiments of the present invention is described below, and a model training apparatus described below may refer to other embodiments described herein.
Referring to fig. 6, an embodiment of the present invention discloses a model training apparatus, including:
an obtaining module 601, configured to obtain task information generated by a plurality of edge devices in an edge computing system at each time slot;
A first determining module 602, configured to determine an allocation policy of task information of each slot according to a preset reward function or in a random manner; the allocation policy is for: each task generated by a plurality of edge devices in a single time slot is distributed to a server in an edge computing system, an accelerator in the edge computing system or a local edge device; the preset reward function aims at minimizing the total resource consumption of the edge computing system under a single time slot, maximizing the task amount processed by the server and maximizing the task amount processed by the accelerator;
a second determining module 603, configured to determine punishment information corresponding to an allocation policy of each time slot;
a construction module 604, configured to construct the task information, allocation policy and punishment information of the same target time slot and the task information of the next time slot of the target time slot into sample data, and fill the sample data into a discrete sample set;
the training module 605 is configured to perform reinforcement learning training by using the discrete sample set, and obtain a task allocation model, where the task allocation model is used for: an optimal allocation policy is determined for task information generated by a plurality of edge devices in a single time slot.
In one embodiment, the second determining module is specifically configured to:
Determining an execution end of each task under each time slot according to an allocation strategy of each time slot, and calculating the resource consumption of each task at the corresponding execution end; the execution end is a server, an accelerator or local edge equipment;
if the resource consumption of any task exceeds the preset execution condition, calculating the punishment value of the time slot;
if the resource consumption of any task does not exceed the preset execution condition, calculating the rewarding value of the time slot;
and taking the punishment value or the rewarding value as punishment information of the corresponding task.
In one embodiment, the second determining module is specifically configured to:
according to the data processing amount of each task and the terminal characteristic information of the corresponding execution terminal, calculating the time delay, energy consumption and cost of the corresponding task at the corresponding execution terminal;
the resource consumption is determined comprehensively based on time delay, energy consumption and cost.
In one embodiment, the second determining module is specifically configured to:
for each task, if the execution end of the current task is the local edge equipment, calculating local time delay according to the data processing amount of the current task and the processor performance data of the local edge equipment; calculating local energy consumption according to the data processing amount of the current task, the processor performance data of the local edge equipment, the processor technological parameters of the local edge equipment and the processor resources required to be consumed by the local edge equipment for processing unit bits; the cost spent by the current task at the local edge device is set to zero. Or for each task, if the execution end of the current task is a server, calculating the server time delay according to the data processing amount of the current task, the rate at which the current task is received by the server, the processor resources required to be consumed by the unit bit of the server processing and the processor resources consumed by the current task at the server; calculating the energy consumption of the server according to the data processing amount of the current task, the rate of receiving the current task by the server and the uploading rate of the current task; and calculating the cost consumed by the current task at the server according to the data processing amount of the current task, the processor resource to be consumed by the unit bit of the server processing and the unit price of the resource. Or for each task, if the execution end of the current task is an accelerator, calculating the accelerator time delay according to the data processing amount of the current task, the speed of the accelerator receiving the current task, the processor resources required to be consumed by the accelerator processing unit bit and the processor resources consumed by the current task at the accelerator; according to the data processing amount of the current task, the speed of the accelerator for receiving the current task and the uploading speed of the current task, calculating the energy consumption of the accelerator; and calculating the expense consumed by the current task at the accelerator according to the data processing amount of the current task, the processor resource to be consumed by the accelerator processing unit bit and the resource unit price.
In one embodiment, the resource consumption of the task exceeding the preset execution condition includes: the actual latency of a task exceeds the maximum allowable latency of the task, the actual cost of the task exceeds the maximum cost budget of the task, and/or the processor resources that the task needs to consume exceed the free processor resources of the executing end executing the task.
In one embodiment, for each time slot, the amount of tasks processed by the server and the amount of tasks processed by the accelerator in the current time slot are calculated; calculating the total resource consumption of each task in the current time slot; and obtaining a penalty value or a reward value under the current time slot according to the task quantity processed by the server under the current time slot, the task quantity processed by the accelerator and the total resource consumption.
In one embodiment, the penalty value of a single time slot is calculated according to a first formula, in which: $P_t$ is the penalty value of time slot $t$; $G_t$ is, for time slot $t$, the server gain corresponding to the task amount processed by the server plus the accelerator gain corresponding to the task amount processed by the accelerator; $W_t$ is the total resource consumption of the tasks in time slot $t$; $x_1$ is the weight value corresponding to the gain sum and $x_2$ the weight value corresponding to the total resource consumption; exp denotes the exponential function with base e. Correspondingly, the reward value of a single time slot is calculated according to a second formula, in which: $H_t$ is the reward value of time slot $t$, and $G_t$, $W_t$, $x_1$, $x_2$ and exp are as defined for the first formula.
In one embodiment, the training module is specifically configured to:
and performing reinforcement learning training on the Q function in the Q network to be trained by using the discrete sample set so as to obtain an optimal Q value of the Q function in the Q network to be trained on the premise of following the constraint condition and the target of the preset rewarding function, thereby obtaining a task allocation model.
In one embodiment, the training module is specifically configured to:
determining a sample extraction mode, and extracting a target sample group from a discrete sample set according to the sample extraction mode;
inputting task information and allocation strategies of the same time slot in the same sample data in the target sample group into a Q network to be trained, so that a Q function in the Q network to be trained outputs a training result according to constraint conditions and targets of a preset reward function;
inputting punishment information of the last time slot and task information of the next time slot in the same sample data in the target sample group into a target Q network, so that the target Q network outputs a target result according to constraint conditions and targets of a preset reward function;
Calculating loss according to the training result and the target result;
updating network parameters of the Q network to be trained by using the loss, and executing the steps of determining a sample extraction mode and the follow-up steps so as to carry out iterative training on the Q network to be trained;
and when the loss meets the expectation, determining the current Q network to be trained as a task allocation model.
In one embodiment, the first determining module is specifically configured to:
and inputting the task information of each time slot into the current Q network to be trained so that the current Q network to be trained outputs the allocation strategy of each time slot.
In one embodiment, the training module is specifically configured to:
generating a target random number by utilizing a random function;
if the target random number is greater than a preset threshold value, extracting a target sample group from the discrete sample set in a random sampling mode; otherwise, the target sample group is extracted from the discrete sample set in a selective sampling manner.
In one embodiment, the training module is specifically configured to:
sample data with a value of punishment information larger than a fixed value is selected from the discrete sample set to be used as a target sample set.
In one embodiment, the training module is further to:
detecting whether the current iteration number meets the adjustment requirement of a preset threshold value;
If yes, adjusting a preset threshold according to a preset strategy.
In one embodiment, the training module is further to:
detecting whether the current iteration times reach a network parameter synchronization condition;
and if the network parameter synchronization condition is met, synchronizing the network parameters of the current Q network to be trained to the target Q network.
In one embodiment, the training module is specifically configured to:
calculating the loss according to a target formula; the target formula is: $L = \frac{1}{N}\sum_{i=1}^{N}\big(Q1_i - Q2_i\big)^2$, where L represents the loss, Q1 represents the target result, Q2 represents the training result, N is the number of sample data in the target sample group, and i = 1, 2, 3, …, N.
In one embodiment, the method further comprises:
the allocation module is used for inputting task information generated by the plurality of edge devices in a single time slot into the task allocation model, so that the task allocation model determines the execution end of each task in the current time slot and the processor resources occupied by each task at the corresponding execution end according to the constraint condition and the target of the preset rewarding function, and an optimal allocation strategy conforming to the target of the preset rewarding function is obtained.
The more specific working process of each module and unit in this embodiment may refer to the corresponding content disclosed in the foregoing embodiment, and will not be described herein.
Therefore, the embodiment can balance the resource consumption and the task amount born by the task execution end to formulate an optimal task allocation strategy, and allocate each task generated by a plurality of edge devices in a single time slot to a server in an edge computing system, an accelerator in the edge computing system or a local edge device according to the optimal task allocation strategy, so that a task producer can obtain higher cost performance with the minimum total resource consumption, and a task executor can obtain maximum benefit with the maximum task amount.
A task allocation device provided in the embodiments of the present invention is described below, and a task allocation device described below may refer to other embodiments described herein.
Referring to fig. 7, an embodiment of the present invention discloses a task allocation device, including:
a receiving module 701, configured to receive task information generated by a plurality of edge devices in an edge computing system in a single timeslot;
the processing module 702 is configured to input task information into a task allocation model, so that the task allocation model outputs an optimal allocation policy that meets a target of a preset reward function; training a task allocation model according to any one of the above methods;
The allocation module 703 is configured to allocate task information to a server in the edge computing system, an accelerator in the edge computing system, or a local edge device according to an optimal allocation policy.
The more specific working process of each module and unit in this embodiment may refer to the corresponding content disclosed in the foregoing embodiment, and will not be described herein.
Therefore, the embodiment can balance the resource consumption and the task amount born by the task execution end to formulate an optimal task allocation strategy, and allocate each task generated by a plurality of edge devices in a single time slot to a server in an edge computing system, an accelerator in the edge computing system or a local edge device according to the optimal task allocation strategy, so that a task producer can obtain higher cost performance with the minimum total resource consumption, and a task executor can obtain maximum benefit with the maximum task amount.
An electronic device provided in the embodiments of the present invention is described below, and an electronic device described below may refer to other embodiments described herein.
The embodiment of the invention discloses an electronic device, which comprises:
a memory for storing a computer program;
And a processor for executing the computer program to implement the method disclosed in any of the above embodiments.
Further, the embodiment of the invention also provides electronic equipment. The electronic device may be a server as shown in fig. 8 or a terminal as shown in fig. 9. Fig. 8 and 9 are each a block diagram of an electronic device according to an exemplary embodiment, and the contents of the drawings should not be construed as any limitation on the scope of use of the present invention.
Fig. 8 is a schematic structural diagram of a server according to an embodiment of the present invention. The server specifically may include: at least one processor, at least one memory, a power supply, a communication interface, an input-output interface, and a communication bus. The memory is used for storing a computer program, and the computer program is loaded and executed by the processor to realize relevant steps in model training and task allocation disclosed in any embodiment.
In this embodiment, the power supply is configured to provide a working voltage for each hardware device on the server; the communication interface can create a data transmission channel between the server and external equipment, and the communication protocol to be followed by the communication interface is any communication protocol applicable to the technical scheme of the invention, and the communication protocol is not particularly limited; the input/output interface is used for acquiring external input data or outputting data to the external, and the specific interface type can be selected according to the specific application requirement, and is not limited in detail herein.
In addition, the memory may be a read-only memory, a random access memory, a magnetic disk, an optical disk, or the like as a carrier for storing resources, where the resources stored include an operating system, a computer program, data, and the like, and the storage mode may be transient storage or permanent storage.
The operating system is used to manage and control each hardware device and computer program on the server so that the processor can operate on and process the data in the memory; it may be Windows Server, Netware, Unix, Linux and the like. The computer programs may include, in addition to the computer program used to perform the model training and task allocation methods disclosed in any of the foregoing embodiments, computer programs used to perform other specific tasks. The data may include, in addition to data such as update information of the application program, data such as information on the developer of the application program.
Fig. 9 is a schematic structural diagram of a terminal according to an embodiment of the present invention, where the terminal may specifically include, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, or the like.
Generally, the terminal in this embodiment includes: a processor and a memory.
The processor may include one or more processing cores, such as a 4-core processor or an 8-core processor. The processor may be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA and PLA (Programmable Logic Array). The processor may also include a main processor and a coprocessor; the main processor is the processor for processing data in the awake state, also called the CPU, while the coprocessor is a low-power processor for processing data in the standby state. In some embodiments, the processor may incorporate a GPU (Graphics Processing Unit) responsible for rendering and drawing the content that the display screen needs to display. In some embodiments, the processor may also include an AI (Artificial Intelligence) processor for processing computing operations related to machine learning.
The memory may include one or more computer-readable storage media, which may be non-transitory. The memory may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In this embodiment, the memory is at least configured to store a computer program, where the computer program, after being loaded and executed by the processor, is capable of implementing relevant steps in the model training and task allocation method performed by the terminal side as disclosed in any of the foregoing embodiments. In addition, the resources stored in the memory can also comprise an operating system, data and the like, and the storage mode can be short-term storage or permanent storage. The operating system may include Windows, unix, linux, among others. The data may include, but is not limited to, update information for the application.
In some embodiments, the terminal may further include a display screen, an input-output interface, a communication interface, a sensor, a power supply, and a communication bus.
Those skilled in the art will appreciate that the structure shown in fig. 9 is not limiting of the terminal and may include more or fewer components than shown.
A readable storage medium provided by embodiments of the present invention is described below, and the readable storage medium described below may be referred to with respect to other embodiments described herein.
A readable storage medium storing a computer program, wherein the computer program when executed by a processor implements the model training and task allocation method disclosed in the foregoing embodiments. The readable storage medium is a computer readable storage medium, and can be used as a carrier for storing resources, such as read-only memory, random access memory, magnetic disk or optical disk, wherein the resources stored on the readable storage medium comprise an operating system, a computer program, data and the like, and the storage mode can be transient storage or permanent storage.
A system provided by embodiments of the present invention is described below, and a system described below may be referred to with respect to other embodiments described herein.
Referring to fig. 10, an embodiment of the present invention discloses a system comprising: the system comprises a control center, a server, an accelerator and a plurality of edge devices; the plurality of edge devices are used for generating task information in a single time slot; the control center is used for inputting the task information into the task distribution model so that the task distribution model outputs an optimal distribution strategy which accords with the target of the preset rewarding function; distributing task information to a server, an accelerator or local edge equipment according to an optimal distribution strategy; the task allocation model is trained and obtained according to the method. The control center may be any computer device, and the control center may include an agent, which can implement the related method provided in any of the foregoing embodiments.
The more specific working process in this embodiment may refer to the corresponding content disclosed in the foregoing embodiment, and will not be described herein.
It can be seen that the present embodiment provides a system, which can implement comprehensive and accurate assessment of the mobile terminal utility and the service provider utility based on the time delay, the energy consumption and the cost index, and balance the mobile terminal utility and the service provider utility.
In this specification, each embodiment is described in a progressive manner, and each embodiment is mainly described in a different point from other embodiments, so that the same or similar parts between the embodiments are referred to each other.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of readable storage medium known in the art.
The principles and embodiments of the present invention have been described herein with reference to specific examples, the description of which is intended only to assist in understanding the methods of the present invention and the core ideas thereof; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope in accordance with the ideas of the present invention, the present description should not be construed as limiting the present invention in view of the above.

Claims (22)

1. A method of model training, comprising:
acquiring task information generated by a plurality of edge devices in an edge computing system in each time slot;
determining the allocation strategy of the task information of each time slot according to a preset reward function or in a random mode; the allocation policy is for: each task generated by the plurality of edge devices in a single time slot is distributed to a server in the edge computing system, an accelerator in the edge computing system or a local edge device; the preset reward function aims at minimizing the total resource consumption of the edge computing system in a single time slot, maximizing the task amount processed by the server and maximizing the task amount processed by the accelerator;
Determining punishment information corresponding to the allocation strategy of each time slot;
the task information, the allocation strategy and the punishment information of the same target time slot and the task information of the next time slot of the target time slot are constructed into sample data, and the sample data are filled into a discrete sample set;
performing reinforcement learning training by using the discrete sample set to obtain a task allocation model, wherein the task allocation model is used for: and determining an optimal allocation strategy for task information generated by the plurality of edge devices in a single time slot.
2. The method of claim 1, wherein determining punishment information corresponding to the allocation policy for each slot comprises:
determining an execution end of each task under each time slot according to an allocation strategy of each time slot, and calculating the resource consumption of each task at the corresponding execution end; the execution end is the server, the accelerator or local edge equipment;
if the resource consumption of any task exceeds the preset execution condition, calculating the punishment value of the time slot;
if the resource consumption of any task does not exceed the preset execution condition, calculating the rewarding value of the time slot;
And taking the punishment value or the rewarding value as punishment information of the corresponding task.
3. The method according to claim 2, wherein calculating the resource consumption of each task at the corresponding execution end comprises:
according to the data processing amount of each task and the terminal characteristic information of the corresponding execution terminal, calculating the time delay, energy consumption and cost of the corresponding task at the corresponding execution terminal;
the resource consumption is determined comprehensively based on time delay, energy consumption and cost.
4. A method according to claim 3, wherein calculating the time delay, the energy consumption and the cost of the corresponding task at the corresponding execution end according to the data processing capacity of each task and the end characteristic information of the corresponding execution end comprises:
for each task, if the execution end of the current task is the local edge equipment, calculating local time delay according to the data processing amount of the current task and the processor performance data of the local edge equipment; calculating local energy consumption according to the data processing amount of the current task, the processor performance data of the local edge equipment, the processor technological parameters of the local edge equipment and the processor resources required to be consumed by the local edge equipment for processing unit bits; setting the cost consumed by the current task at the local edge device to zero;
For each task, if the execution end of the current task is the server, calculating the server time delay according to the data processing amount of the current task, the rate at which the current task is received by the server, the processor resource required to be consumed by the unit bit of the server processing and the processor resource consumed by the current task at the server; calculating the energy consumption of the server according to the data processing amount of the current task, the rate of receiving the current task by the server and the uploading rate of the current task; calculating the cost consumed by the current task at the server according to the data processing amount of the current task, the processor resource to be consumed by the unit bit of the server and the unit price of the resource;
for each task, if the execution end of the current task is the accelerator, calculating the accelerator time delay according to the data processing amount of the current task, the speed of the accelerator receiving the current task, the processor resource required to be consumed by the accelerator processing unit bit and the processor resource consumed by the current task at the accelerator; according to the data processing amount of the current task, the speed of the accelerator for receiving the current task and the uploading speed of the current task, calculating the energy consumption of the accelerator; and calculating the expense consumed by the current task at the accelerator according to the data processing amount of the current task, the processor resource to be consumed by the accelerator processing unit bit and the resource unit price.
5. The method of claim 2, wherein the resource consumption of the task exceeding the preset execution condition comprises: the actual latency of a task exceeds the maximum allowable latency of the task, the actual cost of the task exceeds the maximum cost budget of the task, and/or the processor resources that the task needs to consume exceed the free processor resources of the executing end executing the task.
6. The method of claim 2, wherein for each time slot, calculating the amount of tasks processed by the server and the amount of tasks processed by the accelerator for the current time slot; calculating the total resource consumption of each task in the current time slot; and obtaining a punishment value or a rewarding value under the current time slot according to the task quantity processed by the server under the current time slot, the task quantity processed by the accelerator and the total resource consumption.
7. The method of claim 6, wherein the penalty value of a single time slot is calculated according to a first formula, in which: $P_t$ is the penalty value of time slot $t$; $G_t$ is, for time slot $t$, the server gain corresponding to the task amount processed by the server plus the accelerator gain corresponding to the task amount processed by the accelerator; $W_t$ is the total resource consumption of the tasks in time slot $t$; $x_1$ is the weight value corresponding to the gain sum and $x_2$ the weight value corresponding to the total resource consumption; exp denotes the exponential function with base e;
correspondingly, the reward value of a single time slot is calculated according to a second formula, in which: $H_t$ is the reward value of time slot $t$; $G_t$, $W_t$, $x_1$ and $x_2$ are as defined for the first formula; exp denotes the exponential function with base e.
8. The method according to any one of claims 1 to 7, wherein the performing reinforcement learning training using the discrete sample set to obtain a task allocation model includes:
and performing reinforcement learning training on the Q function in the Q network to be trained by using the discrete sample set so as to enable the Q function in the Q network to be trained to obtain an optimal Q value on the premise of following the constraint condition and the target of the preset rewarding function, thereby obtaining the task allocation model.
9. The method according to claim 8, wherein the performing reinforcement learning training on the Q function in the Q network to be trained by using the discrete sample set to obtain the optimal Q value for the Q function in the Q network to be trained on the premise of following the constraint condition and the target of the preset reward function, so as to obtain the task allocation model includes:
Determining a sample extraction mode, and extracting a target sample group from the discrete sample set according to the sample extraction mode;
inputting task information and allocation strategies of the same time slot in the same sample data in the target sample group into the Q network to be trained, so that a Q function in the Q network to be trained outputs a training result according to constraint conditions and targets of the preset reward function;
inputting punishment information of the last time slot and task information of the next time slot in the same sample data in the target sample group into a target Q network, so that the target Q network outputs a target result according to constraint conditions and targets of the preset reward function;
calculating a loss according to the training result and the target result;
updating network parameters of the Q network to be trained by using the loss, and executing a sample extraction mode determination and subsequent steps to iteratively train the Q network to be trained;
and when the loss meets the expectation, determining the current Q network to be trained as the task allocation model.
10. The method of claim 9, wherein determining the allocation policy of the task information for each slot according to the preset bonus function comprises:
And inputting the task information of each time slot into the current Q network to be trained so that the current Q network to be trained outputs the allocation strategy of each time slot.
11. The method of claim 9, wherein determining the sample extraction pattern and extracting the set of target samples from the set of discrete samples according to the sample extraction pattern comprises:
generating a target random number by utilizing a random function;
if the target random number is larger than a preset threshold value, extracting the target sample group from the discrete sample set in a random sampling mode; otherwise, the target sample group is extracted from the discrete sample set in a selective sampling mode.
12. The method of claim 11, wherein the extracting the set of target samples from the set of discrete samples in a selective sampling manner comprises:
and selecting sample data with the value of the punishment information larger than a fixed value from the discrete sample set as the target sample set.
13. The method of claim 11, wherein prior to generating the target random number using the random function, further comprising:
detecting whether the current iteration number meets the adjustment requirement of the preset threshold value;
And if so, adjusting the preset threshold according to a preset strategy.
14. The method as recited in claim 9, further comprising:
detecting whether the current iteration times reach a network parameter synchronization condition;
and if the network parameter synchronization condition is met, synchronizing the network parameters of the current Q network to be trained to the target Q network.
15. The method of claim 9, wherein said calculating a loss from said training results and said target results comprises:
calculating the loss according to a target formula; the target formula is: $L = \frac{1}{N}\sum_{i=1}^{N}\big(Q1_i - Q2_i\big)^2$, wherein L represents the loss, Q1 represents the target result, Q2 represents the training result, N is the number of sample data in the target sample group, and i = 1, 2, 3, …, N.
16. The method according to any one of claims 1 to 7, further comprising:
and inputting task information generated by the plurality of edge devices in a single time slot into the task allocation model, so that the task allocation model determines the execution end of each task and the processor resources occupied by each task at the corresponding execution end in the current time slot according to the constraint condition and the target of the preset rewarding function, and an optimal allocation strategy conforming to the target of the preset rewarding function is obtained.
17. A method of task allocation, comprising:
receiving task information generated by a plurality of edge devices in an edge computing system in a single time slot;
inputting the task information into a task distribution model so that the task distribution model outputs an optimal distribution strategy which accords with the target of a preset rewarding function; the task allocation model is trained according to the method of any one of claims 1 to 16;
and distributing the task information to a server in the edge computing system, an accelerator in the edge computing system or local edge equipment according to the optimal distribution strategy.
18. A model training device, comprising:
the acquisition module is used for acquiring task information generated by a plurality of edge devices in the edge computing system in each time slot;
the first determining module is used for determining the allocation strategy of the task information of each time slot according to a preset rewarding function or in a random mode; the allocation policy is for: each task generated by the plurality of edge devices in a single time slot is distributed to a server in the edge computing system, an accelerator in the edge computing system or a local edge device; the preset reward function aims at minimizing the total resource consumption of the edge computing system in a single time slot, maximizing the task amount processed by the server and maximizing the task amount processed by the accelerator;
The second determining module is used for determining punishment information corresponding to the allocation strategy of each time slot;
the construction module is used for constructing the task information, the allocation strategy and the punishment information of the same target time slot and the task information of the next time slot of the target time slot into sample data, and filling the sample data into a discrete sample set;
the training module is used for performing reinforcement learning training by utilizing the discrete sample set to obtain a task allocation model, and the task allocation model is used for: and determining an optimal allocation strategy for task information generated by the plurality of edge devices in a single time slot.
19. A task assigning apparatus, comprising:
the receiving module is used for receiving task information generated by a plurality of edge devices in the edge computing system in a single time slot;
the processing module is used for inputting the task information into a task distribution model so that the task distribution model outputs an optimal distribution strategy which accords with the target of a preset rewarding function; the task allocation model is trained according to the method of any one of claims 1 to 16;
and the allocation module is used for allocating the task information to a server in the edge computing system, an accelerator in the edge computing system or local edge equipment according to the optimal allocation strategy.
20. An electronic device, comprising:
a memory for storing a computer program;
a processor for executing the computer program to implement the method of any one of claims 1 to 17.
21. A readable storage medium for storing a computer program, wherein the computer program when executed by a processor implements the method of any one of claims 1 to 17.
22. A system, comprising: the system comprises a control center, a server, an accelerator and a plurality of edge devices;
the plurality of edge devices are used for generating task information in a single time slot;
the control center is used for inputting the task information into a task distribution model so that the task distribution model outputs an optimal distribution strategy which accords with a target of a preset reward function; distributing the task information to the server, the accelerator or the local edge equipment according to the optimal distribution strategy; the task allocation model is trained in accordance with the method of any one of claims 1 to 16.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311597780.8A CN117311991B (en) 2023-11-28 2023-11-28 Model training method, task allocation method, device, equipment, medium and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311597780.8A CN117311991B (en) 2023-11-28 2023-11-28 Model training method, task allocation method, device, equipment, medium and system

Publications (2)

Publication Number Publication Date
CN117311991A 2023-12-29
CN117311991B CN117311991B (en) 2024-02-23

Family

ID=89286882

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311597780.8A Active CN117311991B (en) 2023-11-28 2023-11-28 Model training method, task allocation method, device, equipment, medium and system

Country Status (1)

Country Link
CN (1) CN117311991B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113573324A (en) * 2021-07-06 2021-10-29 河海大学 Cooperative task unloading and resource allocation combined optimization method in industrial Internet of things
WO2022199032A1 (en) * 2021-03-22 2022-09-29 北京航空航天大学杭州创新研究院 Model construction method, task allocation method, apparatus, device, and medium
US20220366320A1 (en) * 2021-07-13 2022-11-17 Beijing Baidu Netcom Science Technology Co., Ltd. Federated learning method, computing device and storage medium
CN116708443A (en) * 2023-07-24 2023-09-05 中国电信股份有限公司 Multi-level calculation network task scheduling method and device
CN116880923A (en) * 2023-07-19 2023-10-13 吉林大学 Dynamic task unloading method based on multi-agent reinforcement learning

Also Published As

Publication number Publication date
CN117311991B (en) 2024-02-23

Similar Documents

Publication Publication Date Title
CN113950066B (en) Single server part calculation unloading method, system and equipment under mobile edge environment
Zhu et al. BLOT: Bandit learning-based offloading of tasks in fog-enabled networks
CN107632697B (en) Processing method, device, storage medium and the electronic equipment of application program
CN112291793B (en) Resource allocation method and device of network access equipment
CN112422644B (en) Method and system for unloading computing tasks, electronic device and storage medium
CN114021770A (en) Network resource optimization method and device, electronic equipment and storage medium
Zhou et al. Energy efficient joint computation offloading and service caching for mobile edge computing: A deep reinforcement learning approach
CN113645637B (en) Method and device for unloading tasks of ultra-dense network, computer equipment and storage medium
CN116489708B (en) Meta universe oriented cloud edge end collaborative mobile edge computing task unloading method
CN113163006A (en) Task unloading method and system based on cloud-edge collaborative computing
CN114090108B (en) Method and device for executing computing task, electronic equipment and storage medium
CN113747450B (en) Service deployment method and device in mobile network and electronic equipment
Chen et al. Twin delayed deep deterministic policy gradient-based intelligent computation offloading for IoT
WO2024001108A1 (en) Text answer determination method and apparatus, device, and medium
CN117311991B (en) Model training method, task allocation method, device, equipment, medium and system
CN114217881B (en) Task unloading method and related device
CN116009990B (en) Cloud edge collaborative element reinforcement learning computing unloading method based on wide attention mechanism
CN116954866A (en) Edge cloud task scheduling method and system based on deep reinforcement learning
CN114693141B (en) Transformer substation inspection method based on end edge cooperation
CN115499441A (en) Deep reinforcement learning-based edge computing task unloading method in ultra-dense network
CN114138493A (en) Edge computing power resource scheduling method based on energy consumption perception
Wang et al. Resource allocation based on Radio Intelligence Controller for Open RAN towards 6G
CN116225311B (en) Configuration method, device and server for terminal equipment storage system parameters
Fan et al. A deep reinforcement approach for computation offloading in MEC dynamic networks
JP2023078639A (en) Information processor and information processing method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant