CN114327916B - Training method, device and equipment of resource allocation system - Google Patents

Training method, device and equipment of resource allocation system

Info

Publication number
CN114327916B
CN114327916B
Authority
CN
China
Prior art keywords
model
algorithm
trained
resource allocation
situation data
Prior art date
Legal status
Active
Application number
CN202210232543.0A
Other languages
Chinese (zh)
Other versions
CN114327916A (en)
Inventor
徐博
宋金泽
熊炫棠
王燕娜
徐波
Current Assignee
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN202210232543.0A
Publication of CN114327916A
Application granted
Publication of CN114327916B

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a training method, device and equipment for a resource allocation system. The method comprises the following steps: encapsulating a first algorithm and a first simulation engine to obtain a first execution program of an initial model; running the first execution program based on the initial model to generate at least one group of situation data; and executing a training operation for each group of situation data in the at least one group of situation data, until the execution results corresponding to the at least one group of situation data all meet corresponding conditions, thereby obtaining the resource allocation system. In this way, an extensible and reusable resource allocation system is established, and the system can realize intelligent decision-making for multi-target, multi-resource dynamic allocation.

Description

Training method, device and equipment of resource allocation system
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a training method, device and equipment for a resource allocation system.
Background
The core of the multi-target multi-resource allocation (MOMRA) problem is that a correct and reliable allocation scheme must be produced in a short time. However, constrained by algorithm performance and the complexity of the MOMRA environment, research on the problem has been limited to static, small-scale scenarios. The multi-target multi-resource dynamic allocation (MOMRDA) problem is a repeated, sequential version of the MOMRA problem: an allocation decision must be made at every decision step, and the complexity of the problem grows exponentially with the number of decisions.
For the MOMRDA problem, some researchers have proposed evaluating the target situation through an expert-system knowledge base, calculating the quantity of resources to allocate with a heuristic algorithm, and summarizing corresponding rules. However, traditional expert systems have drawbacks such as single-track logic, simple structure, a fixed knowledge base, and poor reusability; they are mainly applied to early-stage activities such as reasoning and planning, and have difficulty handling unexpected situations. Meanwhile, heuristic algorithms need a long time to obtain an optimized allocation scheme, the scale of problem they can handle is limited, and they struggle to meet multi-target requirements.
For the MOMRDA problem, building a separate allocation system for every environment is laborious and wasteful. Establishing a scalable and reusable MOMRDA system is therefore a problem to be urgently solved by those skilled in the art.
Disclosure of Invention
In order to solve the above problems, embodiments of the present invention provide a training method, device, and equipment for a resource allocation system.
According to an aspect of the embodiments of the present invention, there is provided a training method for a resource allocation system, including:
encapsulating a first algorithm and a first simulation engine to obtain a first execution program of an initial model;
running the first execution program based on the initial model to generate at least one group of situation data;
executing the following training operation for each group of situation data in the at least one group of situation data, until the execution results corresponding to the at least one group of situation data all meet corresponding conditions, to obtain the resource allocation system, wherein the resource allocation system is adapted to any execution program, any such execution program being obtained by encapsulating a second algorithm and a second simulation engine, the first algorithm and the second algorithm each being an algorithm in a preset algorithm library, and the first simulation engine and the second simulation engine each being a simulation engine in a preset simulation engine library;
the training operation comprises:
inputting the corresponding situation data into a standard model and a model to be trained respectively, to obtain a standard result output by the standard model and a reference result output by the model to be trained, wherein the model to be trained corresponds to the initial model;
and if the preset stop condition is not met after the standard result and the reference result are obtained, updating the parameters of the model to be trained to obtain a new model to be trained, and repeating the step of inputting the corresponding situation data into the standard model and the model to be trained respectively to obtain the standard result output by the standard model and the reference result output by the model to be trained.
Optionally, encapsulating the first algorithm and the first simulation engine to obtain a first execution program of the initial model, including:
resetting the environmental state of the first simulation engine based on the first algorithm to obtain a first state value;
obtaining a first action according to the first state value;
if the preset stop condition is not met after the first action is obtained, advancing one time step, obtaining a first reward value corresponding to the first state value, and repeating the step of obtaining the first state value.
Optionally, after obtaining the reference result output by the model to be trained, the method further includes:
and modifying the reference result to obtain a modified modification result.
Optionally, after obtaining the resource allocation system, the method further includes:
and storing and obtaining training data generated by the process of the resource allocation system.
Optionally, after storing training data generated by the process of obtaining the resource allocation system, the method further includes:
operating the resource allocation system.
Optionally, operating the resource allocation system includes:
and running the first executive program in a preset environment to obtain a processing result of the resource distribution system in the preset environment.
According to another aspect of the embodiments of the present invention, there is provided a training apparatus for a resource allocation system, the apparatus including:
an encapsulation module, used for encapsulating the first algorithm and the first simulation engine to obtain a first execution program of the initial model;
a processing module, used for running the first execution program based on the initial model to generate at least one group of situation data; and for executing the following training operation for each group of situation data in the at least one group of situation data, until the execution results corresponding to the at least one group of situation data all meet corresponding conditions, to obtain the resource allocation system, wherein the resource allocation system is adapted to any execution program, any such execution program being obtained by encapsulating a second algorithm and a second simulation engine, the first algorithm and the second algorithm each being an algorithm in a preset algorithm library, and the first simulation engine and the second simulation engine each being a simulation engine in a preset simulation engine library; the training operation comprises: inputting the corresponding situation data into a standard model and a model to be trained respectively, to obtain a standard result output by the standard model and a reference result output by the model to be trained, wherein the model to be trained corresponds to the initial model;
and the output module is used for updating the parameters of the model to be trained to obtain a new model to be trained if the preset stop condition is not met after the standard result and the reference result are obtained, and repeatedly executing the steps of inputting the corresponding situation data to the standard model and the model to be trained respectively to obtain the standard result output by the standard model and the reference result output by the model to be trained.
According to still another aspect of an embodiment of the present invention, there is provided a computing device, including: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with each other through the communication bus;
the memory is used for storing at least one executable instruction, and the executable instruction enables the processor to execute the operation corresponding to the training method of the resource allocation system.
According to another aspect of the embodiments of the present invention, there is provided a computer storage medium, where at least one executable instruction is stored, and the executable instruction causes a processor to perform an operation corresponding to the training method of the resource allocation system.
According to the scheme provided by the embodiment of the invention, a first execution program of an initial model is obtained by encapsulating a first algorithm and a first simulation engine; the first execution program is run based on the initial model to generate at least one group of situation data; and the following training operation is executed for each group of situation data in the at least one group of situation data, until the execution results corresponding to the at least one group of situation data all meet corresponding conditions, to obtain the resource allocation system, wherein the resource allocation system is adapted to any execution program, any such execution program being obtained by encapsulating a second algorithm and a second simulation engine, the first algorithm and the second algorithm each being an algorithm in a preset algorithm library, and the first simulation engine and the second simulation engine each being a simulation engine in a preset simulation engine library. The training operation comprises: inputting the corresponding situation data into a standard model and a model to be trained respectively, to obtain a standard result output by the standard model and a reference result output by the model to be trained, wherein the model to be trained corresponds to the initial model; and if the preset stop condition is not met after the standard result and the reference result are obtained, updating the parameters of the model to be trained to obtain a new model to be trained, and repeating the step of inputting the corresponding situation data into the standard model and the model to be trained. In this way, an extensible and reusable resource allocation system is established, and the system can realize intelligent decision-making for multi-target, multi-resource dynamic allocation.
The foregoing description is only an overview of the technical solutions of the embodiments of the present invention, and the embodiments of the present invention can be implemented according to the content of the description in order to make the technical means of the embodiments of the present invention more clearly understood, and the detailed description of the embodiments of the present invention is provided below in order to make the foregoing and other objects, features, and advantages of the embodiments of the present invention more clearly understandable.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the embodiments of the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flowchart illustrating a training method of a resource allocation system according to an embodiment of the present invention;
FIG. 2 is a diagram illustrating a process of interaction of a virtual object with an environment provided by an embodiment of the invention;
FIG. 3 is a diagram illustrating a specific user usage scenario provided by an embodiment of the present invention;
FIG. 4 illustrates a specific screen display interface diagram provided by an embodiment of the invention;
FIG. 5 is a diagram illustrating a specific input instruction according to an embodiment of the present invention;
FIG. 6 is a flow chart illustrating the optimization of network parameters provided by an embodiment of the present invention;
FIG. 7 is a block diagram of an overall architecture of a resource allocation system provided by an embodiment of the invention;
FIG. 8 is a schematic structural diagram of a training apparatus of a resource allocation system according to an embodiment of the present invention;
fig. 9 is a schematic structural diagram of a computing device provided in an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.
Fig. 1 shows a flowchart of a method for training a resource allocation system according to an embodiment of the present invention. As shown in fig. 1, the method comprises the steps of:
step 11, encapsulating a first algorithm and a first simulation engine to obtain a first execution program of an initial model;
step 12, running the first execution program based on the initial model to generate at least one group of situation data;
step 13, executing the following training operation for each group of situation data in the at least one group of situation data, until the execution results corresponding to the at least one group of situation data all meet corresponding conditions, to obtain the resource allocation system, wherein the resource allocation system is adapted to any execution program, any such execution program being obtained by encapsulating a second algorithm and a second simulation engine, the first algorithm and the second algorithm each being an algorithm in a preset algorithm library, and the first simulation engine and the second simulation engine each being a simulation engine in a preset simulation engine library;
specifically, the corresponding condition may be, for example, that the parameters of the model to be trained have been updated a set number of times, but it is not limited to this.
step 14, the training operation comprises: inputting the corresponding situation data into a standard model and a model to be trained respectively, to obtain a standard result output by the standard model and a reference result output by the model to be trained, wherein the model to be trained corresponds to the initial model;
step 15, if the preset stop condition is not met after the standard result and the reference result are obtained, updating the parameters of the model to be trained to obtain a new model to be trained, and repeating the step of inputting the corresponding situation data into the standard model and the model to be trained respectively to obtain the standard result output by the standard model and the reference result output by the model to be trained.
In this embodiment, a first execution program of an initial model is obtained by encapsulating a first algorithm and a first simulation engine; the first execution program is run based on the initial model to generate at least one group of situation data; and the training operation is executed for each group of situation data until the execution results corresponding to the at least one group of situation data all meet corresponding conditions, yielding the resource allocation system. The resource allocation system is adapted to any execution program obtained by encapsulating a second algorithm and a second simulation engine, where the first and second algorithms are each an algorithm in a preset algorithm library, and the first and second simulation engines are each a simulation engine in a preset simulation engine library. The training operation inputs the corresponding situation data into a standard model and a model to be trained to obtain a standard result and a reference result and, while the preset stop condition is not met, updates the parameters of the model to be trained and repeats the comparison. In this way, an extensible and reusable resource allocation system is established that can realize intelligent decision-making for multi-target, multi-resource dynamic allocation.
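The training operation of steps 14 and 15 can be made concrete with a short sketch. The following is a minimal illustration, assuming a PyTorch-style trainee network, a fixed standard model (for example, the expert system), and a distance-based stop condition; all function and variable names here are illustrative assumptions, not the patent's own API.

import torch
import torch.nn.functional as F

def train_on_situation(situation, standard_model, trainee, optimizer,
                       max_updates=1000, tol=1e-3):
    """Run the compare-and-update loop of steps 14-15 on one group of data."""
    x = torch.as_tensor(situation, dtype=torch.float32)
    for _ in range(max_updates):                    # "set number of times"
        with torch.no_grad():
            standard_result = standard_model(x)     # output of the standard model
        reference_result = trainee(x)               # output of the model to be trained
        loss = F.mse_loss(reference_result, standard_result)
        if loss.item() < tol:                       # preset stop condition met
            return True
        optimizer.zero_grad()
        loss.backward()                             # update the trainee's parameters,
        optimizer.step()                            # then repeat with the new model
    return False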
In an alternative embodiment of the present invention, the first algorithm in step 11 may be any algorithm in an algorithm library, where the algorithm library may be implemented as a library module. The module supports synchronous/asynchronous and offline/online algorithms and can schedule multiple algorithms by means of parameter configuration, such as the offline DQN algorithm, the online synchronous A2C algorithm, and the asynchronous A3C algorithm, but is not limited to these. A user can select a suitable algorithm through the parameters or define a custom algorithm; the module also supports algorithm extension.
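As a minimal sketch of the parameter-driven scheduling just described, one might maintain a registry keyed by a configuration parameter; the registry layout and the empty DQN/A2C/A3C placeholders below are assumptions for illustration, not the library module's actual code.

ALGORITHM_REGISTRY = {}

def register(name):
    """Class decorator that adds an algorithm to the library under a key."""
    def deco(cls):
        ALGORITHM_REGISTRY[name] = cls
        return cls
    return deco

@register("dqn")   # offline
class DQN: ...

@register("a2c")   # online, synchronous
class A2C: ...

@register("a3c")   # asynchronous
class A3C: ...

def make_algorithm(params):
    """Select and instantiate an algorithm from the library by parameter."""
    return ALGORITHM_REGISTRY[params["algorithm"]](**params.get("kwargs", {}))

# e.g. make_algorithm({"algorithm": "a2c"}) returns an A2C instance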
In another optional embodiment of the present invention, the first simulation engine in step 11 may be any simulation engine in an engine library, where the engine library may be implemented as a simulation engine library module compatible with at least two simulation engines. Considering the complex diversity of different MOMRDA scenarios, a scenario is abstractly described by a set of parameters related to resources and targets; by adjusting the relevant parameters in a configuration file, the resources of a scenario are configured and the balance of the game is further adjusted, enabling game deduction under the same context with different resources. The module also provides a MOMRDA simulation engine sample whose parameters configure MOMRDA scenarios of different sizes, and a user can customize an engine following this sample.
Meanwhile, a user can access a user-defined scenario through the general gym interface. The resource allocation system in the embodiment of the invention has a good human-machine collaboration interface, comprehensive displayed information, and sufficient, well-developed configurable interfaces, and supports a user in extending any number of simulation engines.
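As an illustration of the configuration-file mechanism described above, a scenario might be described by a handful of resource- and target-related parameters; the keys and values below are assumptions, not the patent's actual configuration schema.

import json

# Parameters describing one MOMRDA scenario; adjusting them reconfigures the
# resources and the balance of the game without changing any engine code.
scenario_config = {
    "engine": "momrda_sample",                       # which engine in the engine library
    "num_targets": 8,                                # scale of the scenario
    "num_resource_types": 4,
    "resources_per_type": [4, 4, 2, 2],
    "balance": {"attacker": 1.0, "defender": 1.2},   # game-balance tuning
}

with open("scenario.json", "w") as f:
    json.dump(scenario_config, f, indent=2)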
In yet another alternative embodiment of the present invention, step 11 may comprise:
step 111, resetting the environment state of the first simulation engine based on the first algorithm to obtain a first state value;
step 112, obtaining a first action according to the first state value;
step 113, if the preset stop condition is not met after the first action is obtained, advancing one time step, obtaining a first reward value corresponding to the first state value, and repeating the step of obtaining the first state value, wherein the preset stop condition may be, for example, an online duration T or a set number of steps, but is not limited to these.
In this embodiment, as shown in fig. 2, the encapsulation of the first algorithm and the first simulation engine mainly follows the gym development standard and uses a unified set of conventions to realize the interface between different algorithms and different environments. It also supports interaction between a reinforcement learning algorithm and the environment, between an expert system and the environment, and between a user and the environment, where the expert system is the standard system in the embodiment of the invention. In addition, multi-process invocation of the single-process simulation engine is realized with the Python multiprocessing library, meeting the multi-process training requirements of deep reinforcement learning algorithms.
The core gym interface is Env, which contains the two core methods reset() and step(). On this basis, embodiments of the present invention encapsulate the following functions:
get_state(), corresponding to step 111 above: resets the state of the environment and returns the state value. That is, when an episode ends, the necessary computing resources are allocated and the environment is reinitialized.
get_action(state), corresponding to step 112 above: obtains the action in the current state according to the state and other information. get_action(state) is the key function through which the simulation engine environment interfaces with reinforcement learning algorithms, expert systems, and users.
get_step(action), corresponding to step 113 above: represents the virtual object interacting with the environment; it advances one time step and returns state, reward, done, info.
On top of the single-process gym encapsulation of the simulation engine, concurrent deduction is realized in a multi_env function based on the multi-process capability provided by Python's multiprocessing library.
When the simulation engine is changed, the general gym interface code can be reused, which improves development efficiency.
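The following minimal sketch puts the pieces above together: a gym-style wrapper exposing get_state/get_action/get_step, plus a multi_env helper for concurrent deduction with multiprocessing. The engine and policy attributes and the multi_env body are illustrative assumptions (for example, the engines must be picklable for Pool to work), not the patent's actual implementation.

import multiprocessing as mp

class MomrdaEnv:
    """gym-style wrapper joining a simulation engine to a decision source."""

    def __init__(self, engine, policy):
        self.engine = engine      # an engine from the simulation engine library
        self.policy = policy      # RL algorithm, expert system, or user input

    def get_state(self):
        """Reset the environment state and return the state value."""
        return self.engine.reset()

    def get_action(self, state):
        """Obtain the action in the current state from the attached policy."""
        return self.policy.act(state)

    def get_step(self, action):
        """Advance one time step; returns state, reward, done, info."""
        return self.engine.step(action)

def rollout(env):
    """Deduce one episode: reset, then act until done."""
    state, done, trace = env.get_state(), False, []
    while not done:
        state, reward, done, info = env.get_step(env.get_action(state))
        trace.append((state, reward, done, info))
    return trace

def multi_env(envs):
    """Concurrent deduction over several single-process engines."""
    with mp.Pool(len(envs)) as pool:
        return pool.map(rollout, envs)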
In another optional embodiment of the present invention, in step 14, after obtaining the reference result output by the model to be trained, the method may further include:
step 141, modifying the reference result to obtain a modified result.
In this embodiment, in large-scale applications it is burdensome for the user to analyze the merits of every action, so an action evaluation function can automatically compare the action given by the algorithm with the action given by the expert system and decide whether the algorithm's action should be modified; if it should, the modified action is input to obtain the modified result. Modifying the reference result is not limited to the action evaluation function. Here, the expert system module builds expert system rules, driven by heuristic knowledge, that suit the target scenario. The expert system rules can serve as a baseline against which traditional non-intelligent allocation algorithms and subsequent intelligent allocation algorithms are studied and compared, and analysis yields the differences and similarities between the intelligent allocation algorithm and the traditional method.
Specifically, to make human-computer interaction input more convenient, the human-machine collaboration module uses the keyboard for user instruction input, i.e., resource allocation instructions are entered in real time; a single resource or multiple resources can be input at each decision moment, and the pygame keyboard toolkit is used to process user instructions. The instructions that a user can input in real time in the human-machine collaboration module are shown in Table 1, but are not limited to Table 1.
Key                   Effect
Keypad number key 1   Attack the target with subscript 1
Keypad number key 2   Attack the target with subscript 2
Keypad number key 3   Attack the target with subscript 3
Keypad number key 4   Attack the target with subscript 4
Number key 1          Allocate 1 resource
Number key 2          Allocate 2 resources
Number key 3          Allocate 3 resources
Number key 4          Allocate 4 resources
Space key             Pause the game deduction
ESC key               Quit the game deduction
F1 key                Open the command input window
TABLE 1
When using the instruction set shown in Table 1, a keypad number key is first pressed to select the target with the corresponding number; the corresponding action is performed only if that target exists.
If the number of resources to allocate is 1, the user can decide according to the prompts on the display interface and press the indicated number key to attack the desired target; if the impact of an erroneous execution is small, the user can press the number key several times in succession to issue multiple launches.
If the number of resources to allocate is greater than 1, the F1 key pauses the current game process and the system pops up an instruction dialog. As shown in fig. 5, the corresponding command can be entered in the instruction dialog; the overall instruction format is an 8-digit number such as 12345678, with each digit representing the amount of one resource.
If a command contains an error, the Backspace key cancels the last input. For example, if the command is 12345678, pressing Backspace changes it to 1234567.
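A minimal sketch of the keyboard handling described above might look as follows, assuming pygame has been initialized with a display; the key bindings follow Table 1, while the function name and the (kind, value) command encoding are illustrative assumptions.

import pygame

TARGET_KEYS = {pygame.K_KP1: 1, pygame.K_KP2: 2, pygame.K_KP3: 3, pygame.K_KP4: 4}
AMOUNT_KEYS = {pygame.K_1: 1, pygame.K_2: 2, pygame.K_3: 3, pygame.K_4: 4}

def poll_user_command(paused):
    """Translate one pygame key event into a (kind, value) command."""
    for event in pygame.event.get():
        if event.type != pygame.KEYDOWN:
            continue
        if event.key in TARGET_KEYS:
            return "attack_target", TARGET_KEYS[event.key]   # keypad 1-4
        if event.key in AMOUNT_KEYS:
            return "allocate", AMOUNT_KEYS[event.key]        # number keys 1-4
        if event.key == pygame.K_SPACE:
            return "pause", not paused                       # pause deduction
        if event.key == pygame.K_ESCAPE:
            return "quit", None                              # quit deduction
        if event.key == pygame.K_F1:
            return "open_dialog", None   # 8-digit multi-resource command entry
    return None, None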
Fig. 3 is a schematic diagram of a specific user usage scenario provided by an embodiment of the present invention. As shown in fig. 3, in this scenario a parameter α is first sent to the simulation engine library to select the corresponding simulation environment; a parameter β is then sent to the algorithm library to select the corresponding algorithm. Once the simulation environment and the algorithm are selected, the multi-concurrent training module starts training. Fig. 4 shows a specific screen display interface provided by an embodiment of the present invention. As shown in fig. 4, during training the display interface shows the deduced situation, decision actions, and damage results in real time; the user can pause at any time, observe the environment changes, make a decision according to the current situation, and compare the action a made by the reinforcement algorithm with the action a_e made by the expert system. If the action made by the reinforcement algorithm is not as good as that made by the expert system, the user can send a decision modification instruction a_u to the environment. After receiving the instruction a_u, the multi-concurrent training module stores it in a data buffer and assigns it a greater importance weight, so that the intelligent algorithm can learn actions beyond the expert system.
In another optional embodiment of the present invention, in step 13, after obtaining the resource allocation system, the method may further include:
step 131, storing training data generated by the process of obtaining the resource allocation system.
In this embodiment, the training data generated by the process of obtaining the resource allocation system is stored in a data buffer; the stored training data includes, but is not limited to, the user's input instructions.
Specifically, the instructions input by the user are stored in the data buffer as data for neural network parameter optimization. As shown in fig. 6, the multi-concurrent training module stores the instruction a_u input by the user, the action a_t output by the deep reinforcement algorithm, and the data generated by the environment in the data buffer; during network model training, data are sampled from the buffer and the network parameters are updated. The user instruction a_u is an important basis for neural network learning, so it is given a higher weight during sampling.
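A minimal sketch of such a weighted buffer follows; the weight value and the uniform default for non-user data are assumptions for illustration, not the patent's actual settings.

import random

class WeightedBuffer:
    """Data buffer in which user instructions a_u carry a larger weight."""

    def __init__(self, user_weight=5.0):
        self.items, self.weights = [], []
        self.user_weight = user_weight

    def add(self, transition, from_user=False):
        """Store a transition; flag transitions that came from the user."""
        self.items.append(transition)
        self.weights.append(self.user_weight if from_user else 1.0)

    def sample(self, batch_size):
        """Weighted sampling, so a_u is drawn more often during updates."""
        return random.choices(self.items, weights=self.weights, k=batch_size)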
In another optional embodiment of the present invention, after step 131, the method may further include:
step 1311, the resource allocation system is operated.
As shown in fig. 7, in this embodiment the resource allocation system includes the following six modules:
the simulation engine library module: the input is simulation environment parameters, the output is a simulation engine, and different environment parameters correspond to different engines.
Second, the decision algorithm library module: the input is algorithm parameters and the output is a reinforcement learning algorithm; different parameters correspond to different algorithms.
Third, the universal interface module: following the gym standard, this module encapsulates the simulation deduction engine and the reinforcement learning algorithm, provides a universal AI training interface for different engines and algorithms, and interfaces the algorithm library module with the simulation engine module, forming a unified set of encapsulation interfaces.
Fourth, the multi-concurrent training module: once the universal interface module has connected the algorithm with the engine, the multi-concurrent training module takes over. The bottom layer of the module supports both CPU and GPU. Several CPUs are started synchronously to execute the same program; each CPU is initialized to interact with the same environment and the deep reinforcement algorithm, and collects data to the CPU/GPU in real time. When the data collection reaches a preset condition, the CPU/GPU updates the policy network parameters of the algorithm, and each CPU continues to interact with the environment based on the updated policy network parameters. This realizes multi-concurrent AI training, improves training speed, and supports an extensible distributed framework.
Specifically, first, multiple CPUs are started synchronously to execute the same program; each CPU is initialized in the same environment, interacts with the deep reinforcement algorithm, and collects data to the CPU/GPU in real time. Second, when the data collection reaches the preset stop condition, the CPU/GPU updates the policy network parameters of the algorithm, and each CPU continues to interact with the environment based on the updated parameters.
In the multi-concurrent training module, the combined use of CPU and GPU significantly improves hardware utilization and scale and accelerates learning. The multi-CPU data collection scheme realizes multi-concurrent AI training, improves training speed, breaks data correlation, improves data validity, and supports an extensible distributed framework.
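The collect-then-update cycle described above can be sketched with Python's multiprocessing: worker processes gather data while a central learner broadcasts refreshed policy parameters. The queue wiring, the collect_rollout stub, and the injected init_params/update_policy callables are assumptions for illustration, not the module's actual implementation.

import multiprocessing as mp

def collect_rollout(env, params):
    """Stub: interact with the environment under `params`, return a batch."""
    ...

def worker(make_env, param_queue, data_queue):
    """One CPU process: the same program, its own copy of the same environment."""
    env, params = make_env(), param_queue.get()
    while True:
        data_queue.put(collect_rollout(env, params))   # collect data in real time
        while not param_queue.empty():                 # adopt the newest parameters
            params = param_queue.get()

def train(make_env, init_params, update_policy, num_workers=4, updates=100):
    data_queue = mp.Queue()
    param_queues = [mp.Queue() for _ in range(num_workers)]
    procs = [mp.Process(target=worker, args=(make_env, q, data_queue), daemon=True)
             for q in param_queues]
    params = init_params()
    for q in param_queues:
        q.put(params)                                  # initial policy to every worker
    for p in procs:
        p.start()
    for _ in range(updates):                           # preset stop condition
        batch = [data_queue.get() for _ in range(num_workers)]
        params = update_policy(params, batch)          # central CPU/GPU update
        for q in param_queues:
            q.put(params)                              # broadcast new parameters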
Fifth, the expert system module: the input is environment situation data and the output is an expert action. The module establishes a resource-target allocation expert system based on the target scenario and builds expert system rules driven by heuristic knowledge and suited to that scenario. The expert system rules can serve as a baseline against which traditional non-intelligent resource-target allocation algorithms and subsequent intelligent allocation algorithms are studied and compared, and analysis yields the differences and similarities between the intelligent resource-target allocation algorithm and the traditional method.
Sixth, the human-machine collaboration module: the input is the environment situation and the output is a user action. Through the human-computer interaction interface, the module accepts user instructions in real time during the human-machine game, learns from the instructions through the neural network, and optimizes the decision strategy. Meanwhile, the decision result is displayed in real time, which assists the user in human-computer interaction.
In yet another alternative embodiment of the present invention, step 1311 may comprise:
step 13111, running the first execution program in a preset environment to obtain a processing result of the resource allocation system in the preset environment.
In this embodiment, after the processing result is obtained, it can be compared with the standard result given by the expert system.
In the embodiment of the invention, the resource allocation system integrates multiple simulation engines, is compatible with multiple deep reinforcement learning algorithms and expert systems, supports human-machine collaboration and multi-concurrent training, and combines these functions in one system that is convenient to use. Both the simulation engines and the algorithms can be extended, the multi-concurrent training can be scaled out into a distributed setup, and the universal interface code can be reused, which effectively improves development efficiency.
Fig. 8 is a schematic structural diagram of a training apparatus 80 of a resource allocation system according to an embodiment of the present invention. As shown in fig. 8, the apparatus includes:
an encapsulation module 81, configured to encapsulate the first algorithm and the first simulation engine to obtain a first execution program of the initial model;
a processing module 82, configured to run the first execution program based on the initial model to generate at least one group of situation data; and to execute the following training operation for each group of situation data in the at least one group of situation data, until the execution results corresponding to the at least one group of situation data all meet corresponding conditions, to obtain the resource allocation system, wherein the resource allocation system is adapted to any execution program, any such execution program being obtained by encapsulating a second algorithm and a second simulation engine, the first algorithm and the second algorithm each being an algorithm in a preset algorithm library, and the first simulation engine and the second simulation engine each being a simulation engine in a preset simulation engine library; the training operation comprises: inputting the corresponding situation data into a standard model and a model to be trained respectively, to obtain a standard result output by the standard model and a reference result output by the model to be trained, wherein the model to be trained corresponds to the initial model;
and the output module 83 is configured to update the parameters of the model to be trained to obtain a new model to be trained if the preset stop condition is not met after the standard result and the reference result are obtained, and repeatedly perform the step of inputting the corresponding situation data to the standard model and the model to be trained respectively to obtain the standard result output by the standard model and the reference result output by the model to be trained.
Optionally, the encapsulating module 81 is further configured to reset the environment state of the first simulation engine based on the first algorithm to obtain a first state value;
obtaining a first action according to the first state value;
if the preset stop condition is not met after the first action is obtained, a time step is advanced, a first reward value corresponding to the first state value is obtained, and the step of obtaining the first state value is repeatedly executed.
Optionally, the processing module 82 is further configured to modify the reference result to obtain a modified modification result.
Optionally, the processing module 82 is further configured to store training data generated by the process of obtaining the resource allocation system.
Optionally, the processing module 82 is further configured to operate the resource allocation system.
Optionally, the processing module 82 is further configured to run the first execution program in a preset environment, so as to obtain a processing result of the resource allocation system in the preset environment.
It should be understood that the above description of the method embodiments illustrated in fig. 1 to 7 is merely an illustration of the technical solution of the present invention by way of alternative examples, and does not limit the training method of the resource allocation system according to the present invention. In other embodiments, the execution steps and the sequence of the training method of the resource allocation system according to the present invention may be different from those of the above embodiments, and the embodiments of the present invention do not limit this.
It should be noted that this embodiment is an apparatus embodiment corresponding to the above method embodiment, and all the implementations in the above method embodiment are applicable to this apparatus embodiment, and the same technical effects can be achieved.
An embodiment of the present invention provides a non-volatile computer storage medium, where the computer storage medium stores at least one executable instruction, and the computer executable instruction may execute a training method of a resource allocation system in any method embodiment described above.
Fig. 9 is a schematic structural diagram of a computing device according to an embodiment of the present invention, and the specific embodiment of the present invention does not limit the specific implementation of the computing device.
As shown in fig. 9, the computing device may include: a processor, a communication interface, a memory, and a communication bus.
The processor, the communication interface, and the memory communicate with each other via the communication bus. The communication interface is used for communicating with network elements of other devices, such as clients or other servers. The processor is used for executing the program, and in particular can execute the relevant steps of the training method embodiments of the resource allocation system on the computing device.
In particular, the program may include program code comprising computer operating instructions.
The processor may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The computing device includes one or more processors, which may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.
And the memory is used for storing programs. The memory may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
The program may in particular be adapted to cause a processor to perform the method of training a resource allocation system in any of the method embodiments described above. For specific implementation of each step in the program, reference may be made to corresponding steps and corresponding descriptions in units in the above embodiments of the training method for a resource allocation system, which are not described herein again. It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described devices and modules may refer to the corresponding process descriptions in the foregoing method embodiments, and are not described herein again.
The algorithms or displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. In addition, embodiments of the present invention are not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of embodiments of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best modes of embodiments of the invention.
In the description provided herein, numerous specific details are set forth. It is understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques have not been shown in detail in order not to obscure an understanding of this description.
Similarly, it should be appreciated that in the foregoing description of exemplary embodiments of the invention, various features of the embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects.
Those skilled in the art will appreciate that the modules in the device in an embodiment may be adaptively changed and disposed in one or more devices different from the embodiment. The modules or units or components of the embodiments may be combined into one module or unit or component, and furthermore they may be divided into a plurality of sub-modules or sub-units or sub-components.
Furthermore, those skilled in the art will appreciate that while some embodiments herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments.
The various component embodiments of the invention may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will appreciate that a microprocessor or Digital Signal Processor (DSP) may be used in practice to implement some or all of the functionality of some or all of the components according to embodiments of the present invention. Embodiments of the invention may also be implemented as apparatus or device programs (e.g., computer programs and computer program products) for performing a portion or all of the methods described herein. Such programs implementing embodiments of the present invention may be stored on computer-readable media or may be in the form of one or more signals. Such a signal may be downloaded from an internet website or provided on a carrier signal or in any other form.
It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that the word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. Embodiments of the invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. The usage of the words first, second and third, etcetera do not indicate any ordering. These words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specified otherwise.

Claims (8)

1. A method for training a resource allocation system, the method comprising:
encapsulating a first algorithm and a first simulation engine to obtain a first execution program of an initial model, wherein encapsulating the first algorithm and the first simulation engine to obtain the first execution program of the initial model comprises: resetting the environment state of the first simulation engine based on the first algorithm to obtain a first state value; obtaining a first action according to the first state value; and, if the preset stop condition is not met after the first action is obtained, advancing one time step, obtaining a first reward value corresponding to the first state value, and repeating the step of obtaining the first state value;
running the first execution program based on the initial model to generate at least one group of situation data;
executing the following training operation for each group of situation data in the at least one group of situation data, until the execution results corresponding to the at least one group of situation data all meet corresponding conditions, to obtain the resource allocation system, wherein the resource allocation system is adapted to any execution program, any such execution program being obtained by encapsulating a second algorithm and a second simulation engine, the first algorithm and the second algorithm each being an algorithm in a preset algorithm library, and the first simulation engine and the second simulation engine each being a simulation engine in a preset simulation engine library;
the training operation comprises:
inputting each group of situation data in the at least one group of situation data into a standard model and a model to be trained respectively, to obtain a standard result output by the standard model and a reference result output by the model to be trained, wherein the model to be trained corresponds to the initial model;
and if the preset stop condition is not met after the standard result and the reference result are obtained, updating the parameters of the model to be trained to obtain a new model to be trained, and repeating the step of inputting the corresponding situation data into the standard model and the model to be trained respectively to obtain the standard result output by the standard model and the reference result output by the model to be trained.
2. The method for training the resource allocation system according to claim 1, further comprising, after obtaining the reference result output by the model to be trained, the following steps:
and modifying the reference result to obtain a modified modification result.
3. The method for training the resource allocation system according to claim 1, further comprising, after obtaining the resource allocation system:
and storing and obtaining training data generated by the process of the resource allocation system.
4. The method of claim 3, further comprising, after storing training data generated by the process of obtaining the resource allocation system:
operating the resource allocation system.
5. The method of claim 4, wherein operating the resource allocation system comprises:
and running the first executive program in a preset environment to obtain a processing result of the resource distribution system in the preset environment.
6. An apparatus for training a resource allocation system, the apparatus comprising:
an encapsulation module, used for encapsulating the first algorithm and the first simulation engine to obtain a first execution program of the initial model; resetting the environment state of the first simulation engine based on the first algorithm to obtain a first state value; obtaining a first action according to the first state value; and, if the preset stop condition is not met after the first action is obtained, advancing one time step, obtaining a first reward value corresponding to the first state value, and repeating the step of obtaining the first state value;
a processing module, used for running the first execution program based on the initial model to generate at least one group of situation data; and for executing the following training operation for each group of situation data in the at least one group of situation data, until the execution results corresponding to the at least one group of situation data all meet corresponding conditions, to obtain the resource allocation system, wherein the resource allocation system is adapted to any execution program, any such execution program being obtained by encapsulating a second algorithm and a second simulation engine, the first algorithm and the second algorithm each being an algorithm in a preset algorithm library, and the first simulation engine and the second simulation engine each being a simulation engine in a preset simulation engine library; the training operation comprises: inputting the corresponding situation data into a standard model and a model to be trained respectively, to obtain a standard result output by the standard model and a reference result output by the model to be trained, wherein the model to be trained corresponds to the initial model;
and the output module is used for updating the parameters of the model to be trained to obtain a new model to be trained if the preset stop condition is not met after the standard result and the reference result are obtained, and repeatedly executing the steps of inputting the corresponding situation data to the standard model and the model to be trained respectively to obtain the standard result output by the standard model and the reference result output by the model to be trained.
7. A computing device, comprising: a processor, a memory, a communication interface, and a communication bus, wherein the processor, the memory, and the communication interface communicate with each other through the communication bus;
the memory is configured to store at least one executable instruction that, when executed, causes the processor to perform the training method of the resource allocation system of any one of claims 1-5.
8. A computer storage medium having stored therein at least one executable instruction that when executed causes a computing device to perform a method of training a resource allocation system according to any one of claims 1-5.
CN202210232543.0A 2022-03-10 2022-03-10 Training method, device and equipment of resource allocation system Active CN114327916B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210232543.0A CN114327916B (en) 2022-03-10 2022-03-10 Training method, device and equipment of resource allocation system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210232543.0A CN114327916B (en) 2022-03-10 2022-03-10 Training method, device and equipment of resource allocation system

Publications (2)

Publication Number Publication Date
CN114327916A CN114327916A (en) 2022-04-12
CN114327916B (en) 2022-06-17

Family

ID=81034124

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210232543.0A Active CN114327916B (en) 2022-03-10 2022-03-10 Training method, device and equipment of resource allocation system

Country Status (1)

Country Link
CN (1) CN114327916B (en)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9935972B2 (en) * 2015-06-29 2018-04-03 Fortinet, Inc. Emulator-based malware learning and detection
US20200167687A1 (en) * 2018-11-27 2020-05-28 Amazon Technologies, Inc. Simulation modeling exchange
EP3884432A1 (en) * 2018-11-21 2021-09-29 Amazon Technologies, Inc. Reinforcement learning model training through simulation
WO2020180300A1 (en) * 2019-03-05 2020-09-10 Mentor Graphics Corporation Machine learning-based anomaly detections for embedded software applications
CN111882072B (en) * 2020-07-09 2023-11-14 北京华如科技股份有限公司 Intelligent model automatic course training method for playing chess with rules
CN112131786B (en) * 2020-09-14 2024-05-31 中国人民解放军军事科学院评估论证研究中心 Target detection and distribution method and device based on multi-agent reinforcement learning
CN113781856B (en) * 2021-07-19 2023-09-08 中国人民解放军国防科技大学 Training simulation system for combined combat weapon equipment and implementation method thereof

Also Published As

Publication number Publication date
CN114327916A (en) 2022-04-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant