CN110766169A - Transfer training optimization method and device for reinforcement learning, terminal and storage medium


Info

Publication number
CN110766169A
CN110766169A
Authority
CN
China
Prior art keywords
training
reinforcement learning
federal
model
environment
Prior art date
2019-10-31
Legal status (The legal status is an assumption and is not a legal conclusion.)
Pending
Application number
CN201911057308.9A
Other languages
Chinese (zh)
Inventor
梁新乐
刘洋
陈天健
董苗波
Current Assignee (The listed assignees may be inaccurate.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion.)
2019-10-31
Filing date
2019-10-31
Publication date
2020-02-07
Application filed by WeBank Co Ltd
Priority to CN201911057308.9A
Publication of CN110766169A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a transfer training optimization method and device for reinforcement learning, a terminal device, and a computer-readable storage medium. Training models obtained by training devices in preset environments through reinforcement learning are acquired; federated processing is performed on the training models to generate a federated model; and the federated model is migrated and adapted to each preset environment, so that the training devices in each preset environment optimize their reinforcement learning training according to the federated model. By making full use of the training models produced by the training devices in the preset environments, the invention protects the privacy of user data, avoids the high cost and long delay of data transmission in the traditional reinforcement learning training mode, and improves both the stability of industrial reinforcement learning models and the overall efficiency of model training.

Description

Transfer training optimization method and device for reinforcement learning, terminal and storage medium
Technical Field
The invention relates to the technical field of Fintech (financial technology), and in particular to a transfer training optimization method and device for reinforcement learning, a terminal device, and a computer-readable storage medium.
Background
The current practice of reinforcement learning in industry generally involves collecting a large amount of training data from simulation environments and real environments, gathering the collected data to train a reinforcement learning (or other machine learning) model, deploying the trained model back to the simulation and real environments, and then continuing to collect training data from those environments for further training.
Because data collection in the simulation and real environments and the training and distribution of the reinforcement learning model are separate phases, the traditional training mode must retain and transfer the data collected in each environment in order to satisfy the requirement of real-time reinforcement learning training. Constrained by data transmission bandwidth, transmission delay, user privacy, and other factors, the traditional reinforcement learning training mode therefore suffers from poor stability and low overall training efficiency.
Disclosure of Invention
The invention mainly aims to provide a transfer training optimization method and device for reinforcement learning, a terminal device, and a computer-readable storage medium, so as to solve the technical problems of poor stability and low overall training efficiency in the conventional reinforcement learning training mode.
In order to achieve the above object, the present invention provides a migration training optimization method for reinforcement learning, including:
acquiring the training models obtained by the training devices in each preset environment through reinforcement learning training;
performing federated processing on each training model to generate a federated model;
and migrating and adapting the federated model to each preset environment, so that the training devices in each preset environment optimize their reinforcement learning training according to the federated model.
Further, the step of migrating and adapting the federated model to each preset environment includes:
reading the environment parameters of each preset environment;
and adjusting the federated model according to the environment parameters, so that the federated model is migrated and adapted to each preset environment.
Further, the preset environments include simulation environments and real environments,
and the step of acquiring the training models obtained by the training devices in each preset environment through reinforcement learning training includes:
detecting the storage queue of training models completed by the training devices of each simulation environment through real-time reinforcement learning training, and randomly extracting training models from the storage queue according to a preset period;
and acquiring, according to the preset period, the training models completed by the training devices of each real environment through real-time reinforcement learning training.
Further, before the step of detecting the storage queue of training models completed by the training devices of each simulation environment through real-time reinforcement learning training, the method further includes:
constructing each simulation environment corresponding to each real environment, and performing reinforcement learning training in real time in each simulation environment based on the training devices to obtain the training models.
Further, the step of constructing each simulation environment corresponding to each real environment includes:
detecting the industrial field to which each real environment belongs;
and calling simulation software of that industrial field to construct each simulation environment, wherein the number of constructed simulation environments is greater than or equal to the number of real environments.
Further, the step of performing federated processing on each training model to generate a federated model includes:
extracting a preset federated learning rule for performing federated processing on each training model, wherein the federated learning rule belongs to horizontal federated learning technology;
and fusing the training models into a federated model according to the preset federated learning rule.
Further, the step of fusing the training models into a federated model includes:
reading each training model acquired at the current moment;
and fusing the acquired training models into a federated model used for the reinforcement learning training of the training devices in each preset environment.
In order to achieve the above object, the present invention further provides a migration training optimization device for reinforcement learning, including:
an acquisition module, used to acquire the training models obtained by the training devices in each preset environment through reinforcement learning training;
a federation module, used to perform federated processing on each training model to generate a federated model;
and a migration training module, used to migrate and adapt the federated model to each preset environment, so that the training devices in each preset environment optimize their reinforcement learning training according to the federated model.
The present invention also provides a terminal device, including: a memory, a processor, and a reinforcement learning migration training optimization program stored on the memory and executable on the processor, the program implementing, when executed by the processor, the steps of the reinforcement learning migration training optimization method described above.
The present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the reinforcement learning migration training optimization method described above.
According to the transfer training optimization method and device for reinforcement learning, the terminal device, and the computer-readable storage medium of the invention, the training models obtained by the training devices in each preset environment through reinforcement learning are acquired; federated processing is performed on each training model to generate a federated model; and the federated model is migrated and adapted to each preset environment, so that the training devices in each preset environment optimize their reinforcement learning training according to the federated model. The invention does not migrate or transmit the sample data that the training devices must collect for reinforcement learning in the preset environments, but instead makes full use of the training models those devices obtain through reinforcement training. This protects the privacy of user data, avoids the high cost and long delay of data transmission in the traditional reinforcement learning training mode, and optimizes both the stability of reinforcement learning training and the overall training efficiency.
Drawings
FIG. 1 is a schematic structural diagram of the hardware operating environment involved in an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a migration training optimization method for reinforcement learning according to a first embodiment of the present invention;
FIG. 3 is a flowchart illustrating a detailed process of step S100 in an embodiment of a migration training optimization method for reinforcement learning according to the present invention;
FIG. 4 is a schematic diagram of an application scenario of an embodiment of a migration training optimization method for reinforcement learning according to the present invention;
FIG. 5 is a schematic structural diagram of a migration training optimization device for reinforcement learning according to the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
As shown in fig. 1, fig. 1 is a schematic structural diagram of the hardware operating environment according to an embodiment of the present invention.
It should be noted that fig. 1 shows the hardware operating environment of the terminal device. The terminal device in the embodiments of the present invention may be a PC, a portable computer, or similar equipment.
As shown in fig. 1, the terminal device may include: a processor 1001 such as a CPU, a network interface 1004, a user interface 1003, a memory 1005, and a communication bus 1002. The communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display (Display) and an input unit such as a keyboard (Keyboard), and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory (e.g., a magnetic disk memory), and may alternatively be a storage device separate from the processor 1001.
Those skilled in the art will appreciate that the terminal device configuration shown in fig. 1 does not limit the terminal device, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
As shown in fig. 1, the memory 1005, as a kind of computer storage medium, may include an operating system, a network communication module, a user interface module, and a distributed task processing program. The operating system is a program that manages and controls the hardware and software resources of the terminal device and supports the execution of the distributed task processing program and other software or programs.
In the terminal device shown in fig. 1, the user interface 1003 is mainly used for data communication with each terminal; the network interface 1004 is mainly used for connecting to a background server and communicating with it; and the processor 1001 may be configured to invoke the reinforcement learning migration training optimization program stored in the memory 1005 and perform the following operations:
acquiring the training models obtained by the training devices in each preset environment through reinforcement learning training;
performing federated processing on each training model to generate a federated model;
and transferring the federated model to the training devices in each preset environment, so that each training device optimizes its reinforcement learning training according to the federated model.
Further, the processor 1001 may invoke a reinforcement learning migration training optimization program stored in the memory 1005, and further perform the following operations:
reading the environment parameters of each preset environment;
and adjusting the federated model according to the environment parameters, so that the federated model is migrated and adapted to each preset environment.
Further, the processor 1001 may invoke a reinforcement learning migration training optimization program stored in the memory 1005, and further perform the following operations:
detecting the storage queue of training models completed by the training devices of each simulation environment through real-time reinforcement learning training, and randomly extracting training models from the storage queue according to a preset period;
and acquiring, according to the preset period, the training models completed by the training devices of each real environment through real-time reinforcement learning training.
Further, before executing the step of detecting the storage queue of training models completed by the training devices of each simulation environment through real-time reinforcement learning training, the processor 1001 may call the reinforcement learning migration training optimization program stored in the memory 1005 and further perform the following operations:
constructing each simulation environment corresponding to each real environment, and performing reinforcement learning training in real time in each simulation environment based on the training devices to obtain the training models.
Further, the processor 1001 may invoke a reinforcement learning migration training optimization program stored in the memory 1005, and further perform the following operations:
detecting the industrial field to which each real environment belongs;
and calling simulation software of that industrial field to construct each simulation environment, wherein the number of constructed simulation environments is greater than or equal to the number of real environments.
Further, the processor 1001 may invoke a reinforcement learning migration training optimization program stored in the memory 1005, and further perform the following operations:
extracting a preset federated learning rule for performing federated processing on each training model, wherein the federated learning rule belongs to horizontal federated learning technology;
and fusing the training models into a federated model according to the preset federated learning rule.
Further, the processor 1001 may invoke a reinforcement learning migration training optimization program stored in the memory 1005, and further perform the following operations:
reading each training model acquired at the current moment;
and fusing the acquired training models into a federated model used for the reinforcement learning training of the training devices in each preset environment.
Based on the above structure, various embodiments of the migration training optimization method for reinforcement learning of the present invention are provided.
Referring to fig. 2, fig. 2 is a flowchart illustrating a migration training optimization method for reinforcement learning according to a first embodiment of the present invention.
While a logical order is shown in the flow chart, in some cases, the steps shown or described may be performed in an order different than that shown.
The migration training optimization method for reinforcement learning according to the embodiment of the present invention is applied to the terminal device, and the terminal device according to the embodiment of the present invention may be a terminal device such as a PC or a portable computer, and is not limited specifically herein.
The migration training optimization method for reinforcement learning of this embodiment includes the following steps:
Step S100: acquire the training models obtained by the training devices in each preset environment through reinforcement learning training.
In the horizontal federated learning system formed by the federated learning server, the training devices of each simulation environment, and the training devices of each real environment, the federated learning server acquires in real time the training models that the training devices of each simulation environment and each real environment obtain through real-time reinforcement learning training.
It should be noted that, in this embodiment, the preset environments include simulation environments and real environments.
Further, referring to fig. 3, which is a detailed flowchart of step S100 of the migration training optimization method for reinforcement learning of the present invention, step S100 includes:
Step S101: acquire, according to a preset period, the training models completed by the training devices of each simulation environment through real-time reinforcement learning training.
Step S102: acquire, according to the preset period, the training models completed by the training devices of each real environment through real-time reinforcement learning training.
It should be noted that, in this embodiment, the preset period is a time interval (for example, 10 minutes) set autonomously in advance by an operator, based on the capacity of the federated learning server to receive the training models uploaded by the training devices of each simulation environment and each real environment. It should be understood that the migration training optimization method for reinforcement learning of the present invention does not limit the specific value of the preset period.
Specifically, for example, in the reinforcement learning migration training application scenario shown in fig. 4, the federated learning server collects, every 10 minutes, the training models that the training device performing reinforcement learning training in simulation environment A, the training device in simulation environment B, Agent 1 (an Agent being the learner of the reinforcement learning model) performing reinforcement learning training in the real environment RL, and Agent 2 in the real environment RL each generate in real time after independent reinforcement learning training in the current industrial field (for example, autonomous driving of unmanned vehicles).
In this embodiment, the reinforcement learning model trained by an Agent in the real environment RL includes, but is not limited to, deep reinforcement learning models (for example, the DQN, DDPG, A3C, and PPO models). It should be understood that the migration training optimization method for reinforcement learning of the present invention does not limit the type of the reinforcement learning model used in the real environment.
Further, in this embodiment, before step S101 of acquiring, according to a preset period, the training models completed by the training devices of each simulation environment through real-time reinforcement learning training, the migration training optimization method for reinforcement learning of the present invention further includes:
Step A: construct each simulation environment corresponding to each real environment, and perform reinforcement learning training in real time in each simulation environment based on the training devices to obtain the training models.
Based on called simulation software, several simulation environments corresponding to the current real environment are established for the training devices that perform reinforcement learning, and the training devices in these established simulation environments each independently perform stand-alone training in real time, so that each established simulation environment generates its own training model.
It should be noted that, in this embodiment, when a training device performing reinforcement learning training (i.e., a reinforcement learning Agent in a simulation environment) trains in the simulation environment, it randomly extracts a certain amount of sample data from a fixed-length experience memory queue storing the data blocks collected in the real environment. This avoids the influence that the temporal ordering of the data collected in the real environment would otherwise have on the accuracy of the model training result, and improves the robustness of the reinforcement learning model.
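As a minimal sketch (not part of the patent text) of the fixed-length experience memory queue with random extraction described above, assuming a simple tuple-based transition format:

```python
import random
from collections import deque

class ExperienceQueue:
    """Hypothetical fixed-length experience memory queue, as described above: once
    the queue is full the oldest data blocks are discarded, and training batches are
    drawn uniformly at random to break the temporal ordering of the collected data."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # fixed length: oldest entries drop off

    def push(self, state, action, reward, next_state, done):
        # store one data block collected in the real environment
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=32):
        # random extraction of a certain amount of sample data for training
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))
```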
In this embodiment, performing reinforcement learning training in several simulation environments associated with the real environment improves the robustness of the reinforcement learning model and accelerates its training.
It should also be noted that, because the cost of constructing a simulation environment is low, several simulation environments may be constructed at the same time so as to increase the number of training devices performing reinforcement learning training concurrently. This yields a more robust training model and improves the adaptability of the reinforcement learning model to different environments while saving training time and training cost.
Further, in this embodiment, in the step a, the step of constructing each simulation environment corresponding to each real environment includes:
step A1, detecting the industrial field to which each of the real environments belongs.
When a simulation environment corresponding to the real environment is constructed, the industrial field to which the current real environment belongs is detected and identified.
In this embodiment, the industrial fields include, but are not limited to, the control of an industrial robot, the control of an unmanned vehicle, the control of an AGV cart, the control of an unmanned vehicle, the control of a sweeping robot, the optimization control of industrial process production equipment (including, but not limited to, the combustion optimization of a power generation boiler, the optimization control of a rectifying tower, the control of a steel blast furnace, and the like).
Step A2: call simulation software of that industrial field to construct each simulation environment.
After the industrial field to which the real environment belongs is detected, mature simulation software of that field is called to construct the simulation environment corresponding to the real environment.
It should be noted that, in this embodiment, in order to increase the number of training devices performing reinforcement learning training concurrently in simulation, and thereby obtain a more robust training model, the number of simulation environments constructed may be equal to or greater than the number of real environments.
Specifically, for example, when the industrial field to which the real environment belongs is detected and recognized to be autonomous driving of unmanned vehicles, autonomous-driving simulation software such as AirSim, CARLA, and DeepDrive is called to construct the simulation model of the current real environment.
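A minimal sketch of steps A1-A2 follows (the registry contents and launcher names are hypothetical placeholders, not real APIs of the named simulators):

```python
# Hypothetical registry mapping a detected industrial field to simulation software.
# Keys and launcher names are placeholders, not real APIs of AirSim/CARLA/DeepDrive.
SIMULATOR_REGISTRY = {
    "autonomous_driving": ["airsim", "carla", "deepdrive"],
    "industrial_robot": ["robot_sim"],
    "boiler_combustion": ["boiler_sim"],
}

def build_simulation_environments(real_envs):
    """Steps A1-A2: detect each real environment's industrial field and construct
    at least as many simulation environments as there are real environments."""
    sim_envs = []
    for env in real_envs:
        field = env["industrial_field"]             # step A1: detect the field
        for sim_name in SIMULATOR_REGISTRY[field]:  # step A2: call field-specific software
            sim_envs.append({"field": field, "simulator": sim_name})
    assert len(sim_envs) >= len(real_envs)          # requirement stated in the text above
    return sim_envs
```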
Step S200: perform federated processing on each training model to generate a federated model.
In the horizontal federated learning system formed by the federated learning server, the training devices of each simulation environment, and the training devices of each real environment, the federated learning server performs federated processing, according to the preset federated learning rule, on the acquired training models that the simulation-environment and real-environment training devices obtain through real-time reinforcement learning training, thereby fusing the several training models into one federated model.
Specifically, for example, in the reinforcement learning migration training application scenario shown in fig. 4, the federated learning server periodically collects the training models obtained by reinforcement learning training in each simulation environment and each real environment, and fuses the collected models according to the federated learning rule into a new federated model for the reinforcement learning training of the training devices (i.e., the reinforcement learning Agents) in each simulation environment and each real environment.
Step S300: migrate and adapt the federated model to each preset environment, so that the training devices of each preset environment optimize their reinforcement learning training according to the federated model.
In the horizontal federated learning system formed by the federated learning server, the training devices of each simulation environment, and the training devices of each real environment, the federated learning server migrates and adapts the federated model obtained by federated processing under the federated learning rule to each simulation environment and each real environment, so that the training devices in the different simulation and real environments train their respective reinforcement learning models based on the migrated and adapted federated model, thereby updating the reinforcement learning model of the current real environment as a whole.
It should be noted that, in this embodiment, the federated processing of training models from the different environments (simulation and real) and the migration and distribution of the resulting federated model to the reinforcement learning Agents in each environment may be performed in an "asynchronous" manner. That is, the federated learning server aggregates, on a fixed period, whatever training models have been received by the current moment, instead of waiting for the reinforcement learning Agents of all environments (all simulation environments and all real environments) to finish uploading their models; likewise, the Agents in the different environments train and update their reinforcement learning models independently on their own fixed periods. This avoids the extra communication and computation that synchronous aggregation of training models or synchronous training and updating of the reinforcement learning models would incur, saves training time, and optimizes overall training efficiency.
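The following is an illustrative sketch (not part of the patent text) of how such an asynchronous aggregation loop on the federated learning server side could be organized; the helpers federate and broadcast, the receive_model entry point, and the period value are all assumptions:

```python
import time

AGGREGATION_PERIOD_S = 600   # fixed period, e.g. the 10 minutes used in the examples above
pending_models = []          # training models uploaded by Agents since the last aggregation

def receive_model(model):
    # called whenever any Agent (simulation or real) uploads a training model
    pending_models.append(model)

def server_loop(federate, broadcast):
    """Asynchronous aggregation: on each fixed period, fuse whatever models have
    arrived by the current moment instead of waiting for every Agent to upload."""
    while True:
        time.sleep(AGGREGATION_PERIOD_S)
        if not pending_models:
            continue                          # nothing arrived; Agents keep training locally
        snapshot = list(pending_models)       # models received by the current moment
        pending_models.clear()
        federated_model = federate(snapshot)  # e.g. weighted averaging of parameters
        broadcast(federated_model)            # migrate and distribute to all environments
```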
Further, in step S300, migrating and distributing the federated model to each preset environment includes:
Step S301: read the environment parameters of each preset environment.
The environment parameters of every simulation environment and every real environment are read; the parameters read include, but are not limited to, the environment features, the environment training reward function, and the environment training output controls.
Step S302: adjust the federated model according to the environment parameters, so that it is migrated and adapted to each preset environment.
Taking any one of the simulation and real environments as the standard environment, the local migration training models of the other training environments are adjusted to match the current standard environment. The federated model that the federated learning server obtains by performing federated processing on the training models according to the preset federated learning rule is then migration-adjusted using the environment parameters of the standard environment, such as its environment features, environment training reward function, and environment training output controls, so that the operating parameters of the federated model match any one of the simulation and real environments.
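Purely as a sketch of step S302 (the field names and the adjustment itself are hypothetical, since the patent does not prescribe a concrete data layout):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EnvParams:
    # the environment parameters named in step S301 (field names are illustrative)
    features: dict            # environment features
    reward_fn: Callable       # environment training reward function
    output_controls: dict     # environment training output controls

def adapt_federated_model(federated_model: dict, env_params: EnvParams) -> dict:
    """Adjust the federated model so that its operating parameters match one
    preset environment (simulation or real) before it is distributed there."""
    adapted = dict(federated_model)                        # do not mutate the server copy
    adapted["reward_fn"] = env_params.reward_fn            # align the training signal
    adapted["action_bounds"] = env_params.output_controls  # align output controls
    adapted["input_spec"] = env_params.features            # align the observation space
    return adapted
```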
In this embodiment, in the horizontal federated learning system formed by the federated learning server, the training devices of each simulation environment, and the training devices of each real environment, the federated learning server acquires in real time the training models that those training devices obtain through real-time reinforcement learning training; it performs federated processing on them according to the preset federated learning rule, fusing the several training models into a federated model; and it migrates and adapts that federated model to each simulation environment and each real environment, so that the training devices in the different simulation and real environments train their respective reinforcement learning models based on the migrated and adapted federated model, thereby updating the reinforcement learning model of the current real environment as a whole.
The method and device thus realize real-time transfer of knowledge between the simulation environments and the real environments. The sample data that the training devices collect for reinforcement learning in the preset environments is not migrated or transmitted; instead, the training models obtained by those devices through reinforcement training are fully utilized. This protects the privacy of user data, saves a large amount of data transmission cost (network bandwidth and time), and optimizes both the robustness of reinforcement learning and the overall efficiency of model training.
Further, a second embodiment of the migration training optimization method for reinforcement learning according to the present invention is provided based on the first embodiment of the migration training optimization method for reinforcement learning.
In the second embodiment of the migration training optimization method for reinforcement learning of the present invention, step S200 of performing federated processing on each training model to generate a federated model includes:
Step S201: extract the preset federated learning rule for performing federated processing on each training model.
Step S202: fuse the training models into a federated model according to the preset federated learning rule.
Before the federated learning server starts federated processing of the acquired training models, it extracts the federated learning rule from the currently constructed horizontal federated learning system, and then, according to the extracted rule, fuses the training models into a federated model for the reinforcement learning training of each simulation environment and each real environment.
Further, in step S202, fusing the training models into a federated model includes:
Step S2021: read each training model acquired at the current moment.
Step S2022: fuse the acquired training models into a federated model used for the reinforcement learning training of the training devices in each preset environment.
Specifically, for example, the federated learning server checks, on a fixed period (e.g., 10 minutes), which locally trained models from the simulation and real environments have been acquired by the current moment, and then applies weighted averaging to the models acquired in the current period, using the weight coefficients specified in the extracted federated learning rule, so that the training models are fused into a federated model suitable for training the reinforcement learning models of each simulation environment and each real environment.
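As an illustrative sketch only (assuming each training model is a dict of NumPy parameter arrays and that the weight coefficients come from the federated learning rule), the weighted-average fusion could look like this, in the spirit of the Fed-AVG rule mentioned below:

```python
import numpy as np

def federate(models, weights=None):
    """Fuse several training models into one federated model by weighted averaging
    of their parameters (a Fed-AVG-style rule, shown only as a sketch)."""
    if weights is None:
        weights = [1.0 / len(models)] * len(models)  # default: plain average
    assert abs(sum(weights) - 1.0) < 1e-6            # rule-specified coefficients sum to 1
    federated = {}
    for name in models[0]:                           # assume shared parameter names
        federated[name] = sum(w * m[name] for w, m in zip(weights, models))
    return federated

# usage sketch: weight a real-environment model above a simulation-environment model
# fused = federate([sim_model, real_model], weights=[0.4, 0.6])
```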
It should be noted that, in this embodiment, the preset federated learning rule belongs to horizontal federated learning technology. Since the federated learning server, the simulation-environment training devices, and the real-environment training devices are constructed as a horizontal federated learning system, the federated learning rule by which the server federates the acquired training models is a horizontal one: concretely, the preset rule may specify how the training models from reinforcement learning in different environments are fused into one federated model, and such rules include, but are not limited to, the Fed-AVG algorithm, Trimmed-mean SGD, and the like.
It should be understood that when the federated learning server and the training devices of the simulation and real environments are constructed into another kind of federated learning system, the preset federated learning rule is adjusted accordingly; the migration training optimization method for reinforcement learning of the present invention does not limit the federated learning mode or the federated learning rule it relies on.
In this embodiment, on the basis of the horizontal federated learning rule of the horizontal federated learning system constructed from the federated learning server and the training devices (reinforcement learning Agents) of each simulation environment and each real environment, federated learning is performed on the several training models acquired by the server, generating a federated model for training the reinforcement learning models of each simulation environment and each real environment. Migration and fusion of the training models produced by the simulation and real environments are thus realized, improving the robustness of the overall reinforcement learning model and the speed of model training.
Further, a third embodiment of the migration training optimization method for reinforcement learning according to the present invention is provided based on the first embodiment of the migration training optimization method for reinforcement learning.
In the third embodiment of the migration training optimization method for reinforcement learning of the present invention, step S300 of migrating and adapting the federated model to each preset environment, so that the training devices of each preset environment perform reinforcement learning training according to it, further includes:
Step B: package the federated model into an operation instruction, and migrate and distribute the instruction to each training device, so that each training device starts reinforcement learning training according to the instruction.
In the horizontal federated learning system constructed from the federated learning server and the training devices (reinforcement learning Agents) of each simulation environment and each real environment, the server packages the federated model obtained by federated processing in real time into an operation instruction that controls the start-up and operation of the training devices. If one of the training devices has not yet performed local training of the reinforcement learning model before the current moment, that device takes the federated model distributed by the server as its initial sample model according to the operation instruction, and starts operating in its simulation or real environment to perform reinforcement learning training based on that model.
Step C: migrate and distribute the federated model to each training device, so that each training device continues with a new round of reinforcement learning training according to the received federated model.
In the same horizontal federated learning system, the federated learning server migrates and distributes the federated model obtained by federated processing in real time, so that if one of the training devices has already been cyclically performing local training of the reinforcement learning model before the current moment, it takes the distributed federated model as a new sample model and continues reinforcement learning training in its simulation or real environment according to that model.
In this embodiment, by detecting whether a training device (reinforcement learning Agent) in a given environment has performed local training of the reinforcement learning model before receiving the federated model migrated and distributed by the federated learning server, the device either starts or continues reinforcement learning training in its environment based on the received model, which improves both the flexibility and the efficiency of reinforcement learning model training.
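As a sketch under stated assumptions (the Agent class and its methods are hypothetical), the training-device side of steps B and C, starting training if no local model exists yet and otherwise continuing with the received federated model as the new sample model, might look like:

```python
class RLAgent:
    """Hypothetical training device (reinforcement learning Agent) illustrating
    the start-or-continue decision of steps B and C above."""

    def __init__(self, env):
        self.env = env
        self.model = None   # None means no local training has been performed yet

    def on_federated_model(self, federated_model):
        if self.model is None:
            # Step B: take the distributed federated model as the initial sample
            # model and start operating the environment for reinforcement learning.
            self.model = federated_model
            self.env.start()
        else:
            # Step C: take the distributed federated model as a new sample model
            # and continue with a new round of reinforcement learning training.
            self.model = federated_model
        self.train_one_round()

    def train_one_round(self):
        pass  # local reinforcement learning against self.env (details omitted)
```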
In addition, referring to fig. 5, an embodiment of the present invention further provides a migration training optimization device for reinforcement learning, including:
an acquisition module, used to acquire the training models obtained by the training devices in each preset environment through reinforcement learning training;
a federation module, used to perform federated processing on each training model to generate a federated model;
and a migration training module, used to migrate and adapt the federated model to each preset environment, so that the training devices in each preset environment optimize their reinforcement learning training according to the federated model.
Preferably, the migration training module includes:
a reading unit, used to read the environment parameters of each preset environment;
and an adaptation unit, used to adjust the federated model according to the environment parameters, so that it is migrated and adapted to each preset environment.
Preferably, the acquisition module includes:
a first acquisition unit, used to acquire, according to a preset period, the training models completed by the training devices of each simulation environment through real-time reinforcement learning training;
and a second acquisition unit, used to acquire, according to the preset period, the training models completed by the training devices of each real environment through real-time reinforcement learning training.
Preferably, the acquisition module further includes:
a construction unit, used to construct each simulation environment corresponding to each real environment and to perform reinforcement learning training in real time in each simulation environment based on the training devices, obtaining the training models.
Preferably, the construction unit includes:
a detection subunit, used to detect the industrial field to which each real environment belongs;
and a calling subunit, used to call simulation software of that industrial field to construct each simulation environment, wherein the number of constructed simulation environments is greater than or equal to the number of real environments.
Preferably, the federation module includes:
an extraction unit, used to extract the preset federated learning rule for performing federated processing on each training model, wherein the federated learning rule belongs to horizontal federated learning technology;
and a processing unit, used to fuse the training models into a federated model according to the preset federated learning rule.
Preferably, the processing unit further includes:
a reading subunit, used to read each training model acquired at the current moment;
and a processing subunit, used to fuse the acquired training models into a federated model for the reinforcement learning training of the training devices in each preset environment.
In addition, an embodiment of the present invention further provides a transfer training optimization device for reinforcement learning, applied to the training devices performing reinforcement learning training in each preset environment, the device including:
a model training module, used to start operation and perform reinforcement learning training when the current training device receives the federated model for reinforcement learning training; or,
the model training module is further used to make the current training device continue reinforcement learning training according to the received federated model.
In addition, an embodiment of the present invention further provides a computer-readable storage medium, applied to a computer; the computer-readable storage medium, which may be a non-volatile computer-readable storage medium, stores a reinforcement learning migration training optimization program which, when executed by a processor, implements the steps of the reinforcement learning migration training optimization method described above.
For the steps implemented when the reinforcement learning migration training optimization program is executed by the processor, reference may be made to the embodiments of the migration training optimization method for reinforcement learning of the present invention, which are not repeated here.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A reinforcement learning migration training optimization method, characterized by comprising the following steps:
acquiring the training models obtained by the training devices in each preset environment through reinforcement learning training;
performing federated processing on each training model to generate a federated model;
and migrating and adapting the federated model to each preset environment, so that the training devices of each preset environment optimize their reinforcement learning training according to the federated model.
2. The reinforcement learning migration training optimization method of claim 1, wherein the step of migrating and adapting the federated model to each of the preset environments comprises:
reading the environment parameters of each preset environment;
and adjusting the federated model according to the environment parameters, so that the federated model is migrated and adapted to each preset environment.
3. The reinforcement learning migration training optimization method of claim 1, wherein the preset environments comprise simulation environments and real environments,
and the step of acquiring the training models obtained by the training devices in each preset environment through reinforcement learning training comprises:
acquiring, according to a preset period, the training models completed by the training devices of each simulation environment through real-time reinforcement learning training;
and acquiring, according to the preset period, the training models completed by the training devices of each real environment through real-time reinforcement learning training.
4. The reinforcement learning migration training optimization method of claim 3, wherein before the step of acquiring, according to a preset period, the training models completed by the training devices of each simulation environment through real-time reinforcement learning training, the method further comprises:
constructing each simulation environment corresponding to each real environment, and performing reinforcement learning training in real time in each simulation environment based on the training devices to obtain the training models.
5. The reinforcement learning migration training optimization method of claim 4, wherein the step of constructing each simulation environment corresponding to each real environment comprises:
detecting the industrial field to which each real environment belongs;
and calling simulation software of that industrial field to construct each simulation environment, wherein the number of constructed simulation environments is greater than or equal to the number of real environments.
6. The reinforcement learning migration training optimization method of claim 1, wherein the step of performing federated processing on each training model to generate a federated model comprises:
extracting a preset federated learning rule for performing federated processing on each training model, wherein the federated learning rule belongs to horizontal federated learning technology;
and fusing the training models into a federated model according to the preset federated learning rule.
7. The reinforcement learning migration training optimization method of claim 6, wherein the step of fusing the training models into a federated model comprises:
reading each training model acquired at the current moment;
and fusing the acquired training models into a federated model for the reinforcement learning training of the training devices in each preset environment.
8. A reinforcement learning migration training optimization apparatus, characterized by comprising:
an acquisition module, used to acquire the training models obtained by the training devices in each preset environment through reinforcement learning training;
a federation module, used to perform federated processing on each training model to generate a federated model;
and a migration training module, used to migrate and adapt the federated model to each preset environment, so that the training devices in each preset environment optimize their reinforcement learning training according to the federated model.
9. A terminal device, characterized in that the terminal device comprises: a memory, a processor, and a reinforcement learning migration training optimization program stored on the memory and executable on the processor, the reinforcement learning migration training optimization program when executed by the processor implementing the steps of the reinforcement learning migration training optimization method of any one of claims 1 to 7.
10. A storage medium, characterized in that the storage medium has stored thereon a computer program which, when being executed by a processor, carries out the steps of the reinforcement learning migration training optimization method according to any one of claims 1 to 7.
Application CN201911057308.9A, priority date 2019-10-31, filed 2019-10-31: Transfer training optimization method and device for reinforcement learning, terminal and storage medium. Status: Pending. Publication: CN110766169A (en).

Priority Applications (1)

Application Number: CN201911057308.9A; Priority Date: 2019-10-31; Filing Date: 2019-10-31; Title: Transfer training optimization method and device for reinforcement learning, terminal and storage medium

Applications Claiming Priority (1)

Application Number: CN201911057308.9A; Priority Date: 2019-10-31; Filing Date: 2019-10-31; Title: Transfer training optimization method and device for reinforcement learning, terminal and storage medium

Publications (1)

Publication Number: CN110766169A; Publication Date: 2020-02-07

Family

Family ID: 69335125

Family Applications (1)

Application Number: CN201911057308.9A; Priority Date: 2019-10-31; Filing Date: 2019-10-31; Status: Pending; Title: Transfer training optimization method and device for reinforcement learning, terminal and storage medium

Country Status (1)

CN: CN110766169A (en)


Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109871702A (en) * 2019-02-18 2019-06-11 深圳前海微众银行股份有限公司 Federal model training method, system, equipment and computer readable storage medium
CN109993308A (en) * 2019-03-29 2019-07-09 深圳先进技术研究院 Learning system and method, shared platform and method, medium are shared based on cloud platform

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111290381A (en) * 2020-02-10 2020-06-16 深圳前海微众银行股份有限公司 Federal learning experiment system based on unmanned vehicle
WO2021164404A1 (en) * 2020-02-20 2021-08-26 中国银联股份有限公司 Inspection method and apparatus
WO2021184836A1 (en) * 2020-03-20 2021-09-23 深圳前海微众银行股份有限公司 Method and apparatus for training recognition model, device, and readable storage medium
WO2021219054A1 (en) * 2020-04-29 2021-11-04 深圳前海微众银行股份有限公司 Transverse federated learning system optimization method, apparatus and device, and readable storage medium
CN111708640A (en) * 2020-06-23 2020-09-25 苏州联电能源发展有限公司 Edge calculation-oriented federal learning method and system
CN112131786A (en) * 2020-09-14 2020-12-25 中国人民解放军军事科学院评估论证研究中心 Target detection and distribution method and device based on multi-agent reinforcement learning
CN112131786B (en) * 2020-09-14 2024-05-31 中国人民解放军军事科学院评估论证研究中心 Target detection and distribution method and device based on multi-agent reinforcement learning
CN112214791A (en) * 2020-09-24 2021-01-12 广州大学 Privacy policy optimization method and system based on reinforcement learning and readable storage medium
CN112214791B (en) * 2020-09-24 2023-04-18 广州大学 Privacy policy optimization method and system based on reinforcement learning and readable storage medium
CN112359159A (en) * 2020-11-10 2021-02-12 中冶东方工程技术有限公司 Hot blast stove automatic burning method and system based on deep reinforcement learning
CN112359159B (en) * 2020-11-10 2022-05-03 中冶东方工程技术有限公司 Hot blast stove automatic burning method and system based on deep reinforcement learning
WO2022121804A1 (en) * 2020-12-10 2022-06-16 华为技术有限公司 Method for semi-asynchronous federated learning and communication apparatus
CN112801731A (en) * 2021-01-06 2021-05-14 广东工业大学 Federal reinforcement learning method for order taking auxiliary decision
CN112884165A (en) * 2021-03-18 2021-06-01 中国地质大学(北京) Federal machine learning-oriented full-flow service migration method and system
CN112884164B (en) * 2021-03-18 2023-06-23 中国地质大学(北京) Federal machine learning migration method and system for intelligent mobile terminal
CN112884165B (en) * 2021-03-18 2023-07-04 中国地质大学(北京) Full-flow service migration method and system for federal machine learning
CN112884164A (en) * 2021-03-18 2021-06-01 中国地质大学(北京) Federal machine learning migration method and system for intelligent mobile terminal
CN113386133A (en) * 2021-06-10 2021-09-14 贵州恰到科技有限公司 Control method of reinforcement learning robot
CN113421638A (en) * 2021-06-22 2021-09-21 平安科技(深圳)有限公司 Model generation method and device based on transfer learning and computer equipment
CN113421638B (en) * 2021-06-22 2022-07-15 平安科技(深圳)有限公司 Model generation method and device based on transfer learning and computer equipment
CN113392539A (en) * 2021-07-13 2021-09-14 北京邮电大学 Robot communication control method, system and equipment based on federal reinforcement learning
CN115422739A (en) * 2022-08-30 2022-12-02 中国核动力研究设计院 Method, device, terminal and readable storage medium for fusing plural selectable models
CN115422739B (en) * 2022-08-30 2023-12-01 中国核动力研究设计院 Complex selectable model fusion method, device, terminal and readable storage medium

Similar Documents

Publication Publication Date Title
CN110766169A (en) Transfer training optimization method and device for reinforcement learning, terminal and storage medium
CN111091200B (en) Updating method and system of training model, intelligent device, server and storage medium
CN110378463B (en) Artificial intelligence model standardization training platform and automatic system
WO2020181599A1 (en) Model application method and system, and model management method and server
CN107179697A (en) Intelligent household equipment control method, device and system
WO2018010381A1 (en) Multi-target elevator group control system and method
US20180293513A1 (en) Computer system, and method and program for controlling edge device
CN106991095B (en) Machine exception handling method, learning rate adjusting method and device
CN106597881A (en) Cloud Service Robot Based on Distributed Decision Algorithm
CN112416323B (en) Control code generation method, operation method, device, equipment and storage medium
CN105252533A (en) Robot interactive system, cloud computing platform, user terminal and robot
CN112817694B (en) Automatic load balancing method and device for distributed system
CN102375918A (en) Interaction virtual role system between facilities
CN113642243A (en) Multi-robot deep reinforcement learning system, training method, device and medium
WO2017012499A1 (en) Unmanned aerial vehicle control method, apparatus and system
CN104092656A (en) Mobile terminal, and data transmission system and method
CN106686074A (en) System and method for quickly constructing IOT (Internet of Things) application
CN109657892A (en) Machine Activity recognition method, apparatus, equipment and medium based on data analysis
CN117648123A (en) Micro-service rapid integration method, system, equipment and storage medium
CN206350033U (en) A kind of system of rapid build Internet of Things application
KR20190069637A (en) Charging method and system in multi cloud in the same way
CN114327846A (en) Cluster capacity expansion method and device, electronic equipment and computer readable storage medium
CN103517311A (en) Method and device simulating wireless network
CN104750239A (en) Application method and equipment based on spatial gesture access terminal equipment
CN105205723A (en) Modeling method and device based on social application

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination