CN111652371A - Offline reinforcement learning network training method, device, system and storage medium - Google Patents

Offline reinforcement learning network training method, device, system and storage medium

Info

Publication number
CN111652371A
Authority
CN
China
Prior art keywords
network
action
risk
sample data
updating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010479469.3A
Other languages
Chinese (zh)
Inventor
詹仙园
徐浩然
张玥
霍雨森
朱翔宇
李春洋
邓欣
郑宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong City Beijing Digital Technology Co Ltd
Original Assignee
Jingdong City Beijing Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong City Beijing Digital Technology Co Ltd
Priority to CN202010479469.3A
Publication of CN111652371A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Abstract

The invention relates to an offline reinforcement learning network training method, device, system and storage medium. The method comprises the following steps: updating network parameters of a reward network and a risk network of an action network according to sample data and the current action network; acquiring the distribution similarity between the distribution of the sample data and the distribution of the action network; and updating the action network based on the reward network, the risk network and the distribution similarity. In the embodiment of the invention, the reward network and the risk network are updated according to the sample data and the action network so that they adapt to the action network; the distribution similarity between the distribution of the sample data and the distribution of the output of the action network is obtained; the action network is then updated based on the evaluation of the action network by the reward network and the risk network and on the distribution similarity; and the optimization of the action network is completed after the above steps have been repeated a preset number of times.

Description

Offline reinforcement learning network training method, device, system and storage medium
Technical Field
The invention relates to the technical field of big data processing, in particular to an offline reinforcement learning network training method, device and system and a storage medium.
Background
Most Reinforcement Learning (RL) algorithms, for example in the gaming and robotics domains, learn good strategies only after a large number of trial-and-error attempts in a simulation environment. In real-world scenarios (e.g. autonomous driving, complex industrial system control), however, no perfect simulated environment is available; only a collection of pre-collected environment interaction data exists, which may also include some unsafe attempts.
Therefore, how to train, from such offline data, a strategy that maximizes long-term rewards while satisfying safety constraints is an urgent problem to be solved.
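The problem can be stated compactly as a constrained offline reinforcement learning objective; the following formulation is given for orientation and uses the notation introduced later in this description (r is the single-step reward, c is the single-step risk, γ is the attenuation coefficient and D is the preset risk threshold), rather than being quoted from the original text:

maximize E[Σ_t γ^t × r(s_t, a_t)]  subject to  E[Σ_t γ^t × c(s_t, a_t)] ≤ D,

where the maximization is over the action network and, because only pre-collected data is available, the learned action network is additionally required to stay close to the distribution of that data.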
Disclosure of Invention
In order to solve the problems in the prior art, at least one embodiment of the present invention provides an offline reinforcement learning network training method, apparatus, system and storage medium.
In a first aspect, an embodiment of the present invention provides an offline reinforcement learning network training method, where the method includes:
acquiring sample data;
updating network parameters of a reward network and a risk network of the action network according to the sample data and the current action network;
obtaining the distribution similarity of the distribution of the sample data and the distribution of the action network;
updating the action network based on the reward network, the risk network and the distribution similarity, and acquiring the updating times of the action network;
and when the updating times are less than or equal to a preset threshold value, updating the action network again until the updating times are greater than the preset threshold value.
Based on the above technical solutions, the embodiments of the present invention may be further improved as follows.
With reference to the first aspect, in a first embodiment of the first aspect, the updating network parameters of a reward network and a risk network of the action network according to the sample data and a current action network includes:
calculating a first optimized network parameter for the reward network in the following way:
φ_R = argmin E[(r + γ × max_{a_{t+1}} Q_R(s_{t+1}, a_{t+1}) - Q_R(s_t, a_t))^2]
wherein φ_R is the first optimized network parameter of the reward network, argmin() takes the argument that minimizes the function, r is the single-step reward value in the sample data, γ is the attenuation coefficient in the reinforcement learning method, s_t is the state value of the sample data at time t, a_t is the action value obtained by inputting s_t into the action network, s_{t+1} is the state value of the sample data at time t+1, a_{t+1} is the action value obtained by inputting s_{t+1} into the action network, Q_R is the reward network, max_{a_{t+1}} Q_R(s_{t+1}, a_{t+1}) is the maximum value of the reward network Q_R over different a_{t+1}, and the expectation E[·] is taken over the sample data;
updating the network parameters of the reward network according to the first optimized network parameters;
calculating a second optimized network parameter for the risk network in the following way:
φ_C = argmin E[(c + γ × E_{a_{t+1}~A_π}[Q_C(s_{t+1}, a_{t+1})] - Q_C(s_t, a_t))^2]
wherein φ_C is the second optimized network parameter of the risk network, argmin() takes the argument that minimizes the function, c is the single-step risk value in the sample data, γ is the attenuation coefficient in the reinforcement learning method, s_t is the state value of the sample data at time t, a_t is the action value obtained by inputting s_t into the action network, s_{t+1} is the state value of the sample data at time t+1, a_{t+1} is the action value obtained by inputting s_{t+1} into the action network, Q_C is the risk network, E_{a_{t+1}~A_π}[Q_C(s_{t+1}, a_{t+1})] is the expected value of the risk network Q_C over different a_{t+1}, A_π is the action network, and the expectation E[·] is taken over the sample data;
and updating the network parameters of the risk network according to the second optimized network parameters.
With reference to the first aspect, in a second embodiment of the first aspect, the obtaining the distribution similarity between the distribution of the sample data and the distribution of the action network includes:
and calculating the distance between the distribution of the sample data and the distribution of the action network based on a reverse relative entropy distance algorithm, to serve as the distribution similarity.
With reference to the first aspect or the first or second embodiment of the first aspect, in a third embodiment of the first aspect, the updating the action network based on the reward network, risk network, and distribution similarity includes:
obtaining a third optimized network parameter of the action network by the following calculation mode:
φ_π = argmax[Q_R - τ×Q_C + L];
τ = argmin|Q - D|;
wherein φ_π is the third optimized network parameter of the action network, Q_R is the reward network, Q_C is the risk network, L is the distribution similarity, τ is a Lagrangian coefficient, and D is a preset threshold for the risk network;
updating the action network by the third optimized network parameter.
In a second aspect, an embodiment of the present invention provides an offline reinforcement learning network training apparatus, where the apparatus includes:
an acquisition unit configured to acquire sample data;
the first updating unit is used for updating network parameters of a reward network and a risk network of the action network according to the sample data and the current action network;
the processing unit is used for acquiring the distribution similarity of the distribution of the sample data and the distribution of the action network;
the second updating unit is used for updating the action network based on the reward network, the risk network and the distribution similarity and acquiring the updating times of the action network; and when the updating times are less than or equal to a preset threshold value, obtaining the sample data again through the first obtaining unit, and updating the action network again until the updating times are more than the preset threshold value.
With reference to the second aspect, in a first embodiment of the second aspect, the first updating unit is specifically configured to calculate the first optimized network parameter of the reward network and to update the network parameters of the reward network according to the first optimized network parameter;
wherein the network parameters of the reward network are calculated in the following way:
φ_R = argmin E[(r + γ × max_{a_{t+1}} Q_R(s_{t+1}, a_{t+1}) - Q_R(s_t, a_t))^2]
wherein φ_R is the first optimized network parameter of the reward network, argmin() takes the argument that minimizes the function, r is the single-step reward value in the sample data, γ is the attenuation coefficient in the reinforcement learning method, s_t is the state value of the sample data at time t, a_t is the action value obtained by inputting s_t into the action network, s_{t+1} is the state value of the sample data at time t+1, a_{t+1} is the action value obtained by inputting s_{t+1} into the action network, Q_R is the reward network, max_{a_{t+1}} Q_R(s_{t+1}, a_{t+1}) is the maximum value of the reward network Q_R over different a_{t+1}, and the expectation E[·] is taken over the sample data;
the first updating unit is specifically configured to calculate a second optimized network parameter of the risk network; updating the network parameters of the risk network according to the second optimized network parameters;
wherein the network parameters of the risk network are calculated in the following way:
φ_C = argmin E[(c + γ × E_{a_{t+1}~A_π}[Q_C(s_{t+1}, a_{t+1})] - Q_C(s_t, a_t))^2]
wherein φ_C is the second optimized network parameter of the risk network, argmin() takes the argument that minimizes the function, c is the single-step risk value in the sample data, γ is the attenuation coefficient in the reinforcement learning method, s_t is the state value of the sample data at time t, a_t is the action value obtained by inputting s_t into the action network, s_{t+1} is the state value of the sample data at time t+1, a_{t+1} is the action value obtained by inputting s_{t+1} into the action network, Q_C is the risk network, E_{a_{t+1}~A_π}[Q_C(s_{t+1}, a_{t+1})] is the expected value of the risk network Q_C over different a_{t+1}, the expectation E[·] is taken over the sample data, and A_π is the action network.
With reference to the second aspect, in a second embodiment of the second aspect, the processing unit is specifically configured to calculate, as the distribution similarity, a distance between the distribution of the sample data and the distribution of the action network based on a reverse relative entropy distance algorithm.
With reference to the second aspect or the first or second embodiment of the second aspect, in a third embodiment of the second aspect, the second updating unit is specifically configured to calculate a third optimized network parameter of the action network; updating the action network by the third optimized network parameter.
Wherein the third optimized network parameter of the action network is calculated by the following calculation method:
φ_π = argmax[Q_R - τ×Q_C + L];
τ = argmin|Q - D|;
wherein φ_π is the third optimized network parameter of the action network, Q_R is the reward network, Q_C is the risk network, L is the distribution similarity, τ is a Lagrangian coefficient, and D is a preset threshold for the risk network.
In a third aspect, an embodiment of the present invention provides an offline reinforcement learning network training system, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
a processor, configured to implement any of the offline reinforcement learning network training methods of the first aspect when executing a program stored in the memory.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing one or more programs, which are executable by one or more processors to implement the offline reinforcement learning network training method according to any one of the first aspects.
Compared with the prior art, the technical scheme of the invention has the following advantages: in the embodiment of the invention, the reward network and the risk network are updated according to the sample data and the action network so that they adapt to the action network; the distribution similarity between the distribution of the sample data and the distribution of the output of the action network is obtained; the action network is updated based on the evaluation of the action network by the reward network and the risk network and on the distribution similarity; and the optimization of the action network is completed after the above steps have been repeated a preset number of times.
Drawings
Fig. 1 is a schematic flowchart of an offline reinforcement learning network training method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating an offline reinforcement learning network training method according to another embodiment of the present invention;
fig. 3 is a flowchart illustrating a method for training an offline reinforcement learning network according to another embodiment of the present invention;
fig. 4 is a flowchart illustrating a second method for training an offline reinforcement learning network according to another embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an offline reinforcement learning network training apparatus according to another embodiment of the present invention;
fig. 6 is a schematic structural diagram of an offline reinforcement learning network training system according to yet another embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
As shown in fig. 1, an embodiment of the present invention provides an offline reinforcement learning network training method. Referring to fig. 1, the method includes the steps of:
and S11, acquiring sample data.
In this embodiment, sample data may be obtained from offline data, and a corresponding neural network model, namely the action network in the present application, is trained on the sample data. In the traditional actor-critic (Actor-Critic) framework, the algorithm is divided into two parts. The predecessor of the Actor is the policy gradient algorithm, which can easily select a suitable action in a continuous action space, a setting in which value-based Q-learning struggles because the action space is too large; however, because the Actor is updated on a per-episode basis, its learning efficiency is slow. A single-step update can be realized by using a value-based algorithm as the Critic, so the two algorithms complement each other to form the Actor-Critic model. The sample data in the present application may be the initial data of any reinforcement learning network, for example environment interaction data in the fields of autonomous vehicles, games and robots, and the present invention is not limited thereto.
Reinforcement learning (RL) is a field of machine learning that emphasizes how to select an optimal action strategy based on the state of the environment so as to maximize the expected return. A reinforcement learning task corresponds to a quadruple consisting of a state space, an action space, a transfer function and a reward function: the state space is a representation in which each state is a description of the environment perceived by the system; the actions the system can take constitute the action space; if an action acts on the current state, the underlying transfer function T (transition probability) causes the environment to transfer, with some probability, from the current state to another state; and while transitioning to the other state, the environment feeds back to the system a reward based on the underlying reward function.
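As a minimal illustration of how such pre-collected interaction data can be organized for the training method described below, the sketch stores offline transitions in a fixed buffer and samples mini-batches from it; the class and field names are illustrative assumptions, not terms used by the invention.

```python
import random
from dataclasses import dataclass
from typing import List

@dataclass
class Transition:
    """One step of pre-collected environment interaction (illustrative layout)."""
    state: list        # s_t
    action: list       # a_t
    reward: float      # single-step reward value r
    risk: float        # single-step risk value c
    next_state: list   # s_{t+1}

class OfflineDataset:
    """A fixed buffer of logged transitions; no new environment interaction occurs."""
    def __init__(self, transitions: List[Transition]):
        self.transitions = list(transitions)

    def sample(self, batch_size: int) -> List[Transition]:
        # Uniformly sample a mini-batch of transitions from the offline data.
        return random.sample(self.transitions, batch_size)
```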
And S12, updating the network parameters of the reward network and the risk network of the action network according to the sample data and the current action network.
In this embodiment, under the Actor-Critic framework, a corresponding output result is obtained after the sample data is input into the action network; however, the action network may not be optimal from the beginning. For the output result and the real result in the sample data, the reward network and the risk network of the action network generate a reward value and a risk value, and from these two values the difference between the distribution of the action network and the distribution of the sample data can be judged indirectly. For example, the action network produces an output result for the sample data, and the reward network and the risk network give a rating value and a risk value for that action: the more the output result is consistent with the real result in the sample data, the greater the rating value should be and the smaller the risk value should be. If, in the actual situation, the reward network and the risk network do not give correct reward and risk values, they can be updated at this time so that they can accurately evaluate the action of the current action network.
In this embodiment, the reward network and the risk network may be updated according to the result obtained after each piece of sample data is input into the action network, so that the reward value output by the reward network is maximized while the risk value output by the risk network stays within a preset threshold; or the reward network and the risk network corresponding to the action network may be updated according to the difference between the output result obtained after the sample data is input into the action network and the real result in the sample data.
and S13, acquiring the distribution similarity of the sample data and the action network.
In this embodiment, the offline reinforcement learning algorithm requires the action network and the sample data to have similar distributions. The similarity between the distribution of the sample data and the distribution of the action network may be calculated as a cosine value: specifically, the two distributions may be converted into corresponding vectors, and the cosine of the angle between these vectors is taken as the distribution similarity. Alternatively, the relative-entropy distance between the sample data and the action network may be calculated as the distribution similarity.
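One possible sketch of such a similarity term is given below. It assumes a Gaussian action network and a Gaussian behavior model fitted to the sample data, both of which are illustrative assumptions; the description only states that a relative-entropy-style distance is used, not how it is estimated.

```python
import torch
from torch.distributions import Normal, kl_divergence

def distribution_similarity(policy_mean, policy_std, behavior_mean, behavior_std):
    """Negative reverse KL divergence between the action network's Gaussian action
    distribution and a Gaussian behavior model fitted to the sample data.
    Larger values mean the action network stays closer to the data distribution."""
    policy = Normal(policy_mean, policy_std)
    behavior = Normal(behavior_mean, behavior_std)
    # KL(policy || behavior), summed over action dimensions and averaged over the batch.
    reverse_kl = kl_divergence(policy, behavior).sum(dim=-1)
    return -reverse_kl.mean()
```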
And S14, updating the action network based on the reward network, the risk network and the distribution similarity, and acquiring the updating times of the action network.
In this embodiment, the action network is updated according to the reward network, the risk network and the distribution similarity. Since the goal is for the reward value of the reward network to be maximized while the risk value of the risk network is kept small, and for the distribution of the action network to end up as similar as possible to the distribution of the sample data, this step can update the action network by determining whether the reward value of the reward network, the risk value of the risk network and the distribution similarity all satisfy the corresponding preset conditions.
In this embodiment, the updated action network should keep the risk value output by the risk network within the preset risk threshold while maximizing the reward value output by the reward network and maximizing the distribution similarity; or it should simultaneously minimize the risk value output by the risk network, maximize the reward value output by the reward network and maximize the distribution similarity.
And S15, when the updating times are less than or equal to the preset threshold value, updating the action network again until the updating times are more than the preset threshold value.
In this embodiment, the training of the action network is completed by executing the above steps a preset number of times. The requirement on the strategy used to collect the data is loose, the method is robust, and it fits actual application scenarios.
In this embodiment, step S13 specifically includes: calculating the distance between the distribution of the sample data and the distribution of the action network based on a reverse relative entropy (reverse KL divergence) distance algorithm, to serve as the distribution similarity.
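Putting steps S11 to S15 together, a minimal sketch of the outer training loop might look as follows; the callable arguments stand in for the per-network update rules detailed under Figs. 2 to 4 and for the distribution-similarity computation above, and all names are hypothetical rather than taken from the patent.

```python
def train_offline(dataset, reward_critic_step, risk_critic_step,
                  similarity_fn, actor_step, num_updates=100000, batch_size=256):
    """Outer loop of the training method (steps S11-S15), written as a sketch."""
    for _ in range(num_updates):            # S15: repeat a preset number of times
        batch = dataset.sample(batch_size)  # S11: acquire sample data
        reward_critic_step(batch)           # S12: update the reward network
        risk_critic_step(batch)             # S12: update the risk network
        similarity = similarity_fn(batch)   # S13: distribution similarity
        actor_step(batch, similarity)       # S14: update the action network
```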
As shown in fig. 2, an embodiment of the present invention provides an offline reinforcement learning network training method. Compared with the training method shown in fig. 1, the difference is that the reward network parameters of the action network are updated according to the sample data and the current action network, and the method specifically comprises the following steps:
S21, calculating the first optimized network parameter of the reward network in the following way:
φ_R = argmin E[(r + γ × max_{a_{t+1}} Q_R(s_{t+1}, a_{t+1}) - Q_R(s_t, a_t))^2]
wherein φ_R is the first optimized network parameter of the reward network, argmin() takes the argument that minimizes the function, r is the single-step reward value in the sample data, γ is the attenuation coefficient in the reinforcement learning method, s_t is the state value of the sample data at time t, a_t is the action value obtained by inputting s_t into the action network, s_{t+1} is the state value of the sample data at time t+1, a_{t+1} is the action value obtained by inputting s_{t+1} into the action network, Q_R is the reward network, max_{a_{t+1}} Q_R(s_{t+1}, a_{t+1}) is the maximum value of the reward network Q_R over different a_{t+1}, and the expectation E[·] is taken over the sample data;
and S22, updating the network parameters of the bonus network according to the first optimized network parameters.
In this embodiment, the parameters of the reward network are updated through the above formula so that the reward network conforms to the sample data and the action network; that is, the reward value output by the reward network can be used to evaluate the output result that the action network produces for the input sample data.
In this embodiment, the single-step reward value r is a value that can be obtained in the Actor-Critic framework, as described in the above embodiments, and is not repeated here. Since the action value a_{t+1} is obtained by inputting the state value s_{t+1} of the sample data into the action network, and there is at least one group of sample data, the maximum value of the reward network Q_R over different a_{t+1} in the above formula can be obtained.
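A minimal sketch of this reward-network update in PyTorch is shown below. Approximating the maximum over a_{t+1} by sampling several candidate actions from the current (stochastic) action network is an implementation assumption made here for illustration; the text does not specify how the maximum is computed for continuous actions, and the batch keys are likewise hypothetical.

```python
import torch
import torch.nn.functional as F

def reward_critic_update(q_r, q_r_optimizer, actor, batch, gamma=0.99, num_samples=10):
    """One gradient step on the reward network Q_R (sketch of the update in Fig. 2)."""
    s, a = batch["state"], batch["action"]
    r, s_next = batch["reward"], batch["next_state"]
    with torch.no_grad():
        # Approximate max over a_{t+1} with several candidate actions from the actor.
        q_next = torch.stack(
            [q_r(s_next, actor(s_next)) for _ in range(num_samples)], dim=0
        )
        target = r + gamma * q_next.max(dim=0).values  # r + gamma * max_{a_{t+1}} Q_R
    td_loss = F.mse_loss(q_r(s, a), target)            # squared Bellman error
    q_r_optimizer.zero_grad()
    td_loss.backward()
    q_r_optimizer.step()
    return td_loss.item()
```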
As shown in fig. 3, an embodiment of the present invention provides an offline reinforcement learning network training method. Compared with the training method shown in fig. 1, the difference is that the risk network parameters of the action network are updated according to the sample data and the current action network, and the method specifically includes the following steps:
S31, calculating the second optimized network parameter of the risk network in the following way:
φ_C = argmin E[(c + γ × E_{a_{t+1}~A_π}[Q_C(s_{t+1}, a_{t+1})] - Q_C(s_t, a_t))^2]
wherein φ_C is the second optimized network parameter of the risk network, argmin() takes the argument that minimizes the function, c is the single-step risk value in the sample data, γ is the attenuation coefficient in the reinforcement learning method, s_t is the state value of the sample data at time t, a_t is the action value obtained by inputting s_t into the action network, s_{t+1} is the state value of the sample data at time t+1, a_{t+1} is the action value obtained by inputting s_{t+1} into the action network, Q_C is the risk network, E_{a_{t+1}~A_π}[Q_C(s_{t+1}, a_{t+1})] is the expected value of the risk network Q_C over different a_{t+1}, the expectation E[·] is taken over the sample data, and A_π is the action network.
And S32, updating the network parameters of the risk network according to the second optimized network parameters.
In this embodiment, each parameter in the present scheme is similar to the parameters in the above embodiments, and details are not repeated in this step.
In this embodiment, when updating the risk network, the expected value of the risk network at the next moment is used. This is done for the following reasons: 1) the definitions of the cumulative risk and the cumulative reward are different; here the goal is to maximize the cumulative reward while keeping the cumulative risk below a given threshold, rather than to minimize the cumulative risk; 2) compared with the online learning situation, it is difficult to evaluate the cumulative risk value in offline learning, so the state-action pair (s, a) with the smallest cumulative risk value is the most likely to lie outside the data distribution, and evaluating the cumulative risk value of such a pair introduces a large error. Using the expectation form therefore greatly mitigates the problem of inaccurate estimates of the cumulative risk value.
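A corresponding sketch of the risk-network update is shown below; it differs from the reward-network sketch only in that the bootstrap target averages Q_C over next actions sampled from the action network instead of taking a maximum. Batch keys and hyperparameters are again illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def risk_critic_update(q_c, q_c_optimizer, actor, batch, gamma=0.99, num_samples=10):
    """One gradient step on the risk network Q_C (sketch of the update in Fig. 3)."""
    s, a = batch["state"], batch["action"]
    c, s_next = batch["risk"], batch["next_state"]
    with torch.no_grad():
        # E_{a_{t+1} ~ A_pi}[Q_C(s_{t+1}, a_{t+1})], estimated by sampling the actor.
        samples = torch.stack(
            [q_c(s_next, actor(s_next)) for _ in range(num_samples)], dim=0
        )
        target = c + gamma * samples.mean(dim=0)
    td_loss = F.mse_loss(q_c(s, a), target)   # squared Bellman error
    q_c_optimizer.zero_grad()
    td_loss.backward()
    q_c_optimizer.step()
    return td_loss.item()
```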
As shown in fig. 4, an embodiment of the present invention provides an offline reinforcement learning network training method. Compared with the training method shown in fig. 1, the difference is that the action network is updated based on the reward network, the risk network and the distribution similarity, and the method comprises the following steps:
S41, obtaining a third optimized network parameter of the action network in the following way:
φ_π = argmax[Q_R - τ×Q_C + L];
τ = argmin|Q - D|;
wherein φ_π is the third optimized network parameter of the action network, Q_R is the reward network, Q_C is the risk network, L is the distribution similarity, τ is a Lagrangian coefficient, and D is a preset threshold for the risk network;
and S42, updating the action network through the third optimized network parameter.
In this embodiment, the third optimized network parameter is obtained through calculation, and the network parameters of the action network are updated with it, thereby completing one optimization pass of the action network. A Lagrangian coefficient is introduced into the formula, the constrained problem is converted into an unconstrained optimization problem through the Lagrangian relaxation method, and a stochastic gradient descent algorithm is used to update the policy and the Lagrangian coefficient.
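The sketch below illustrates one way such a joint update of the action network and the Lagrangian coefficient could be implemented. Keeping τ positive through a log parameterization and updating it by dual ascent on the gap between the estimated cumulative risk and the threshold D are implementation choices assumed here for illustration; the text itself only gives the argmax/argmin form above.

```python
import torch

def actor_update(actor, actor_optimizer, log_tau, tau_optimizer,
                 q_r, q_c, batch, similarity, risk_threshold_d):
    """One gradient step on the action network and the Lagrangian coefficient tau
    (sketch of the update in Fig. 4). `similarity` is assumed to be a
    differentiable tensor computed from the current action network."""
    s = batch["state"]
    tau = log_tau.exp()                      # keep tau positive
    a_pi = actor(s)
    # Maximize Q_R - tau * Q_C + L by minimizing its negation.
    actor_loss = -(q_r(s, a_pi) - tau.detach() * q_c(s, a_pi)).mean() - similarity
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()

    # Dual-ascent step: raise tau when the estimated cumulative risk exceeds the
    # threshold D, lower it otherwise.
    with torch.no_grad():
        risk_gap = q_c(s, actor(s)).mean() - risk_threshold_d
    tau_loss = -(log_tau.exp() * risk_gap)
    tau_optimizer.zero_grad()
    tau_loss.backward()
    tau_optimizer.step()
    return actor_loss.item(), log_tau.exp().item()
```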
As shown in fig. 5, an embodiment of the present invention provides an offline reinforcement learning network training apparatus, and the apparatus includes: an acquisition unit 11, a first updating unit 12, a processing unit 13 and a second updating unit 14.
In this embodiment, the obtaining unit 11 is configured to obtain sample data;
in this embodiment, the first updating unit 12 is configured to update network parameters of a reward network and a risk network of an action network according to sample data and a current action network;
in this embodiment, the processing unit 13 is configured to obtain a distribution similarity between a distribution of sample data and a distribution of an action network;
in this embodiment, the second updating unit 14 is configured to update the action network based on the reward network, the risk network, and the distribution similarity, and obtain the update times of the action network; when the update frequency is less than or equal to the preset threshold, the first obtaining unit 11 obtains the sample data again, and updates the action network again until the update frequency is greater than the preset threshold.
In this embodiment, the first updating unit 12 is specifically configured to calculate the first optimized network parameter of the reward network and to update the network parameters of the reward network according to the first optimized network parameter;
wherein the network parameters of the reward network are calculated in the following way:
φ_R = argmin E[(r + γ × max_{a_{t+1}} Q_R(s_{t+1}, a_{t+1}) - Q_R(s_t, a_t))^2]
wherein φ_R is the first optimized network parameter of the reward network, argmin() takes the argument that minimizes the function, r is the single-step reward value in the sample data, γ is the attenuation coefficient in the reinforcement learning method, s_t is the state value of the sample data at time t, a_t is the action value obtained by inputting s_t into the action network, s_{t+1} is the state value of the sample data at time t+1, a_{t+1} is the action value obtained by inputting s_{t+1} into the action network, Q_R is the reward network, max_{a_{t+1}} Q_R(s_{t+1}, a_{t+1}) is the maximum value of the reward network Q_R over different a_{t+1}, and the expectation E[·] is taken over the sample data;
a first updating unit 12, specifically configured to calculate a second optimized network parameter of the risk network; updating the network parameters of the risk network according to the second optimized network parameters;
wherein the network parameters of the risk network are calculated in the following way:
φ_C = argmin E[(c + γ × E_{a_{t+1}~A_π}[Q_C(s_{t+1}, a_{t+1})] - Q_C(s_t, a_t))^2]
wherein φ_C is the second optimized network parameter of the risk network, argmin() takes the argument that minimizes the function, c is the single-step risk value in the sample data, γ is the attenuation coefficient in the reinforcement learning method, s_t is the state value of the sample data at time t, a_t is the action value obtained by inputting s_t into the action network, s_{t+1} is the state value of the sample data at time t+1, a_{t+1} is the action value obtained by inputting s_{t+1} into the action network, Q_C is the risk network, E_{a_{t+1}~A_π}[Q_C(s_{t+1}, a_{t+1})] is the expected value of the risk network Q_C over different a_{t+1}, the expectation E[·] is taken over the sample data, and A_π is the action network.
In this embodiment, the processing unit 13 is specifically configured to calculate, as the distribution similarity, the distance between the distribution of the sample data and the distribution of the action network based on a reverse relative entropy distance algorithm.
In this embodiment, the second updating unit 14 is specifically configured to calculate a third optimized network parameter of the action network; and updating the action network through the third optimized network parameter.
Wherein the third optimized network parameter of the action network is calculated by the following calculation mode:
φ_π = argmax[Q_R - τ×Q_C + L];
τ = argmin|Q - D|;
wherein φ_π is the third optimized network parameter of the action network, Q_R is the reward network, Q_C is the risk network, L is the distribution similarity, τ is a Lagrangian coefficient, and D is a preset threshold for the risk network.
As shown in fig. 6, an embodiment of the present invention provides an offline reinforcement learning network training system, which includes a processor 1110, a communication interface 1120, a memory 1130, and a communication bus 1140, wherein the processor 1110, the communication interface 1120, and the memory 1130 complete communication with each other through the communication bus 1140;
a memory 1130 for storing computer programs;
the processor 1110, when executing the program stored in the memory 1130, implements the offline reinforcement learning network training method as follows:
acquiring sample data;
updating network parameters of a reward network and a risk network of the action network according to the sample data and the current action network;
acquiring the distribution similarity of the distribution of the sample data and the distribution of the action network;
updating the action network based on the reward network, the risk network and the distribution similarity, and acquiring the updating times of the action network;
and when the updating times are less than or equal to the preset threshold, updating the action network again until the updating times are greater than the preset threshold.
In the electronic device provided by the embodiment of the present invention, by executing the program stored in the memory 1130, the processor 1110 updates the corresponding reward network and risk network according to the sample data and the action network so that they adapt to the action network, obtains the distribution similarity between the distribution of the sample data and the distribution of the output of the action network, updates the action network based on the evaluation of the action network by the reward network and the risk network and on the distribution similarity, and completes the optimization of the action network after the above steps have been repeated a preset number of times.
The communication bus 1140 mentioned in the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 1140 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface 1120 is used for communication between the electronic device and other devices.
The memory 1130 may include a Random Access Memory (RAM), and may also include a non-volatile memory, such as at least one disk memory. Optionally, the memory 1130 may also be at least one memory device located remotely from the processor 1110.
The processor 1110 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The embodiment of the present invention provides a computer-readable storage medium, where one or more programs are stored, and the one or more programs are executable by one or more processors to implement the offline reinforcement learning network training method according to any of the above embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the invention are brought about in whole or in part when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (ssd)), among others.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. An offline reinforcement learning network training method, the method comprising:
acquiring sample data;
updating network parameters of a reward network and a risk network of the action network according to the sample data and the current action network;
obtaining the distribution similarity of the distribution of the sample data and the distribution of the action network;
updating the action network based on the reward network, the risk network and the distribution similarity, and acquiring the updating times of the action network;
and when the updating times are less than or equal to a preset threshold value, updating the action network again until the updating times are greater than the preset threshold value.
2. The training method according to claim 1, wherein the updating network parameters of a reward network and a risk network of the action network according to the sample data and a current action network comprises:
calculating a first optimized network parameter for the reward network in the following way:
φ_R = argmin E[(r + γ × max_{a_{t+1}} Q_R(s_{t+1}, a_{t+1}) - Q_R(s_t, a_t))^2]
wherein φ_R is the first optimized network parameter of the reward network, argmin() takes the argument that minimizes the function, r is the single-step reward value in the sample data, γ is the attenuation coefficient in the reinforcement learning method, s_t is the state value of the sample data at time t, a_t is the action value obtained by inputting s_t into the action network, s_{t+1} is the state value of the sample data at time t+1, a_{t+1} is the action value obtained by inputting s_{t+1} into the action network, Q_R is the reward network, max_{a_{t+1}} Q_R(s_{t+1}, a_{t+1}) is the maximum value of the reward network Q_R over different a_{t+1}, and the expectation E[·] is taken over the sample data;
updating the network parameters of the reward network according to the first optimized network parameters;
calculating a second optimized network parameter for the risk network in the following way:
φ_C = argmin E[(c + γ × E_{a_{t+1}~A_π}[Q_C(s_{t+1}, a_{t+1})] - Q_C(s_t, a_t))^2]
wherein φ_C is the second optimized network parameter of the risk network, argmin() takes the argument that minimizes the function, c is the single-step risk value in the sample data, γ is the attenuation coefficient in the reinforcement learning method, s_t is the state value of the sample data at time t, a_t is the action value obtained by inputting s_t into the action network, s_{t+1} is the state value of the sample data at time t+1, a_{t+1} is the action value obtained by inputting s_{t+1} into the action network, Q_C is the risk network, E_{a_{t+1}~A_π}[Q_C(s_{t+1}, a_{t+1})] is the expected value of the risk network Q_C over different a_{t+1}, the expectation E[·] is taken over the sample data, and A_π is the action network;
and updating the network parameters of the risk network according to the second optimized network parameters.
3. The training method according to claim 1, wherein the obtaining of the similarity between the distribution of the sample data and the distribution of the action network comprises:
and calculating the distance between the distribution of the sample data and the distribution of the action network based on a reverse relative entropy distance algorithm, to serve as the distribution similarity.
4. A training method as claimed in any one of claims 1 to 3, wherein the updating the action network based on the reward network, risk network and distribution similarity comprises:
obtaining a third optimized network parameter of the action network by the following calculation mode:
φ_π = argmax[Q_R - τ×Q_C + L];
τ = argmin|Q - D|;
wherein φ_π is the third optimized network parameter of the action network, Q_R is the reward network, Q_C is the risk network, L is the distribution similarity, τ is a Lagrangian coefficient, and D is a preset threshold for the risk network;
updating the action network by the third optimized network parameter.
5. An offline reinforcement learning network training apparatus, the apparatus comprising:
an acquisition unit configured to acquire sample data;
the first updating unit is used for updating network parameters of a reward network and a risk network of the action network according to the sample data and the current action network;
the processing unit is used for acquiring the distribution similarity of the distribution of the sample data and the distribution of the action network;
the second updating unit is used for updating the action network based on the reward network, the risk network and the distribution similarity and acquiring the updating times of the action network; and when the updating times are less than or equal to a preset threshold value, obtaining the sample data again through the first obtaining unit, and updating the action network again until the updating times are more than the preset threshold value.
6. Training device according to claim 5, wherein the first updating unit is specifically configured to calculate a first optimized network parameter of the reward network: updating the network parameters of the reward network according to the first optimized network parameters;
wherein the network parameters of the reward network are calculated in the following way:
φ_R = argmin E[(r + γ × max_{a_{t+1}} Q_R(s_{t+1}, a_{t+1}) - Q_R(s_t, a_t))^2]
wherein φ_R is the first optimized network parameter of the reward network, argmin() takes the argument that minimizes the function, r is the single-step reward value in the sample data, γ is the attenuation coefficient in the reinforcement learning method, s_t is the state value of the sample data at time t, a_t is the action value obtained by inputting s_t into the action network, s_{t+1} is the state value of the sample data at time t+1, a_{t+1} is the action value obtained by inputting s_{t+1} into the action network, Q_R is the reward network, max_{a_{t+1}} Q_R(s_{t+1}, a_{t+1}) is the maximum value of the reward network Q_R over different a_{t+1}, and the expectation E[·] is taken over the sample data;
the first updating unit is specifically configured to calculate a second optimized network parameter of the risk network; updating the network parameters of the risk network according to the second optimized network parameters;
wherein the network parameters of the risk network are calculated by the following calculation method:
φ_C = argmin E[(c + γ × E_{a_{t+1}~A_π}[Q_C(s_{t+1}, a_{t+1})] - Q_C(s_t, a_t))^2]
wherein φ_C is the second optimized network parameter of the risk network, argmin() takes the argument that minimizes the function, c is the single-step risk value in the sample data, γ is the attenuation coefficient in the reinforcement learning method, s_t is the state value of the sample data at time t, a_t is the action value obtained by inputting s_t into the action network, s_{t+1} is the state value of the sample data at time t+1, a_{t+1} is the action value obtained by inputting s_{t+1} into the action network, Q_C is the risk network, E_{a_{t+1}~A_π}[Q_C(s_{t+1}, a_{t+1})] is the expected value of the risk network Q_C over different a_{t+1}, the expectation E[·] is taken over the sample data, and A_π is the action network.
7. The training apparatus according to claim 5, wherein the processing unit is specifically configured to calculate, as the distribution similarity, a distance between the distribution of the sample data and the distribution of the action network based on a reverse relative entropy distance algorithm.
8. A training apparatus as claimed in any one of claims 5 to 7, wherein the second updating unit is specifically configured to calculate a third optimized network parameter of the action network; updating the action network through the third optimized network parameter;
wherein the third optimized network parameter of the action network is calculated by the following calculation method:
φ_π = argmax[Q_R - τ×Q_C + L];
τ = argmin|Q - D|;
wherein φ_π is the third optimized network parameter of the action network, Q_R is the reward network, Q_C is the risk network, L is the distribution similarity, τ is a Lagrangian coefficient, and D is a preset threshold for the risk network.
9. An offline reinforcement learning network training system is characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the offline reinforcement learning network training method according to any one of claims 1 to 4 when executing a program stored in the memory.
10. A computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the offline reinforcement learning network training method of any one of claims 1 to 4.
CN202010479469.3A 2020-05-29 2020-05-29 Offline reinforcement learning network training method, device, system and storage medium Pending CN111652371A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010479469.3A CN111652371A (en) 2020-05-29 2020-05-29 Offline reinforcement learning network training method, device, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010479469.3A CN111652371A (en) 2020-05-29 2020-05-29 Offline reinforcement learning network training method, device, system and storage medium

Publications (1)

Publication Number Publication Date
CN111652371A true CN111652371A (en) 2020-09-11

Family

ID=72348144

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010479469.3A Pending CN111652371A (en) 2020-05-29 2020-05-29 Offline reinforcement learning network training method, device, system and storage medium

Country Status (1)

Country Link
CN (1) CN111652371A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112668235A (en) * 2020-12-07 2021-04-16 中原工学院 Robot control method of DDPG algorithm based on offline model pre-training learning
CN112668235B (en) * 2020-12-07 2022-12-09 中原工学院 Robot control method based on off-line model pre-training learning DDPG algorithm
CN113360618A (en) * 2021-06-07 2021-09-07 暨南大学 Intelligent robot dialogue method and system based on offline reinforcement learning
CN113360618B (en) * 2021-06-07 2022-03-11 暨南大学 Intelligent robot dialogue method and system based on offline reinforcement learning
CN114484584A (en) * 2022-01-20 2022-05-13 国电投峰和新能源科技(河北)有限公司 Heat supply control method and system based on offline reinforcement learning
CN114484584B (en) * 2022-01-20 2022-11-11 国电投峰和新能源科技(河北)有限公司 Heat supply control method and system based on offline reinforcement learning
CN116679615A (en) * 2023-08-03 2023-09-01 中科航迈数控软件(深圳)有限公司 Optimization method and device of numerical control machining process, terminal equipment and storage medium
CN116679615B (en) * 2023-08-03 2023-10-20 中科航迈数控软件(深圳)有限公司 Optimization method and device of numerical control machining process, terminal equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111652371A (en) Offline reinforcement learning network training method, device, system and storage medium
KR20190028531A (en) Training machine learning models for multiple machine learning tasks
EP3568810B1 (en) Action selection for reinforcement learning using neural networks
CN114787832A (en) Method and server for federal machine learning
WO2017091629A1 (en) Reinforcement learning using confidence scores
US20070260563A1 (en) Method to continuously diagnose and model changes of real-valued streaming variables
CN110770764A (en) Method and device for optimizing hyper-parameters
EP3571631A1 (en) Noisy neural network layers
Rothfuss et al. Meta-learning priors for safe bayesian optimization
WO2020030052A1 (en) Animal count identification method, device, medium, and electronic apparatus
WO2021077097A1 (en) Systems and methods for training generative models using summary statistics and other constraints
CN109190757B (en) Task processing method, device, equipment and computer readable storage medium
CN115296984A (en) Method, device, equipment and storage medium for detecting abnormal network nodes
CN113158550B (en) Method and device for federated learning, electronic equipment and storage medium
US20210166131A1 (en) Training spectral inference neural networks using bilevel optimization
US20220148290A1 (en) Method, device and computer storage medium for data analysis
CN110399279B (en) Intelligent measurement method for non-human intelligent agent
CN111353597B (en) Target detection neural network training method and device
CN113505859B (en) Model training method and device, and image recognition method and device
US11501207B2 (en) Lifelong learning with a changing action set
US11710301B2 (en) Apparatus for Q-learning for continuous actions with cross-entropy guided policies and method thereof
EP3745313A1 (en) A predictive maintenance system for equipment with sparse sensor measurements
CN112101563A (en) Confidence domain strategy optimization method and device based on posterior experience and related equipment
CN111368792A (en) Characteristic point mark injection molding type training method and device, electronic equipment and storage medium
CN114844889B (en) Video processing model updating method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200911)