CN111652371A - Offline reinforcement learning network training method, device, system and storage medium - Google Patents

Offline reinforcement learning network training method, device, system and storage medium

Info

Publication number
CN111652371A
Authority
CN
China
Prior art keywords
network
action
risk
sample data
updating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010479469.3A
Other languages
Chinese (zh)
Inventor
詹仙园
徐浩然
张玥
霍雨森
朱翔宇
李春洋
邓欣
郑宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jingdong City Beijing Digital Technology Co Ltd
Original Assignee
Jingdong City Beijing Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jingdong City Beijing Digital Technology Co Ltd
Priority to CN202010479469.3A
Publication of CN111652371A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning

Abstract

The invention relates to an offline reinforcement learning network training method, device, system and storage medium. The method comprises the following steps: updating network parameters of a reward network and a risk network of an action network according to sample data and the current action network; acquiring the distribution similarity between the distribution of the sample data and the distribution of the action network; and updating the action network based on the reward network, the risk network and the distribution similarity. In the embodiment of the invention, the reward network and the risk network are updated according to the sample data and the action network so that they adapt to the action network; the distribution similarity between the distribution of the sample data and the distribution of the output of the action network is obtained; the action network is then updated based on the evaluation of the action network by the reward network and the risk network and on the distribution similarity; and the optimization of the action network is completed after the above steps have been repeated a preset number of times.

Description

Offline reinforcement learning network training method, device, system and storage medium
Technical Field
The invention relates to the technical field of big data processing, in particular to an offline reinforcement learning network training method, device and system and a storage medium.
Background
Most Reinforcement Learning (RL) algorithms, for example in the gaming and robotics domains, learn good strategies only after a large number of trial-and-error attempts in a simulation environment. In real-world scenarios (e.g. autonomous driving, complex industrial system control), however, no perfect simulated environment is available; only a collection of pre-collected environment interaction data exists, which may also include some unsafe attempts.
Therefore, how to train, from such offline data, a strategy that maximizes long-term rewards while satisfying safety constraints is an urgent problem to be solved.
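The problem can be stated compactly as a constrained offline reinforcement learning objective; the following formulation is given for orientation and uses the notation introduced later in this description (r is the single-step reward, c is the single-step risk, γ is the attenuation coefficient and D is the preset risk threshold), rather than being quoted from the original text:

maximize E[Σ_t γ^t × r(s_t, a_t)]  subject to  E[Σ_t γ^t × c(s_t, a_t)] ≤ D,

where the maximization is over the action network and, because only pre-collected data is available, the learned action network is additionally required to stay close to the distribution of that data.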
Disclosure of Invention
In order to solve the problems in the prior art, at least one embodiment of the present invention provides an offline reinforcement learning network training method, apparatus, system and storage medium.
In a first aspect, an embodiment of the present invention provides an offline reinforcement learning network training method, where the method includes:
acquiring sample data;
updating network parameters of a reward network and a risk network of the action network according to the sample data and the current action network;
obtaining the distribution similarity of the distribution of the sample data and the distribution of the action network;
updating the action network based on the reward network, the risk network and the distribution similarity, and acquiring the updating times of the action network;
and when the updating times are less than or equal to a preset threshold value, updating the action network again until the updating times are greater than the preset threshold value.
Based on the above technical solutions, the embodiments of the present invention may be further improved as follows.
With reference to the first aspect, in a first embodiment of the first aspect, the updating network parameters of a reward network and a risk network of the action network according to the sample data and a current action network includes:
calculating a first optimized network parameter for the reward network in the following way:
φ_R = argmin E[(r + γ × max_{a_{t+1}} Q_R(s_{t+1}, a_{t+1}) - Q_R(s_t, a_t))^2]
wherein φ_R is the first optimized network parameter of the reward network, argmin() takes the argument that minimizes the function, r is the single-step reward value in the sample data, γ is the attenuation coefficient in the reinforcement learning method, s_t is the state value of the sample data at time t, a_t is the action value obtained by inputting s_t into the action network, s_{t+1} is the state value of the sample data at time t+1, a_{t+1} is the action value obtained by inputting s_{t+1} into the action network, Q_R is the reward network, max_{a_{t+1}} Q_R(s_{t+1}, a_{t+1}) is the maximum value of the reward network Q_R over different a_{t+1}, and the expectation E[·] is taken over the sample data;
updating the network parameters of the reward network according to the first optimized network parameters;
calculating a second optimized network parameter for the risk network in the following way:
φ_C = argmin E[(c + γ × E_{a_{t+1}~A_π}[Q_C(s_{t+1}, a_{t+1})] - Q_C(s_t, a_t))^2]
wherein φ_C is the second optimized network parameter of the risk network, argmin() takes the argument that minimizes the function, c is the single-step risk value in the sample data, γ is the attenuation coefficient in the reinforcement learning method, s_t is the state value of the sample data at time t, a_t is the action value obtained by inputting s_t into the action network, s_{t+1} is the state value of the sample data at time t+1, a_{t+1} is the action value obtained by inputting s_{t+1} into the action network, Q_C is the risk network, E_{a_{t+1}~A_π}[Q_C(s_{t+1}, a_{t+1})] is the expected value of the risk network Q_C over different a_{t+1}, A_π is the action network, and the expectation E[·] is taken over the sample data;
and updating the network parameters of the risk network according to the second optimized network parameters.
With reference to the first aspect, in a second embodiment of the first aspect, the obtaining the distribution similarity between the distribution of the sample data and the distribution of the action network includes:
and calculating the distance between the distribution of the sample data and the distribution of the action network based on a reverse relative entropy distance algorithm, to serve as the distribution similarity.
With reference to the first aspect or the first or second embodiment of the first aspect, in a third embodiment of the first aspect, the updating the action network based on the reward network, risk network, and distribution similarity includes:
obtaining a third optimized network parameter of the action network by the following calculation mode:
φ_π = argmax[Q_R - τ×Q_C + L];
τ = argmin|Q - D|;
wherein φ_π is the third optimized network parameter of the action network, Q_R is the reward network, Q_C is the risk network, L is the distribution similarity, τ is a Lagrangian coefficient, and D is a preset threshold for the risk network;
updating the action network by the third optimized network parameter.
In a second aspect, an embodiment of the present invention provides an offline reinforcement learning network training apparatus, where the apparatus includes:
an acquisition unit configured to acquire sample data;
the first updating unit is used for updating network parameters of a reward network and a risk network of the action network according to the sample data and the current action network;
the processing unit is used for acquiring the distribution similarity of the distribution of the sample data and the distribution of the action network;
the second updating unit is used for updating the action network based on the reward network, the risk network and the distribution similarity and acquiring the updating times of the action network; and when the updating times are less than or equal to a preset threshold value, obtaining the sample data again through the first obtaining unit, and updating the action network again until the updating times are more than the preset threshold value.
With reference to the second aspect, in a first embodiment of the second aspect, the first updating unit is specifically configured to calculate the first optimized network parameter of the reward network and to update the network parameters of the reward network according to the first optimized network parameter;
wherein the network parameters of the reward network are calculated in the following way:
φ_R = argmin E[(r + γ × max_{a_{t+1}} Q_R(s_{t+1}, a_{t+1}) - Q_R(s_t, a_t))^2]
wherein φ_R is the first optimized network parameter of the reward network, argmin() takes the argument that minimizes the function, r is the single-step reward value in the sample data, γ is the attenuation coefficient in the reinforcement learning method, s_t is the state value of the sample data at time t, a_t is the action value obtained by inputting s_t into the action network, s_{t+1} is the state value of the sample data at time t+1, a_{t+1} is the action value obtained by inputting s_{t+1} into the action network, Q_R is the reward network, max_{a_{t+1}} Q_R(s_{t+1}, a_{t+1}) is the maximum value of the reward network Q_R over different a_{t+1}, and the expectation E[·] is taken over the sample data;
the first updating unit is specifically configured to calculate a second optimized network parameter of the risk network; updating the network parameters of the risk network according to the second optimized network parameters;
wherein the network parameters of the risk network are calculated in the following way:
φ_C = argmin E[(c + γ × E_{a_{t+1}~A_π}[Q_C(s_{t+1}, a_{t+1})] - Q_C(s_t, a_t))^2]
wherein φ_C is the second optimized network parameter of the risk network, argmin() takes the argument that minimizes the function, c is the single-step risk value in the sample data, γ is the attenuation coefficient in the reinforcement learning method, s_t is the state value of the sample data at time t, a_t is the action value obtained by inputting s_t into the action network, s_{t+1} is the state value of the sample data at time t+1, a_{t+1} is the action value obtained by inputting s_{t+1} into the action network, Q_C is the risk network, E_{a_{t+1}~A_π}[Q_C(s_{t+1}, a_{t+1})] is the expected value of the risk network Q_C over different a_{t+1}, the expectation E[·] is taken over the sample data, and A_π is the action network.
With reference to the second aspect, in a second embodiment of the second aspect, the processing unit is specifically configured to calculate, as the distribution similarity, a distance between the distribution of the sample data and the distribution of the action network based on a reverse relative entropy distance algorithm.
With reference to the second aspect or the first or second embodiment of the second aspect, in a third embodiment of the second aspect, the second updating unit is specifically configured to calculate a third optimized network parameter of the action network; updating the action network by the third optimized network parameter.
Wherein the third optimized network parameter of the action network is calculated by the following calculation method:
φ_π = argmax[Q_R - τ×Q_C + L];
τ = argmin|Q - D|;
wherein φ_π is the third optimized network parameter of the action network, Q_R is the reward network, Q_C is the risk network, L is the distribution similarity, τ is a Lagrangian coefficient, and D is a preset threshold for the risk network.
In a third aspect, an embodiment of the present invention provides an offline reinforcement learning network training system, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
a processor, configured to implement any of the offline reinforcement learning network training methods of the first aspect when executing a program stored in the memory.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing one or more programs, which are executable by one or more processors to implement the offline reinforcement learning network training method according to any one of the first aspects.
Compared with the prior art, the technical scheme of the invention has the following advantages: in the embodiment of the invention, the reward network and the risk network are updated according to the sample data and the action network so that they adapt to the action network; the distribution similarity between the distribution of the sample data and the distribution of the output of the action network is obtained; the action network is updated based on the evaluation of the action network by the reward network and the risk network and on the distribution similarity; and the optimization of the action network is completed after the above steps have been repeated a preset number of times.
Drawings
Fig. 1 is a schematic flowchart of an offline reinforcement learning network training method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating an offline reinforcement learning network training method according to another embodiment of the present invention;
fig. 3 is a flowchart illustrating a method for training an offline reinforcement learning network according to another embodiment of the present invention;
fig. 4 is a flowchart illustrating a second method for training an offline reinforcement learning network according to another embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an offline reinforcement learning network training apparatus according to another embodiment of the present invention;
fig. 6 is a schematic structural diagram of an offline reinforcement learning network training system according to yet another embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
As shown in fig. 1, an embodiment of the present invention provides an offline reinforcement learning network training method. Referring to fig. 1, the method includes the steps of:
and S11, acquiring sample data.
In this embodiment, sample data may be obtained from offline data, and a corresponding neural network model, namely the action network in the present application, is trained on the sample data. In the traditional actor-critic (Actor-Critic) framework, the algorithm is divided into two parts. The predecessor of the Actor is the policy gradient algorithm, which can easily select a suitable action in a continuous action space, a setting in which value-based Q-learning struggles because the action space is too large; however, because the Actor is updated on a per-episode basis, its learning efficiency is slow. A single-step update can be realized by using a value-based algorithm as the Critic, so the two algorithms complement each other to form the Actor-Critic model. The sample data in the present application may be the initial data of any reinforcement learning network, for example environment interaction data in the fields of autonomous vehicles, games and robots, and the present invention is not limited thereto.
Reinforcement learning (RL) is a field of machine learning that emphasizes how to select an optimal action strategy based on the state of the environment so as to maximize the expected return. A reinforcement learning task corresponds to a quadruple consisting of a state space, an action space, a transfer function and a reward function: the state space is a representation in which each state is a description of the environment perceived by the system; the actions the system can take constitute the action space; if an action acts on the current state, the underlying transfer function T (transition probability) causes the environment to transfer, with some probability, from the current state to another state; and while transitioning to the other state, the environment feeds back to the system a reward based on the underlying reward function.
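As a minimal illustration of how such pre-collected interaction data can be organized for the training method described below, the sketch stores offline transitions in a fixed buffer and samples mini-batches from it; the class and field names are illustrative assumptions, not terms used by the invention.

```python
import random
from dataclasses import dataclass
from typing import List

@dataclass
class Transition:
    """One step of pre-collected environment interaction (illustrative layout)."""
    state: list        # s_t
    action: list       # a_t
    reward: float      # single-step reward value r
    risk: float        # single-step risk value c
    next_state: list   # s_{t+1}

class OfflineDataset:
    """A fixed buffer of logged transitions; no new environment interaction occurs."""
    def __init__(self, transitions: List[Transition]):
        self.transitions = list(transitions)

    def sample(self, batch_size: int) -> List[Transition]:
        # Uniformly sample a mini-batch of transitions from the offline data.
        return random.sample(self.transitions, batch_size)
```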
And S12, updating the network parameters of the reward network and the risk network of the action network according to the sample data and the current action network.
In this embodiment, under the Actor-Critic framework, a corresponding output result is obtained after the sample data is input into the action network; however, the action network may not be optimal from the beginning. For the output result and the real result in the sample data, the reward network and the risk network of the action network generate a reward value and a risk value, and from these two values the difference between the distribution of the action network and the distribution of the sample data can be judged indirectly. For example, the action network produces an output result for the sample data, and the reward network and the risk network give a rating value and a risk value for that action: the more the output result is consistent with the real result in the sample data, the greater the rating value should be and the smaller the risk value should be. If, in the actual situation, the reward network and the risk network do not give correct reward and risk values, they can be updated at this time so that they can accurately evaluate the action of the current action network.
In this embodiment, the reward network and the risk network may be updated according to the result obtained after each piece of sample data is input into the action network, so that the reward value output by the reward network is maximized while the risk value output by the risk network stays within a preset threshold; or the reward network and the risk network corresponding to the action network may be updated according to the difference between the output result obtained after the sample data is input into the action network and the real result in the sample data.
and S13, acquiring the distribution similarity of the sample data and the action network.
In this embodiment, the offline reinforcement learning algorithm requires the action network and the sample data to have similar distributions. The similarity between the distribution of the sample data and the distribution of the action network may be calculated as a cosine value: specifically, the two distributions may be converted into corresponding vectors, and the cosine of the angle between these vectors is taken as the distribution similarity. Alternatively, the relative-entropy distance between the sample data and the action network may be calculated as the distribution similarity.
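One possible sketch of such a similarity term is given below. It assumes a Gaussian action network and a Gaussian behavior model fitted to the sample data, both of which are illustrative assumptions; the description only states that a relative-entropy-style distance is used, not how it is estimated.

```python
import torch
from torch.distributions import Normal, kl_divergence

def distribution_similarity(policy_mean, policy_std, behavior_mean, behavior_std):
    """Negative reverse KL divergence between the action network's Gaussian action
    distribution and a Gaussian behavior model fitted to the sample data.
    Larger values mean the action network stays closer to the data distribution."""
    policy = Normal(policy_mean, policy_std)
    behavior = Normal(behavior_mean, behavior_std)
    # KL(policy || behavior), summed over action dimensions and averaged over the batch.
    reverse_kl = kl_divergence(policy, behavior).sum(dim=-1)
    return -reverse_kl.mean()
```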
And S14, updating the action network based on the reward network, the risk network and the distribution similarity, and acquiring the updating times of the action network.
In this embodiment, the action network is updated according to the reward network, the risk network and the distribution similarity. Since the goal is for the reward value of the reward network to be maximized while the risk value of the risk network is kept small, and for the distribution of the action network to end up as similar as possible to the distribution of the sample data, this step can update the action network by determining whether the reward value of the reward network, the risk value of the risk network and the distribution similarity all satisfy the corresponding preset conditions.
In this embodiment, the updated action network should keep the risk value output by the risk network within the preset risk threshold while maximizing the reward value output by the reward network and maximizing the distribution similarity; or it should simultaneously minimize the risk value output by the risk network, maximize the reward value output by the reward network and maximize the distribution similarity.
And S15, when the updating times are less than or equal to the preset threshold value, updating the action network again until the updating times are more than the preset threshold value.
In this embodiment, the training of the action network is completed by executing the above steps a preset number of times. The requirement on the strategy used to collect the data is loose, the method is robust, and it fits actual application scenarios.
In this embodiment, step S13 specifically includes: calculating the distance between the distribution of the sample data and the distribution of the action network based on a reverse relative entropy (reverse KL divergence) distance algorithm, to serve as the distribution similarity.
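Putting steps S11 to S15 together, a minimal sketch of the outer training loop might look as follows; the callable arguments stand in for the per-network update rules detailed under Figs. 2 to 4 and for the distribution-similarity computation above, and all names are hypothetical rather than taken from the patent.

```python
def train_offline(dataset, reward_critic_step, risk_critic_step,
                  similarity_fn, actor_step, num_updates=100000, batch_size=256):
    """Outer loop of the training method (steps S11-S15), written as a sketch."""
    for _ in range(num_updates):            # S15: repeat a preset number of times
        batch = dataset.sample(batch_size)  # S11: acquire sample data
        reward_critic_step(batch)           # S12: update the reward network
        risk_critic_step(batch)             # S12: update the risk network
        similarity = similarity_fn(batch)   # S13: distribution similarity
        actor_step(batch, similarity)       # S14: update the action network
```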
As shown in fig. 2, an embodiment of the present invention provides an offline reinforcement learning network training method. Compared with the training method shown in fig. 1, the difference is that the reward network parameters of the action network are updated according to the sample data and the current action network, and the method specifically comprises the following steps:
S21, calculating the first optimized network parameter of the reward network in the following way:
φ_R = argmin E[(r + γ × max_{a_{t+1}} Q_R(s_{t+1}, a_{t+1}) - Q_R(s_t, a_t))^2]
wherein φ_R is the first optimized network parameter of the reward network, argmin() takes the argument that minimizes the function, r is the single-step reward value in the sample data, γ is the attenuation coefficient in the reinforcement learning method, s_t is the state value of the sample data at time t, a_t is the action value obtained by inputting s_t into the action network, s_{t+1} is the state value of the sample data at time t+1, a_{t+1} is the action value obtained by inputting s_{t+1} into the action network, Q_R is the reward network, max_{a_{t+1}} Q_R(s_{t+1}, a_{t+1}) is the maximum value of the reward network Q_R over different a_{t+1}, and the expectation E[·] is taken over the sample data;
and S22, updating the network parameters of the bonus network according to the first optimized network parameters.
In this embodiment, the parameters of the reward network are updated through the above formula so that the reward network conforms to the sample data and the action network; that is, the reward value output by the reward network can be used to evaluate the output result that the action network produces for the input sample data.
In this embodiment, the single-step reward value r is a value that can be obtained in the Actor-Critic framework, as described in the above embodiments, and is not repeated here. Since the action value a_{t+1} is obtained by inputting the state value s_{t+1} of the sample data into the action network, and there is at least one group of sample data, the maximum value of the reward network Q_R over different a_{t+1} in the above formula can be obtained.
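A minimal sketch of this reward-network update in PyTorch is shown below. Approximating the maximum over a_{t+1} by sampling several candidate actions from the current (stochastic) action network is an implementation assumption made here for illustration; the text does not specify how the maximum is computed for continuous actions, and the batch keys are likewise hypothetical.

```python
import torch
import torch.nn.functional as F

def reward_critic_update(q_r, q_r_optimizer, actor, batch, gamma=0.99, num_samples=10):
    """One gradient step on the reward network Q_R (sketch of the update in Fig. 2)."""
    s, a = batch["state"], batch["action"]
    r, s_next = batch["reward"], batch["next_state"]
    with torch.no_grad():
        # Approximate max over a_{t+1} with several candidate actions from the actor.
        q_next = torch.stack(
            [q_r(s_next, actor(s_next)) for _ in range(num_samples)], dim=0
        )
        target = r + gamma * q_next.max(dim=0).values  # r + gamma * max_{a_{t+1}} Q_R
    td_loss = F.mse_loss(q_r(s, a), target)            # squared Bellman error
    q_r_optimizer.zero_grad()
    td_loss.backward()
    q_r_optimizer.step()
    return td_loss.item()
```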
As shown in fig. 3, an embodiment of the present invention provides an offline reinforcement learning network training method. Compared with the training method shown in fig. 1, the difference is that the risk network parameters of the action network are updated according to the sample data and the current action network, and the method specifically includes the following steps:
S31, calculating the second optimized network parameter of the risk network in the following way:
φ_C = argmin E[(c + γ × E_{a_{t+1}~A_π}[Q_C(s_{t+1}, a_{t+1})] - Q_C(s_t, a_t))^2]
wherein φ_C is the second optimized network parameter of the risk network, argmin() takes the argument that minimizes the function, c is the single-step risk value in the sample data, γ is the attenuation coefficient in the reinforcement learning method, s_t is the state value of the sample data at time t, a_t is the action value obtained by inputting s_t into the action network, s_{t+1} is the state value of the sample data at time t+1, a_{t+1} is the action value obtained by inputting s_{t+1} into the action network, Q_C is the risk network, E_{a_{t+1}~A_π}[Q_C(s_{t+1}, a_{t+1})] is the expected value of the risk network Q_C over different a_{t+1}, the expectation E[·] is taken over the sample data, and A_π is the action network.
And S32, updating the network parameters of the risk network according to the second optimized network parameters.
In this embodiment, each parameter in the present scheme is similar to the parameters in the above embodiments, and details are not repeated in this step.
In this embodiment, when updating the risk network, the expected value of the risk network at the next moment is used. This is done for the following reasons: 1) the definitions of the cumulative risk and the cumulative reward are different; here the goal is to maximize the cumulative reward while keeping the cumulative risk below a given threshold, rather than to minimize the cumulative risk; 2) compared with the online learning situation, it is difficult to evaluate the cumulative risk value in offline learning, so the state-action pair (s, a) with the smallest cumulative risk value is the most likely to lie outside the data distribution, and evaluating the cumulative risk value of such a pair introduces a large error. Using the expectation form therefore greatly mitigates the problem of inaccurate estimates of the cumulative risk value.
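A corresponding sketch of the risk-network update is shown below; it differs from the reward-network sketch only in that the bootstrap target averages Q_C over next actions sampled from the action network instead of taking a maximum. Batch keys and hyperparameters are again illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def risk_critic_update(q_c, q_c_optimizer, actor, batch, gamma=0.99, num_samples=10):
    """One gradient step on the risk network Q_C (sketch of the update in Fig. 3)."""
    s, a = batch["state"], batch["action"]
    c, s_next = batch["risk"], batch["next_state"]
    with torch.no_grad():
        # E_{a_{t+1} ~ A_pi}[Q_C(s_{t+1}, a_{t+1})], estimated by sampling the actor.
        samples = torch.stack(
            [q_c(s_next, actor(s_next)) for _ in range(num_samples)], dim=0
        )
        target = c + gamma * samples.mean(dim=0)
    td_loss = F.mse_loss(q_c(s, a), target)   # squared Bellman error
    q_c_optimizer.zero_grad()
    td_loss.backward()
    q_c_optimizer.step()
    return td_loss.item()
```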
As shown in fig. 4, an embodiment of the present invention provides an offline reinforcement learning network training method. Compared with the training method shown in fig. 1, the difference is that the action network is updated based on the reward network, the risk network and the distribution similarity, and the method comprises the following steps:
S41, obtaining a third optimized network parameter of the action network in the following way:
φ_π = argmax[Q_R - τ×Q_C + L];
τ = argmin|Q - D|;
wherein φ_π is the third optimized network parameter of the action network, Q_R is the reward network, Q_C is the risk network, L is the distribution similarity, τ is a Lagrangian coefficient, and D is a preset threshold for the risk network;
and S42, updating the action network through the third optimized network parameter.
In this embodiment, the third optimized network parameter is obtained through calculation, and the network parameters of the action network are updated with it, thereby completing one optimization pass of the action network. A Lagrangian coefficient is introduced into the formula, the constrained problem is converted into an unconstrained optimization problem through the Lagrangian relaxation method, and a stochastic gradient descent algorithm is used to update the policy and the Lagrangian coefficient.
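The sketch below illustrates one way such a joint update of the action network and the Lagrangian coefficient could be implemented. Keeping τ positive through a log parameterization and updating it by dual ascent on the gap between the estimated cumulative risk and the threshold D are implementation choices assumed here for illustration; the text itself only gives the argmax/argmin form above.

```python
import torch

def actor_update(actor, actor_optimizer, log_tau, tau_optimizer,
                 q_r, q_c, batch, similarity, risk_threshold_d):
    """One gradient step on the action network and the Lagrangian coefficient tau
    (sketch of the update in Fig. 4). `similarity` is assumed to be a
    differentiable tensor computed from the current action network."""
    s = batch["state"]
    tau = log_tau.exp()                      # keep tau positive
    a_pi = actor(s)
    # Maximize Q_R - tau * Q_C + L by minimizing its negation.
    actor_loss = -(q_r(s, a_pi) - tau.detach() * q_c(s, a_pi)).mean() - similarity
    actor_optimizer.zero_grad()
    actor_loss.backward()
    actor_optimizer.step()

    # Dual-ascent step: raise tau when the estimated cumulative risk exceeds the
    # threshold D, lower it otherwise.
    with torch.no_grad():
        risk_gap = q_c(s, actor(s)).mean() - risk_threshold_d
    tau_loss = -(log_tau.exp() * risk_gap)
    tau_optimizer.zero_grad()
    tau_loss.backward()
    tau_optimizer.step()
    return actor_loss.item(), log_tau.exp().item()
```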
As shown in fig. 5, an embodiment of the present invention provides an offline reinforcement learning network training apparatus, and the apparatus includes: an acquisition unit 11, a first updating unit 12, a processing unit 13 and a second updating unit 14.
In this embodiment, the obtaining unit 11 is configured to obtain sample data;
in this embodiment, the first updating unit 12 is configured to update network parameters of a reward network and a risk network of an action network according to sample data and a current action network;
in this embodiment, the processing unit 13 is configured to obtain a distribution similarity between a distribution of sample data and a distribution of an action network;
in this embodiment, the second updating unit 14 is configured to update the action network based on the reward network, the risk network, and the distribution similarity, and obtain the update times of the action network; when the update frequency is less than or equal to the preset threshold, the first obtaining unit 11 obtains the sample data again, and updates the action network again until the update frequency is greater than the preset threshold.
In this embodiment, the first updating unit 12 is specifically configured to calculate the first optimized network parameter of the reward network and to update the network parameters of the reward network according to the first optimized network parameter;
wherein the network parameters of the reward network are calculated in the following way:
φ_R = argmin E[(r + γ × max_{a_{t+1}} Q_R(s_{t+1}, a_{t+1}) - Q_R(s_t, a_t))^2]
wherein φ_R is the first optimized network parameter of the reward network, argmin() takes the argument that minimizes the function, r is the single-step reward value in the sample data, γ is the attenuation coefficient in the reinforcement learning method, s_t is the state value of the sample data at time t, a_t is the action value obtained by inputting s_t into the action network, s_{t+1} is the state value of the sample data at time t+1, a_{t+1} is the action value obtained by inputting s_{t+1} into the action network, Q_R is the reward network, max_{a_{t+1}} Q_R(s_{t+1}, a_{t+1}) is the maximum value of the reward network Q_R over different a_{t+1}, and the expectation E[·] is taken over the sample data;
a first updating unit 12, specifically configured to calculate a second optimized network parameter of the risk network; updating the network parameters of the risk network according to the second optimized network parameters;
wherein the network parameters of the risk network are calculated in the following way:
φ_C = argmin E[(c + γ × E_{a_{t+1}~A_π}[Q_C(s_{t+1}, a_{t+1})] - Q_C(s_t, a_t))^2]
wherein φ_C is the second optimized network parameter of the risk network, argmin() takes the argument that minimizes the function, c is the single-step risk value in the sample data, γ is the attenuation coefficient in the reinforcement learning method, s_t is the state value of the sample data at time t, a_t is the action value obtained by inputting s_t into the action network, s_{t+1} is the state value of the sample data at time t+1, a_{t+1} is the action value obtained by inputting s_{t+1} into the action network, Q_C is the risk network, E_{a_{t+1}~A_π}[Q_C(s_{t+1}, a_{t+1})] is the expected value of the risk network Q_C over different a_{t+1}, the expectation E[·] is taken over the sample data, and A_π is the action network.
In this embodiment, the processing unit 13 is specifically configured to calculate, as the distribution similarity, the distance between the distribution of the sample data and the distribution of the action network based on a reverse relative entropy distance algorithm.
In this embodiment, the second updating unit 14 is specifically configured to calculate a third optimized network parameter of the action network; and updating the action network through the third optimized network parameter.
Wherein the third optimized network parameter of the action network is calculated by the following calculation mode:
φ_π = argmax[Q_R - τ×Q_C + L];
τ = argmin|Q - D|;
wherein φ_π is the third optimized network parameter of the action network, Q_R is the reward network, Q_C is the risk network, L is the distribution similarity, τ is a Lagrangian coefficient, and D is a preset threshold for the risk network.
As shown in fig. 6, an embodiment of the present invention provides an offline reinforcement learning network training system, which includes a processor 1110, a communication interface 1120, a memory 1130, and a communication bus 1140, wherein the processor 1110, the communication interface 1120, and the memory 1130 complete communication with each other through the communication bus 1140;
a memory 1130 for storing computer programs;
the processor 1110, when executing the program stored in the memory 1130, implements the offline reinforcement learning network training method as follows:
acquiring sample data;
updating network parameters of a reward network and a risk network of the action network according to the sample data and the current action network;
acquiring the distribution similarity of the distribution of the sample data and the distribution of the action network;
updating the action network based on the reward network, the risk network and the distribution similarity, and acquiring the updating times of the action network;
and when the updating times are less than or equal to the preset threshold, updating the action network again until the updating times are greater than the preset threshold.
In the electronic device provided by the embodiment of the present invention, by executing the program stored in the memory 1130, the processor 1110 updates the corresponding reward network and risk network according to the sample data and the action network so that they adapt to the action network, obtains the distribution similarity between the distribution of the sample data and the distribution of the output of the action network, updates the action network based on the evaluation of the action network by the reward network and the risk network and on the distribution similarity, and completes the optimization of the action network after the above steps have been repeated a preset number of times.
The communication bus 1140 mentioned in the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus 1140 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface 1120 is used for communication between the electronic device and other devices.
The memory 1130 may include a Random Access Memory (RAM), and may also include a non-volatile memory, such as at least one disk memory. Optionally, the memory 1130 may also be at least one memory device located remotely from the processor 1110.
The processor 1110 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
The embodiment of the present invention provides a computer-readable storage medium, where one or more programs are stored, and the one or more programs are executable by one or more processors to implement the offline reinforcement learning network training method according to any of the above embodiments.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the embodiments of the invention are brought about in whole or in part when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wirelessly (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (ssd)), among others.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. An offline reinforcement learning network training method, the method comprising:
acquiring sample data;
updating network parameters of a reward network and a risk network of the action network according to the sample data and the current action network;
obtaining the distribution similarity of the distribution of the sample data and the distribution of the action network;
updating the action network based on the reward network, the risk network and the distribution similarity, and acquiring the updating times of the action network;
and when the updating times are less than or equal to a preset threshold value, updating the action network again until the updating times are greater than the preset threshold value.
2. The training method according to claim 1, wherein the updating network parameters of a reward network and a risk network of the action network according to the sample data and a current action network comprises:
calculating a first optimized network parameter for the reward network in the following way:
φ_R = argmin E[(r + γ × max_{a_{t+1}} Q_R(s_{t+1}, a_{t+1}) - Q_R(s_t, a_t))^2]
wherein φ_R is the first optimized network parameter of the reward network, argmin() takes the argument that minimizes the function, r is the single-step reward value in the sample data, γ is the attenuation coefficient in the reinforcement learning method, s_t is the state value of the sample data at time t, a_t is the action value obtained by inputting s_t into the action network, s_{t+1} is the state value of the sample data at time t+1, a_{t+1} is the action value obtained by inputting s_{t+1} into the action network, Q_R is the reward network, max_{a_{t+1}} Q_R(s_{t+1}, a_{t+1}) is the maximum value of the reward network Q_R over different a_{t+1}, and the expectation E[·] is taken over the sample data;
updating the network parameters of the reward network according to the first optimized network parameters;
calculating a second optimized network parameter for the risk network in the following way:
φ_C = argmin E[(c + γ × E_{a_{t+1}~A_π}[Q_C(s_{t+1}, a_{t+1})] - Q_C(s_t, a_t))^2]
wherein φ_C is the second optimized network parameter of the risk network, argmin() takes the argument that minimizes the function, c is the single-step risk value in the sample data, γ is the attenuation coefficient in the reinforcement learning method, s_t is the state value of the sample data at time t, a_t is the action value obtained by inputting s_t into the action network, s_{t+1} is the state value of the sample data at time t+1, a_{t+1} is the action value obtained by inputting s_{t+1} into the action network, Q_C is the risk network, E_{a_{t+1}~A_π}[Q_C(s_{t+1}, a_{t+1})] is the expected value of the risk network Q_C over different a_{t+1}, the expectation E[·] is taken over the sample data, and A_π is the action network;
and updating the network parameters of the risk network according to the second optimized network parameters.
3. The training method according to claim 1, wherein the obtaining of the similarity between the distribution of the sample data and the distribution of the action network comprises:
and calculating the distance between the distribution of the sample data and the distribution of the action network based on a reverse relative entropy distance algorithm, to serve as the distribution similarity.
4. A training method as claimed in any one of claims 1 to 3, wherein the updating the action network based on the reward network, risk network and distribution similarity comprises:
obtaining a third optimized network parameter of the action network by the following calculation mode:
φ_π = argmax[Q_R - τ×Q_C + L];
τ = argmin|Q - D|;
wherein φ_π is the third optimized network parameter of the action network, Q_R is the reward network, Q_C is the risk network, L is the distribution similarity, τ is a Lagrangian coefficient, and D is a preset threshold for the risk network;
updating the action network by the third optimized network parameter.
5. An offline reinforcement learning network training apparatus, the apparatus comprising:
an acquisition unit configured to acquire sample data;
the first updating unit is used for updating network parameters of a reward network and a risk network of the action network according to the sample data and the current action network;
the processing unit is used for acquiring the distribution similarity of the distribution of the sample data and the distribution of the action network;
the second updating unit is used for updating the action network based on the reward network, the risk network and the distribution similarity and acquiring the updating times of the action network; and when the updating times are less than or equal to a preset threshold value, obtaining the sample data again through the first obtaining unit, and updating the action network again until the updating times are more than the preset threshold value.
6. Training device according to claim 5, wherein the first updating unit is specifically configured to calculate a first optimized network parameter of the reward network: updating the network parameters of the reward network according to the first optimized network parameters;
wherein the network parameters of the reward network are calculated in the following way:
φ_R = argmin E[(r + γ × max_{a_{t+1}} Q_R(s_{t+1}, a_{t+1}) - Q_R(s_t, a_t))^2]
wherein φ_R is the first optimized network parameter of the reward network, argmin() takes the argument that minimizes the function, r is the single-step reward value in the sample data, γ is the attenuation coefficient in the reinforcement learning method, s_t is the state value of the sample data at time t, a_t is the action value obtained by inputting s_t into the action network, s_{t+1} is the state value of the sample data at time t+1, a_{t+1} is the action value obtained by inputting s_{t+1} into the action network, Q_R is the reward network, max_{a_{t+1}} Q_R(s_{t+1}, a_{t+1}) is the maximum value of the reward network Q_R over different a_{t+1}, and the expectation E[·] is taken over the sample data;
the first updating unit is specifically configured to calculate a second optimized network parameter of the risk network; updating the network parameters of the risk network according to the second optimized network parameters;
wherein the network parameters of the risk network are calculated by the following calculation method:
φ_C = argmin E[(c + γ × E_{a_{t+1}~A_π}[Q_C(s_{t+1}, a_{t+1})] - Q_C(s_t, a_t))^2]
wherein φ_C is the second optimized network parameter of the risk network, argmin() takes the argument that minimizes the function, c is the single-step risk value in the sample data, γ is the attenuation coefficient in the reinforcement learning method, s_t is the state value of the sample data at time t, a_t is the action value obtained by inputting s_t into the action network, s_{t+1} is the state value of the sample data at time t+1, a_{t+1} is the action value obtained by inputting s_{t+1} into the action network, Q_C is the risk network, E_{a_{t+1}~A_π}[Q_C(s_{t+1}, a_{t+1})] is the expected value of the risk network Q_C over different a_{t+1}, the expectation E[·] is taken over the sample data, and A_π is the action network.
7. The training apparatus according to claim 5, wherein the processing unit is specifically configured to calculate, as the distribution similarity, a distance between the distribution of the sample data and the distribution of the action network based on a reverse relative entropy distance algorithm.
8. A training apparatus as claimed in any one of claims 5 to 7, wherein the second updating unit is specifically configured to calculate a third optimized network parameter of the action network; updating the action network through the third optimized network parameter;
wherein the third optimized network parameter of the action network is calculated by the following calculation method:
φ_π = argmax[Q_R - τ×Q_C + L];
τ = argmin|Q - D|;
wherein φ_π is the third optimized network parameter of the action network, Q_R is the reward network, Q_C is the risk network, L is the distribution similarity, τ is a Lagrangian coefficient, and D is a preset threshold for the risk network.
9. An offline reinforcement learning network training system is characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the offline reinforcement learning network training method according to any one of claims 1 to 4 when executing a program stored in the memory.
10. A computer-readable storage medium storing one or more programs, the one or more programs being executable by one or more processors to implement the offline reinforcement learning network training method of any one of claims 1 to 4.
CN202010479469.3A 2020-05-29 2020-05-29 Offline reinforcement learning network training method, device, system and storage medium Pending CN111652371A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010479469.3A CN111652371A (en) 2020-05-29 2020-05-29 Offline reinforcement learning network training method, device, system and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010479469.3A CN111652371A (en) 2020-05-29 2020-05-29 Offline reinforcement learning network training method, device, system and storage medium

Publications (1)

Publication Number Publication Date
CN111652371A true CN111652371A (en) 2020-09-11

Family

ID=72348144

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010479469.3A Pending CN111652371A (en) 2020-05-29 2020-05-29 Offline reinforcement learning network training method, device, system and storage medium

Country Status (1)

Country Link
CN (1) CN111652371A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112668235A (en) * 2020-12-07 2021-04-16 中原工学院 Robot control method of DDPG algorithm based on offline model pre-training learning
CN112668235B (en) * 2020-12-07 2022-12-09 中原工学院 Robot control method based on off-line model pre-training learning DDPG algorithm
CN113360618A (en) * 2021-06-07 2021-09-07 暨南大学 Intelligent robot dialogue method and system based on offline reinforcement learning
CN113360618B (en) * 2021-06-07 2022-03-11 暨南大学 Intelligent robot dialogue method and system based on offline reinforcement learning
CN114484584A (en) * 2022-01-20 2022-05-13 国电投峰和新能源科技(河北)有限公司 Heat supply control method and system based on offline reinforcement learning
CN114484584B (en) * 2022-01-20 2022-11-11 国电投峰和新能源科技(河北)有限公司 Heat supply control method and system based on offline reinforcement learning
CN116679615A (en) * 2023-08-03 2023-09-01 中科航迈数控软件(深圳)有限公司 Optimization method and device of numerical control machining process, terminal equipment and storage medium
CN116679615B (en) * 2023-08-03 2023-10-20 中科航迈数控软件(深圳)有限公司 Optimization method and device of numerical control machining process, terminal equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111652371A (en) Offline reinforcement learning network training method, device, system and storage medium
KR20190028531A (en) Training machine learning models for multiple machine learning tasks
EP3568810B1 (en) Action selection for reinforcement learning using neural networks
CN114787832A (en) Method and server for federal machine learning
WO2017091629A1 (en) Reinforcement learning using confidence scores
US20070260563A1 (en) Method to continuously diagnose and model changes of real-valued streaming variables
CN110770764A (en) Method and device for optimizing hyper-parameters
EP3571631A1 (en) Noisy neural network layers
Rothfuss et al. Meta-learning priors for safe bayesian optimization
WO2020030052A1 (en) Animal count identification method, device, medium, and electronic apparatus
WO2021077097A1 (en) Systems and methods for training generative models using summary statistics and other constraints
CN109190757B (en) Task processing method, device, equipment and computer readable storage medium
CN115296984A (en) Method, device, equipment and storage medium for detecting abnormal network nodes
CN113158550B (en) Method and device for federated learning, electronic equipment and storage medium
US20210166131A1 (en) Training spectral inference neural networks using bilevel optimization
US20220148290A1 (en) Method, device and computer storage medium for data analysis
CN110399279B (en) Intelligent measurement method for non-human intelligent agent
CN111353597B (en) Target detection neural network training method and device
CN113505859B (en) Model training method and device, and image recognition method and device
US11501207B2 (en) Lifelong learning with a changing action set
US11710301B2 (en) Apparatus for Q-learning for continuous actions with cross-entropy guided policies and method thereof
EP3745313A1 (en) A predictive maintenance system for equipment with sparse sensor measurements
CN112101563A (en) Confidence domain strategy optimization method and device based on posterior experience and related equipment
CN111368792A (en) Characteristic point mark injection molding type training method and device, electronic equipment and storage medium
CN114844889B (en) Video processing model updating method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200911)