CN111242280A - Deep reinforcement learning model combination method and device and computer equipment - Google Patents


Info

Publication number
CN111242280A
Authority
CN
China
Prior art keywords
reinforcement learning
depth
learning models
data
learning model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010009647.6A
Other languages
Chinese (zh)
Inventor
温建伟
王宇杰
袁潮
方璐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhuohe Technology Co ltd
Original Assignee
Beijing Zhuohe Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhuohe Technology Co Ltd filed Critical Beijing Zhuohe Technology Co Ltd
Priority to CN202010009647.6A priority Critical patent/CN111242280A/en
Publication of CN111242280A publication Critical patent/CN111242280A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The present text discloses a method and an apparatus for combining deep reinforcement learning models, and a computer device, and relates to deep reinforcement learning technology. Disclosed herein is a method of combining deep reinforcement learning models, comprising: determining weight information of each deep reinforcement learning model among a plurality of deep reinforcement learning models used in combination, and respectively transmitting data to be processed to the plurality of deep reinforcement learning models used in combination to obtain a plurality of output data; and performing a weighted-average calculation on the plurality of output data according to the weight information of the corresponding deep reinforcement learning models, the calculated result being the output result of using the plurality of deep reinforcement learning models in combination. According to this technical scheme, the output result of using multiple deep reinforcement learning models in combination is determined based on the weight information of the different deep reinforcement learning models, and the obtained output result is more accurate and efficient.

Description

Deep reinforcement learning model combination method and device and computer equipment
Technical Field
The present invention relates to a deep reinforcement learning technology, and in particular, to a method and an apparatus for combining deep reinforcement learning models, and a computer device.
Background
Deep reinforcement learning combines the perception capability of deep learning with the decision-making capability of reinforcement learning, and proceeds through learning and feedback between an agent and its environment. Deep reinforcement learning enables rapid accumulation of experience and dynamic planning for real-time conditions. For example, a game character is an agent, and deep reinforcement learning can determine how the game character takes a series of actions in the learning environment so as to obtain the maximum cumulative reward. The following concepts are involved: the state (s), i.e. the state the agent is currently in; the policy, i.e. how to act in the current state; the action (a), i.e. the action taken by the agent according to the policy; the reward (r), i.e. the reward obtained after the corresponding action is taken in the current state; and the model, i.e. the means by which the next state can be obtained given the current state and action. Q-Learning is a very popular deep reinforcement learning technique, in which the Q function Q(s, a) represents the total reward value that can be obtained after performing action a from state s under a specific policy.
In the related art, deep reinforcement learning algorithms can be combined, and model fusion is generally performed by simply averaging and summing multiple reinforcement learning models. However, when the state distributions of the models occupy widely separated regions of the feature space, the fused model cannot solve every sub-problem simultaneously, and may even be unable to handle any single sub-problem individually.
Disclosure of Invention
The application provides a combination method and device of a deep reinforcement learning model and computer equipment.
The application discloses a combination method of a deep reinforcement learning model, which comprises the following steps:
determining weight information of each deep reinforcement learning model among a plurality of deep reinforcement learning models used in combination, wherein the weight information of a deep reinforcement learning model characterizes the degree to which the output data of that model influences the output result of using the plurality of deep reinforcement learning models in combination;
respectively transmitting data to be processed to the plurality of deep reinforcement learning models used in combination to obtain a plurality of output data;
and performing a weighted-average calculation on the plurality of output data according to the weight information of the corresponding deep reinforcement learning models, wherein the calculated result is the output result of using the plurality of deep reinforcement learning models in combination.
Optionally, in the above method, the determining of weight information of each of a plurality of deep reinforcement learning models used in combination includes:
determining the similarity between the preset input data of each deep reinforcement learning model and the data to be processed;
respectively determining the weight information of each deep reinforcement learning model according to the similarity between the preset input data of that deep reinforcement learning model and the data to be processed;
wherein the weight information of a deep reinforcement learning model is positively correlated with the similarity between its preset input data and the data to be processed.
Optionally, in the method, the performing of the weighted-average calculation on the plurality of output data according to the weight information of the corresponding deep reinforcement learning models includes:
calculating a weighted sum of the Q function values output by the plurality of deep reinforcement learning models used in combination according to the weight information of each deep reinforcement learning model;
and calculating the weighted average of the Q function values according to the weighted sum of the Q function values and the number of deep reinforcement learning models used in combination.
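The two calculation steps above can be sketched as follows (a minimal illustration, not the patent's own code; the function name is hypothetical, and the weight coefficients are assumed to sum to the number n of models, as this text later specifies):

```python
def combine_q_values(q_values, weights):
    """Weighted average of per-model Q function values.

    q_values: one Q function value per deep RL model.
    weights:  weight coefficients alpha_i, assumed to sum to n.
    Returns sum(alpha_i * Q_i) / n.
    """
    assert len(q_values) == len(weights)
    n = len(q_values)
    # Step 1: weighted sum of the Q function values.
    weighted_sum = sum(a * q for a, q in zip(weights, q_values))
    # Step 2: divide by the number of models used in combination.
    return weighted_sum / n
```

With equal weights alpha_i = 1 this reduces to the plain averaging of the related art.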
Optionally, the method further includes:
and generating a classification model according to the training data of each deep reinforcement learning model used in combination, wherein the classification model is used for determining the similarity between the preset input data of the different deep reinforcement learning models and the same input data.
Optionally, in the above method, the classification model includes at least one of a classifier constructed based on a variational autoencoder and a classification model constructed based on a neural network.
The application also discloses a combination apparatus for deep reinforcement learning models, comprising:
a weight information determining module, configured to determine weight information of each deep reinforcement learning model among a plurality of deep reinforcement learning models used in combination, wherein the weight information of a deep reinforcement learning model characterizes the degree to which the output data of that model influences the output result of using the plurality of deep reinforcement learning models in combination;
a data transmission module, configured to respectively transmit the data to be processed to the plurality of deep reinforcement learning models used in combination to obtain a plurality of output data;
and a calculation module, configured to perform a weighted-average calculation on the plurality of output data according to the weight information of the corresponding deep reinforcement learning models, the calculated result being the output result of using the plurality of deep reinforcement learning models in combination.
Optionally, in the above apparatus, the weight information determining module includes:
a first weight information determining submodule, configured to determine the similarity between the preset input data of each deep reinforcement learning model and the data to be processed;
a second weight information determining submodule, configured to respectively determine the weight information of each deep reinforcement learning model according to the similarity between the preset input data of that deep reinforcement learning model and the data to be processed;
wherein the weight information of a deep reinforcement learning model is positively correlated with the similarity between its preset input data and the data to be processed.
Optionally, in the above apparatus, the calculation module includes:
a first calculation submodule, configured to calculate a weighted sum of the Q function values output by the plurality of deep reinforcement learning models used in combination according to the weight information of each deep reinforcement learning model;
and a second calculation submodule, configured to calculate the weighted average of the Q function values according to the weighted sum of the Q function values and the number of deep reinforcement learning models used in combination.
Optionally, the apparatus further comprises:
a classification module, configured to generate a classification model according to the training data of each deep reinforcement learning model used in combination, the classification model being used for determining the similarity between the preset input data of the different deep reinforcement learning models and the same input data;
wherein the first weight information determining submodule determines the similarity between the preset input data of each deep reinforcement learning model and the data to be processed through the classification module.
Optionally, in the above apparatus, the classification module includes at least one of a classifier constructed based on a variational autoencoder and a classification model constructed based on a neural network.
The application also discloses a computer device, comprising:
a processor;
and a memory storing processor-executable instructions;
wherein the processor is configured to:
execute instructions implementing the above method of combining deep reinforcement learning models.
The present application also discloses a computer readable storage medium having a computer program stored thereon, wherein the computer program when executed implements the steps of the method of combining deep reinforcement learning models as described above.
The technical scheme of the present application provides a scheme for combining deep reinforcement learning models, taking into account that different deep reinforcement learning models influence the output data to different degrees. Therefore, based on the weight information of the different deep reinforcement learning models, the output result of using a plurality of deep reinforcement learning models in combination is determined as a weighted average of the output results of the plurality of models. This realizes the fusion of deep reinforcement learning models, and the obtained output result is more accurate and efficient.
Drawings
Fig. 1 is a flowchart illustrating a method for combining deep reinforcement learning models according to an exemplary embodiment of the present application.
Fig. 2 is a schematic diagram illustrating a method for combining deep reinforcement learning models according to an exemplary embodiment of the present application.
FIG. 3 is a block diagram of a combination apparatus of a deep reinforcement learning model according to an exemplary embodiment of the present application.
FIG. 4 is a block diagram of a combination apparatus of a deep reinforcement learning model according to an exemplary embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be further described in detail with reference to the accompanying drawings. It should be noted that the embodiments and features of the embodiments of the present application may be arbitrarily combined with each other without conflict.
Fig. 1 is a flowchart illustrating a method for combining deep reinforcement learning models according to this embodiment. As shown in fig. 1, the method includes the operations of:
Step S101: determining weight information of each deep reinforcement learning model among a plurality of deep reinforcement learning models used in combination, wherein the weight information of a deep reinforcement learning model characterizes the degree to which the output data of that model influences the output result of using the plurality of deep reinforcement learning models in combination;
Step S102: respectively transmitting the data to be processed to the plurality of deep reinforcement learning models used in combination to obtain a plurality of output data;
Step S103: performing a weighted-average calculation on the plurality of output data according to the weight information of the corresponding deep reinforcement learning models, the calculated result being the output result of using the plurality of deep reinforcement learning models in combination.
The weight information of the different deep reinforcement learning models indicates the degree to which the output data of each deep reinforcement learning model influences the target output result when the plurality of deep reinforcement learning models are used in combination. For example, the output data of each deep reinforcement learning model may be one factor in the final output result of the combination of the plurality of models.
It can be seen that, compared with the related-art approach of simply and directly averaging the output results of multiple deep reinforcement learning models, the technical scheme of the present application recognizes that, when a plurality of deep reinforcement learning models are used in combination, the output data of different models may influence the target output result to different degrees. Therefore, based on the weight information of the different deep reinforcement learning models, the result of using the plurality of models in combination is determined as a weighted average of their output data. The output result determined in this way is more accurate and efficient.
This embodiment further provides a method for combining deep reinforcement learning models, in which determining the weight information of each of a plurality of deep reinforcement learning models used in combination includes:
determining the similarity between the preset input data of each deep reinforcement learning model and the data to be processed;
respectively determining the weight information of each deep reinforcement learning model according to the similarity between the preset input data of that deep reinforcement learning model and the data to be processed;
wherein the weight information of a deep reinforcement learning model is positively correlated with the similarity between its preset input data and the data to be processed.
Herein, the positive correlation between the weight information of a deep reinforcement learning model and the similarity between its preset input data and the data to be processed means that the higher the similarity between the preset input data of the model and the data to be processed, the larger the weight information of that model; correspondingly, the lower the similarity, the smaller the weight information. The weight information of a deep reinforcement learning model may include a weight coefficient.
The similarity between the preset input data of a deep reinforcement learning model and the data to be processed may include the similarity between the feature information of the preset input data and the feature information of the data to be processed. For example, when deep reinforcement learning is used to determine how a game character takes a series of actions in the learning environment so as to obtain the maximum cumulative reward, the data to be processed may include the state S of the game character. The state S may be characterized by a set of feature information, and the similarity between the state S and the preset input data of a deep reinforcement learning model can be determined by comparing the feature information of the preset input data with the feature information contained in the state S. For example, consider two different deep reinforcement learning models: the feature information of the preset input data of the first model is entirely the same or substantially the same as the feature information contained in the state S, whereas only part of the feature information of the preset input data of the second model is the same or substantially the same as that contained in the state S. The similarity between the preset input data of the first model and the data to be processed is then higher than that of the second model.
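The text does not fix a concrete similarity measure for the feature information; one common choice, shown here purely as an illustrative assumption, is cosine similarity over numeric feature vectors:

```python
import math

def feature_similarity(preset_features, state_features):
    """Cosine similarity between a model's preset-input feature
    vector and the feature vector of the data to be processed
    (e.g. the state S). Returns a value in [-1, 1]; higher means
    the two feature sets are more alike."""
    dot = sum(p * s for p, s in zip(preset_features, state_features))
    norm_p = math.sqrt(sum(p * p for p in preset_features))
    norm_s = math.sqrt(sum(s * s for s in state_features))
    if norm_p == 0.0 or norm_s == 0.0:
        return 0.0  # no usable feature information
    return dot / (norm_p * norm_s)
```

Identical feature vectors score 1, orthogonal ones score 0, matching the first-model/second-model contrast above.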
As can be seen from the above description, the weight information determined in this embodiment from the similarity between a model's preset input data and the data to be processed indicates that the higher this similarity, the closer the model's output data is to the output data corresponding to the data to be processed, i.e. the greater the influence of that model's output data on the output result of using the plurality of models in combination. In this way, the output data of each deep reinforcement learning model can be adjusted based on its weight information, so that output data from models better matched to the data to be processed carries more weight in the final output result of the combination. The final output result thus obtained is closer to the actual output result.
This embodiment further provides a method for combining deep reinforcement learning models, in which performing the weighted-average calculation on the plurality of output data according to the weight information of the corresponding deep reinforcement learning models includes:
calculating a weighted sum of the Q function values output by the plurality of deep reinforcement learning models used in combination according to the weight information of each deep reinforcement learning model used in combination;
and calculating the weighted average of the Q function values according to the weighted sum of the Q function values and the number of deep reinforcement learning models used in combination.
The Q function value output by a deep reinforcement learning model may include a calculated value of the function Q(s, a) in the deep reinforcement learning application.
Suppose the Q function corresponding to each deep reinforcement learning model is Q_i, i.e. Q_i represents the Q function of the i-th deep reinforcement learning model. The weighted sum of the Q functions of the plurality of deep reinforcement learning models used in combination is computed as

∑_{i=1}^{n} α_i·Q_i

where α_i represents the weight coefficient of the i-th deep reinforcement learning model.
The weighted average of the Q function values may be calculated by the following formula:

Q = (1/n)·∑_{i=1}^{n} α_i·Q_i

where n is the number of deep reinforcement learning models used in combination.
In this embodiment, the sum of the weight coefficients of the n deep reinforcement learning models is equal to n, i.e.

∑_{i=1}^{n} α_i = n.
It can be seen that, in the technical solution of this embodiment, based on the weight information of the different deep reinforcement learning models, the output result of using a plurality of deep reinforcement learning models in combination is determined as a weighted average of the Q function values of the plurality of models. The output result determined in this way is more accurate and efficient.
The embodiment also provides a combination method of the deep reinforcement learning model, and the method further includes:
and generating a classification model according to the training data of each deep reinforcement learning model used in combination, wherein the classification model is used for determining the similarity between the preset input data of the different deep reinforcement learning models and the same input data.
In this context, historical input data in the training data of the different deep reinforcement learning models can be sampled for training.
The classification model analyzes the historical input data of the different deep reinforcement learning models to determine the similarity between the preset input data of the different models and the same input data, thereby realizing a classification operation over the different deep reinforcement learning models. That is, the classification model distinguishes the similarities between the preset input data of the different deep reinforcement learning models and the same input data.
In this embodiment, the classification model includes at least one of a classifier constructed based on a variational autoencoder and a classification model constructed based on a neural network.
When the classification model includes a classifier constructed based on a variational autoencoder, the training input data of the different deep reinforcement learning models can be sampled, and a variational-autoencoder-based discrimination network corresponding to each deep reinforcement learning model can be trained respectively. In this way, when new input data is received, the similarity between the current input data and the input data on which each deep reinforcement learning model was trained can be determined from the discrimination network corresponding to that model.
The following describes an implementation process of the above deep reinforcement learning model combination method, taking a practical application as an example.
This embodiment takes the most widespread Q-Learning application as an example to illustrate the combination process of deep reinforcement learning models. The Q function in Q-Learning, Q(s, a), indicates the total reward value that can be obtained after performing action a from state s under a specific policy. The principle of the process is shown in Fig. 2: the classification results (i.e. D_1, D_2, …, D_n in Fig. 2) comparing the current input data against the plurality of deep reinforcement learning models used in combination (i.e. Q_1, Q_2, …, Q_n in Fig. 2) are determined in real time, the classification results are converted into weight coefficients (i.e. α_1, α_2, …, α_n in Fig. 2), and the weighted-average function

Q = (1/n)·∑_{i=1}^{n} α_i·Q_i

is then determined, namely the Q function corresponding to the combined use of the multiple deep reinforcement learning models.
The combination process of the multiple deep reinforcement learning models comprises the following operations:
Step 1: collect training data for each deep reinforcement learning model respectively, and add the collected training data into a buffer;
herein, the training data collected for the different deep reinforcement learning models can be distinguished from one another;
the training data may be collected in various manners.
Assume that an action a_m is randomly selected based on the current state; a set of training data (s_m, a_m, s_m', r_m) is then obtained through the deep reinforcement learning model, where m denotes time m, m' denotes the time next to time m, s_m represents the state at time m, a_m represents the action at time m, s_m' represents the state at time m', and r_m represents the reward at time m. In the same way, the collected training data are added into the buffer.
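Step 1 can be sketched as follows (hypothetical environment interface; only the tuple layout (s_m, a_m, s_m', r_m) comes from the text):

```python
import random

def collect_transition(env, buffer):
    """Randomly select an action in the current state, step the
    environment, and append the transition (s_m, a_m, s_m', r_m)
    to the buffer, as in step 1."""
    s_m = env.state                       # state at time m
    a_m = random.choice(env.actions)      # randomly selected action
    s_next, r_m = env.step(a_m)           # next state s_m' and reward r_m
    buffer.append((s_m, a_m, s_next, r_m))
```

In practice one such buffer would be kept per deep reinforcement learning model, so the training data of the different models stay distinguishable.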
Step 2: for each deep reinforcement learning model to be combined, acquire new training data and randomly sample training data from the buffer, and use them respectively as positive-sample input data and negative-sample input data for training the discrimination network corresponding to each deep reinforcement learning model, so as to determine the similarity between the same input data and the training input data of the different deep reinforcement learning models; the discrimination networks corresponding to the deep reinforcement learning models together form a classification model.
Herein, the generated classification model may take various forms.
For example, a classification model may be constructed based on variational autoencoders. For each deep reinforcement learning model, two side-by-side variational autoencoders can be used to form a discrimination network: positive and negative samples are input respectively, and the outputs of the variational autoencoders are stacked and then passed through a multilayer perceptron to obtain the output result. The discrimination network learned in this way can determine the similarity between the training input data of the deep reinforcement learning model and the current input data. Combining the discrimination networks corresponding to each deep reinforcement learning model yields a classifier, which belongs to the classification model.
For another example, a classification model based on a neural network may be established; that is, the input data of the different deep reinforcement learning models are trained and learned with a target neural network so as to determine the similarity between the training input data of the different models and the current input data. The classification model learned by such training is likewise referred to herein as the classification model.
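As an illustrative stand-in for either construction above, the per-model discrimination idea can be sketched with a tiny logistic-regression discriminator trained on positive samples (a model's own training inputs) and negative samples (inputs from the other models). This deliberately swaps in a much simpler classifier for the variational-autoencoder discrimination network, purely for brevity; all names are hypothetical:

```python
import math

def train_discriminator(pos, neg, epochs=300, lr=0.5):
    """Train a logistic-regression discriminator whose output
    approximates the similarity between an input and this model's
    own training inputs (near 1 for positive-like inputs, near 0
    for negative-like inputs)."""
    dim = len(pos[0])
    w = [0.0] * dim
    b = 0.0
    data = [(x, 1.0) for x in pos] + [(x, 0.0) for x in neg]
    for _ in range(epochs):
        for x, y in data:
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y                                  # log-loss gradient
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return lambda x: 1.0 / (1.0 + math.exp(
        -(sum(wi * xi for wi, xi in zip(w, x)) + b)))
```

One such discriminator per deep reinforcement learning model, evaluated on the same input, yields the per-model scores D_1 … D_n of step 3.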
Step 3: when the data to be processed is acquired, send the data to be processed to the classification model to obtain a classification result D, and convert the classification result into the weight coefficients α of the deep reinforcement learning models.
Herein, the obtained classification result includes the classification result of each combined deep reinforcement learning model compared with the data to be processed. For example, the obtained classification result includes the classification result D_1 of the first deep reinforcement learning model, the classification result D_2 of the second deep reinforcement learning model, …, and the classification result D_n of the n-th deep reinforcement learning model. The classification result herein may include the similarity between the preset input data of a deep reinforcement learning model and the data to be processed.
The weight coefficients α into which the classification results are converted include the weight coefficient of each deep reinforcement learning model used in combination, for example the weight coefficient α_1 of the first deep reinforcement learning model, the weight coefficient α_2 of the second deep reinforcement learning model, …, and the weight coefficient α_n of the n-th deep reinforcement learning model.
In this embodiment, the classification result and the weight coefficient are positively correlated. That is, the closer the data to be processed is to the preset input data of a deep reinforcement learning model to be combined, i.e. the higher the similarity between them, the larger the converted weight coefficient; the more the data to be processed deviates from the preset input data of a model, i.e. the lower the similarity, the smaller the converted weight coefficient.
For example, the classification result and the weight coefficient may have a linear relationship α_i = μ·D_i, or an exponential relationship α_i ∝ exp(D_i).
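A sketch of this conversion using the exponential relationship, rescaled so that the coefficients sum to n as required elsewhere in this text (the function name is hypothetical):

```python
import math

def classification_to_weights(d_scores):
    """Convert classification results D_1..D_n into weight
    coefficients alpha_1..alpha_n, with alpha_i proportional
    to exp(D_i) and rescaled so that sum(alpha_i) == n."""
    n = len(d_scores)
    e = [math.exp(d) for d in d_scores]  # exponential relationship
    total = sum(e)
    return [n * ei / total for ei in e]  # rescale to sum to n
```

A higher classification result D_i thus yields a larger weight coefficient α_i, matching the positive correlation described above.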
Step 4: determine the Q function of the plurality of deep reinforcement learning models used in combination according to the weight coefficients of the different deep reinforcement learning models.
In step 4, the Q function may be determined according to the following Formula 1 or Formula 2:

Q = (1/n)·∑_{i=1}^{n} α_i·Q_i        (Formula 1)

Q = ∑_{i=1}^{n} (α_i/n)·Q_i        (Formula 2)

where n is the total number of deep reinforcement learning models used in combination, and α_i is the weight coefficient of the i-th deep reinforcement learning model;
in Formula 1, the sum of the weight coefficients α of the n deep reinforcement learning models used in combination is n;
in Formula 2, the value obtained by dividing the weight coefficient α of each deep reinforcement learning model by n is less than 1, and the sum of these values over the n models used in combination is equal to 1;
Q_i is the Q function (state-action value function) of the i-th deep reinforcement learning model.
Here, the Q function (state-action value function) of the deep reinforcement learning model may include the Q function used in the Soft Q-Learning method.
It can be seen that the output result calculated according to the Q function is a weighted average of the output results of the plurality of deep reinforcement learning models, that is, a final output result obtained by using the plurality of deep reinforcement learning models in combination.
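The weighted average described above can be sketched in a few lines, following formula 1 (Q = (α₁Q₁ + … + αₙQₙ)/n with the weight coefficients summing to n). The function name and the list-based interface are illustrative assumptions, not part of the patent.

```python
def combined_q(q_values, alphas):
    # q_values[i]: the Q function value output by the i-th deep
    # reinforcement learning model for the same state-action pair;
    # alphas[i]: its weight coefficient (sum of alphas assumed to be n)
    n = len(q_values)
    return sum(a * q for a, q in zip(alphas, q_values)) / n
```

With equal weights (αᵢ = 1 for all i), the result reduces to the plain average of the models' Q values, as expected for a weighted mean.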
In addition, the operation of step 3 may be performed each time input data is acquired: the acquired input data is sent to the classification model to obtain a classification result, and the classification result is converted into weight coefficients. That is, the present embodiment may determine the weight coefficients of the deep reinforcement learning models in real time for different input data, in order to calculate the weighted average of the output results of the plurality of deep reinforcement learning models. In this way, the output result obtained by using the plurality of deep reinforcement learning models in combination is more accurate for different input data.
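The real-time flow just described, classify each new input, convert the classification result to weight coefficients, then average the models' Q outputs, can be sketched end to end. Here `classify` and the entries of `models` are hypothetical callables standing in for the trained classification model and the deep reinforcement learning models; the normalization so weights sum to n follows formula 1.

```python
def combine_for_input(x, classify, models):
    sims = classify(x)                       # step 3: per-model classification result
    n = len(models)
    total = sum(sims)
    alphas = [n * s / total for s in sims]   # convert to weights summing to n
    qs = [m(x) for m in models]              # each model returns a list of Q values
    n_actions = len(qs[0])
    # step 4: weighted average of the Q values, action by action
    return [sum(a * q[j] for a, q in zip(alphas, qs)) / n
            for j in range(n_actions)]
```

Because the weights are recomputed per input, a model whose preset input data is closer to the current input dominates the combined Q values for that input, as the embodiment intends.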
Fig. 3 is a schematic structural diagram of a combination apparatus of a deep reinforcement learning model according to an exemplary embodiment. As shown in fig. 3, the apparatus includes at least a weight information determination module 31, a data transmission module 32, and a calculation module 33.
The weight information determining module 31 is configured to determine weight information of each of a plurality of deep reinforcement learning models used in combination, wherein the weight information of a deep reinforcement learning model is used to characterize the degree of influence of the output data of that model on the output result obtained by using the plurality of deep reinforcement learning models in combination;

the data transmission module 32 is configured to transmit the data to be processed to each of the plurality of deep reinforcement learning models used in combination to obtain a plurality of output data;

and the calculation module 33 is configured to perform a weighted average calculation on the plurality of output data according to the weight information of the corresponding deep reinforcement learning models, wherein the result of the calculation is the output result obtained by using the plurality of deep reinforcement learning models in combination.
This embodiment also provides a combination apparatus of deep reinforcement learning models, in which the weight information determining module includes:

a first weight information determining submodule configured to determine the similarity between the preset input data of each deep reinforcement learning model and the data to be processed;

a second weight information determining submodule configured to determine the weight information of each deep reinforcement learning model according to the similarity between the preset input data of that model and the data to be processed;

wherein the weight information of a deep reinforcement learning model is positively correlated with the similarity between the preset input data of that model and the data to be processed.
This embodiment also provides a combination apparatus of deep reinforcement learning models, in which the calculation module includes:

a first calculation submodule configured to calculate a weighted sum of the Q function values output by the plurality of deep reinforcement learning models used in combination, according to the weight information of each deep reinforcement learning model;

and a second calculation submodule configured to calculate the weighted average of the Q function values according to the weighted sum of the Q function values and the number of deep reinforcement learning models used in combination.

This embodiment further provides a combination apparatus of deep reinforcement learning models, the apparatus further including:

a classification module configured to generate a classification model according to the training data of each deep reinforcement learning model used in combination, wherein the classification model is used to determine the similarity between the preset input data of different deep reinforcement learning models and the same input data;

in this case, the first weight information determining submodule determines the similarity between the preset input data of each deep reinforcement learning model and the data to be processed through the classification module.

In the apparatus, the classification module includes at least one of a classifier constructed based on a variational autoencoder and a classification model constructed based on a neural network.
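As a toy stand-in for the classification module, the sketch below scores an input's similarity to each model's preset input data with a softmax over negative squared distance to a per-model prototype vector. This is only an interface illustration (one similarity per model, summing to 1); the patent's classifier may instead be built from a variational autoencoder or a neural network, and the prototype-distance formulation is an assumption.

```python
import math

def softmax_similarities(x, prototypes):
    # Squared Euclidean distance from input x to each model's prototype,
    # then softmax over the negated distances: closer prototypes get
    # higher similarity scores.
    d2 = [sum((xi - pi) ** 2 for xi, pi in zip(x, p)) for p in prototypes]
    raw = [math.exp(-v) for v in d2]
    total = sum(raw)
    return [r / total for r in raw]
```

The output plugs directly into the weight-conversion step: the similarity for each deep reinforcement learning model becomes the classification result Dᵢ that is converted into its weight coefficient.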
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
FIG. 4 is a block diagram of an apparatus 400 according to an exemplary embodiment. Referring to FIG. 4, the apparatus 400 includes a processor 401; the number of processors may be set to one or more as needed. The apparatus 400 also includes a memory 402 for storing instructions, such as an application program, executable by the processor 401; the number of memories may likewise be set to one or more as needed, and the memory may store one or more application programs. The processor 401 is configured to execute the instructions to perform the above-described method of combining deep reinforcement learning models.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 402 comprising instructions, executable by the processor 401 of the apparatus 400 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
A non-transitory computer-readable storage medium has instructions therein which, when executed by a processor of a mobile terminal, enable the mobile terminal to perform a method of combining deep reinforcement learning models, comprising:

determining weight information of each deep reinforcement learning model in a plurality of deep reinforcement learning models used in combination, wherein the weight information of a deep reinforcement learning model is used to characterize the degree of influence of the output data of that model on the output result obtained by using the plurality of deep reinforcement learning models in combination;

respectively transmitting the data to be processed to the plurality of deep reinforcement learning models used in combination to obtain a plurality of output data;

and performing a weighted average calculation on the plurality of output data according to the weight information of the corresponding deep reinforcement learning models, wherein the calculated result is the output result obtained by using the plurality of deep reinforcement learning models in combination.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, apparatus (device), or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code embodied in the medium. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, including, but not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer, and the like. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.
The present invention has been described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (devices) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that an article or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such article or apparatus. Without further limitation, an element defined by the phrase "comprising a…" does not exclude the presence of additional like elements in the article or device comprising the element.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, it is intended that the present invention cover the modifications and variations of this invention provided they come within the scope of the appended claims and their equivalents.

Claims (10)

1. A combination method of deep reinforcement learning models is characterized by comprising the following steps:
determining weight information of each deep reinforcement learning model in a plurality of deep reinforcement learning models used in combination, wherein the weight information of a deep reinforcement learning model is used to characterize the degree of influence of the output data of that model on the output result obtained by using the plurality of deep reinforcement learning models in combination;

respectively transmitting data to be processed to the plurality of deep reinforcement learning models used in combination to obtain a plurality of output data;

and performing a weighted average calculation on the plurality of output data according to the weight information of the corresponding deep reinforcement learning model, wherein the calculated result is the output result obtained by using the plurality of deep reinforcement learning models in combination.
2. The method according to claim 1, wherein the determining weight information of each of the plurality of deep reinforcement learning models used in combination comprises:
determining the similarity between preset input data and to-be-processed data of each deep reinforcement learning model;
respectively determining the weight information of each deep reinforcement learning model according to the similarity between preset input data of each deep reinforcement learning model and data to be processed;
the weight information of the deep reinforcement learning model is positively correlated with the similarity between the preset input data of the deep reinforcement learning model and the data to be processed.
3. The method according to claim 1 or 2, wherein the performing a weighted average calculation on the plurality of output data according to the weight information of the corresponding deep reinforcement learning model comprises:

calculating a weighted sum of the Q function values output by the plurality of deep reinforcement learning models used in combination according to the weight information of each deep reinforcement learning model;

and calculating the weighted average of the Q function values according to the weighted sum of the Q function values and the number of deep reinforcement learning models used in combination.
4. The method of claim 3, further comprising:
and generating a classification model according to the training data of each deep reinforcement learning model used in combination, wherein the classification model is used for determining the similarity between preset input data of different deep reinforcement learning models and the same input data.
5. An apparatus for combining deep reinforcement learning models, the apparatus comprising:
the weight information determining module is used for determining weight information of each deep reinforcement learning model in a plurality of deep reinforcement learning models used in combination, wherein the weight information of a deep reinforcement learning model is used to characterize the degree of influence of the output data of that model on the output result obtained by using the plurality of deep reinforcement learning models in combination;

the data transmission module is used for respectively transmitting the data to be processed to the plurality of deep reinforcement learning models used in combination to obtain a plurality of output data;

and the calculation module is used for performing a weighted average calculation on the plurality of output data according to the weight information of the corresponding deep reinforcement learning models, wherein the calculated result is the output result obtained by using the plurality of deep reinforcement learning models in combination.
6. The apparatus of claim 5, wherein the weight information determining module comprises:
the first weight information determining submodule is used for determining the similarity between preset input data and to-be-processed data of each deep reinforcement learning model;
the second weight information determining submodule is used for respectively determining the weight information of each deep reinforcement learning model according to the similarity between the preset input data of that deep reinforcement learning model and the data to be processed;
the weight information of the deep reinforcement learning model is positively correlated with the similarity between the preset input data of the deep reinforcement learning model and the data to be processed.
7. The apparatus of claim 5 or 6, wherein the computing module comprises:
the first calculation submodule is used for calculating a weighted sum of the Q function values output by the plurality of deep reinforcement learning models used in combination according to the weight information of each deep reinforcement learning model;

and the second calculation submodule is used for calculating the weighted average of the Q function values according to the weighted sum of the Q function values and the number of deep reinforcement learning models used in combination.
8. The apparatus of claim 7, further comprising:
the classification module is used for generating a classification model according to the training data of each deep reinforcement learning model used in combination, and the classification model is used for determining the similarity between preset input data of different deep reinforcement learning models and the same input data;
and the first weight information determining submodule determines the similarity between the preset input data of each deep reinforcement learning model and the data to be processed through the classification module.
9. A combination apparatus of deep reinforcement learning models, comprising:
a processor;
and a memory storing processor-executable instructions;
wherein the processor is configured to:
execute instructions to implement the method of combining deep reinforcement learning models according to any one of claims 1 to 4.
10. A computer-readable storage medium, on which a computer program is stored, which, when executed, carries out the steps of the method of combining deep reinforcement learning models according to any one of claims 1 to 4.
CN202010009647.6A 2020-01-06 2020-01-06 Deep reinforcement learning model combination method and device and computer equipment Pending CN111242280A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010009647.6A CN111242280A (en) 2020-01-06 2020-01-06 Deep reinforcement learning model combination method and device and computer equipment


Publications (1)

Publication Number Publication Date
CN111242280A true CN111242280A (en) 2020-06-05

Family

ID=70870828

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010009647.6A Pending CN111242280A (en) 2020-01-06 2020-01-06 Deep reinforcement learning model combination method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN111242280A (en)


Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090324060A1 (en) * 2008-06-30 2009-12-31 Canon Kabushiki Kaisha Learning apparatus for pattern detector, learning method and computer-readable storage medium
CN104766080A (en) * 2015-05-06 2015-07-08 苏州搜客信息技术有限公司 Image multi-class feature recognizing and pushing method based on electronic commerce
CN106709588A (en) * 2015-11-13 2017-05-24 日本电气株式会社 Prediction model construction method and equipment and real-time prediction method and equipment
CN109196527A (en) * 2016-04-13 2019-01-11 谷歌有限责任公司 Breadth and depth machine learning model
CN109829478A (en) * 2018-12-29 2019-05-31 平安科技(深圳)有限公司 One kind being based on the problem of variation self-encoding encoder classification method and device
CN110052031A (en) * 2019-04-11 2019-07-26 网易(杭州)网络有限公司 The imitation method, apparatus and readable storage medium storing program for executing of player


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112370258A (en) * 2020-11-13 2021-02-19 北京三角洲机器人科技有限公司 Electric mobile device
CN112370258B (en) * 2020-11-13 2022-08-09 安徽金百合医疗器械有限公司 Electric mobile device

Similar Documents

Publication Publication Date Title
CN109816221B (en) Project risk decision method, apparatus, computer device and storage medium
KR102641116B1 (en) Method and device to recognize image and method and device to train recognition model based on data augmentation
JP6844301B2 (en) Methods and data processors to generate time series data sets for predictive analytics
KR20190050141A (en) Method and apparatus for generating fixed point type neural network
CN116635866A (en) Method and system for mining minority class data samples to train a neural network
CN110942142B (en) Neural network training and face detection method, device, equipment and storage medium
US20190378009A1 (en) Method and electronic device for classifying an input
CN110780938A (en) Computing task unloading method based on differential evolution in mobile cloud environment
CN111950810B (en) Multi-variable time sequence prediction method and equipment based on self-evolution pre-training
CN110909878A (en) Training method and device of neural network model for estimating resource usage share
CN111242280A (en) Deep reinforcement learning model combination method and device and computer equipment
CN112925924A (en) Multimedia file recommendation method and device, electronic equipment and storage medium
CN116432780A (en) Model increment learning method, device, equipment and storage medium
Cao et al. Lstm network based traffic flow prediction for cellular networks
WO2023052827A1 (en) Processing a sequence of data items
CN112667394B (en) Computer resource utilization rate optimization method
GB2622756A (en) Training agent neural networks through open-ended learning
CN114528992A (en) Block chain-based e-commerce business analysis model training method
CN113747500A (en) High-energy-efficiency low-delay task unloading method based on generation countermeasure network in mobile edge computing environment
CN114692888A (en) System parameter processing method, device, equipment and storage medium
CN117036037B (en) Suspicious transaction risk analysis method and suspicious transaction risk analysis device
CN117707795B (en) Graph-based model partitioning side collaborative reasoning method and system
CN111427935B (en) Predicting and displaying method for quantized transaction index, electronic equipment and medium
CN111178443B (en) Model parameter selection, image classification and information identification methods, devices and equipment
CN115470910A (en) Automatic parameter adjusting method based on Bayesian optimization and K-center sampling

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20211103

Address after: 518000 409, Yuanhua complex building, 51 Liyuan Road, merchants street, Nanshan District, Shenzhen, Guangdong

Applicant after: Shenzhen zhuohe Technology Co.,Ltd.

Address before: 100083 no.2501-1, 25th floor, block D, Tsinghua Tongfang science and technology building, No.1 courtyard, Wangzhuang Road, Haidian District, Beijing

Applicant before: Beijing Zhuohe Technology Co.,Ltd.
