CN107832836B - Model-free deep reinforcement learning exploration method and device - Google Patents

Model-free deep reinforcement learning exploration method and device

Info

Publication number
CN107832836B
CN107832836B (application CN201711205687.2A)
Authority
CN
China
Prior art keywords
action
value
state
decision
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711205687.2A
Other languages
Chinese (zh)
Other versions
CN107832836A (en)
Inventor
季向阳
张子函
张宏昌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN201711205687.2A priority Critical patent/CN107832836B/en
Publication of CN107832836A publication Critical patent/CN107832836A/en
Application granted granted Critical
Publication of CN107832836B publication Critical patent/CN107832836B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure relates to a model-free deep reinforcement learning exploration method and device, wherein the method comprises the following steps: obtaining a characteristic value according to the sample; inputting the characteristic value into a deep reinforcement learning model for processing to obtain an action value; inputting the characteristic value into a counting model to obtain an action counting value; and determining a decision action according to the action value and the action counting value. By selecting among actions with different execution counts, the environment return values of all actions are obtained more comprehensively during the exploration process of deep reinforcement learning, which further improves exploration efficiency.

Description

Model-free deep reinforcement learning exploration method and device
Technical Field
The disclosure relates to the technical field of machine learning, in particular to a model-free deep reinforcement learning exploration method and device.
Background
Deep reinforcement learning is a new class of algorithms that combines deep learning and reinforcement learning to realize end-to-end learning from perception to action. Put simply, much like a human, the system takes perceptual information such as vision as input and directly outputs actions through a deep neural network, with no hand-crafted engineering in between. Deep reinforcement learning has the potential to enable a robot to learn one or more skills completely autonomously. Reinforcement learning is one approach to solving sequential decision problems. In recent years, deep reinforcement learning, using neural networks as the estimators of the algorithm, has achieved notable results in tasks based on image input. When making an action decision, the agent needs to decide which action to execute according to historical experience, so the core issues of solving the deep reinforcement learning problem with neural networks are how to compress the large volume of historical samples and how to obtain, during the training of the neural network, a training result that better matches the actual application scenario.
Disclosure of Invention
In view of this, the present disclosure provides a model-free deep reinforcement learning exploration method and device to address the problem of how a deep reinforcement learning exploration method can obtain training results that better match the actual application scenario.
According to an aspect of the present disclosure, there is provided a model-free deep reinforcement learning exploration method, the method including:
obtaining a characteristic value according to the sample;
inputting the characteristic value into a deep reinforcement learning model for processing to obtain an action value;
inputting the characteristic value into a counting model to obtain an action counting value;
and determining a decision action according to the action value and the action counting value.
In one possible implementation, the method further includes:
performing the decision-making action;
acquiring a return value returned by the environment;
determining an error value according to the return value and the decision action;
and adjusting parameters of the deep reinforcement learning model and the counting model by utilizing a back propagation algorithm according to the error value.
In one possible implementation, the method further includes: inputting the characteristic value into an auxiliary decision model for processing to obtain an auxiliary action value;
determining a decision action according to the action value and the action count value, further comprising: and determining a decision action according to the action value, the action count value and the auxiliary action value.
In a possible implementation manner, inputting the feature value into an assistant decision model for processing to obtain an assistant action value includes:
and the assistant decision-making model determines an assistant action value according to the characteristic value and the random return value.
In one possible implementation form of the method,
obtaining feature values from the sample, including:
carrying out convolution processing on the sample by utilizing a plurality of convolution kernels to obtain a plurality of convolution characteristics;
and splicing the obtained plurality of convolution characteristics to obtain the characteristic value.
In one possible implementation, the sample includes: a first state and an action of an environment, the first state comprising a state prior to execution of the action;
inputting the characteristic value into a counting model to obtain an action counting value includes:
the counting model extracts a first state and action of the sample according to the input characteristic value;
corresponding the first state and the action of the sample, and determining a state action pair;
searching the determined state action pairs in a state action pair set, and updating the access estimation times of the determined state action pairs, wherein the state action pair set comprises a plurality of state action pairs and a set formed by the access estimation times of each state action pair;
and determining the updated state action pair set as an action count value.
In one possible implementation, determining a decision action according to the action value and the action count value includes:
determining an adjustment value of the action value according to the access estimation times in the action counting value, wherein the more the access estimation times are, the smaller the determined adjustment value is;
and determining a decision action according to the action adjustment value and the action value.
According to another aspect of the present disclosure, there is provided a model-free deep reinforcement learning exploration apparatus, comprising:
the characteristic value acquisition module is used for acquiring a characteristic value according to the sample;
the deep reinforcement learning module is used for inputting the characteristic value into a deep reinforcement learning model for processing to obtain an action value;
the counting module is used for inputting the characteristic value into a counting model to obtain an action counting value;
and the decision action determining module is used for determining a decision action according to the action value and the action counting value.
In one possible implementation, the apparatus further includes:
an action execution module for executing the decision action;
the return value acquisition module is used for acquiring a return value returned by the environment;
an error value determining module, configured to determine an error value according to the return value and the decision action;
and the parameter adjusting module is used for adjusting the parameters of the deep reinforcement learning model, the counting model and the assistant decision model by utilizing a back propagation algorithm according to the error value.
In one possible implementation, the apparatus further includes:
the assistant decision module is used for inputting the characteristic value into an assistant decision model for processing to obtain an assistant action value;
the decision action determining module further comprises:
and the assistant decision sub-module is used for determining a decision action according to the action value, the action count value and the assistant action value.
In one possible implementation, the assistant decision module includes:
and the auxiliary action value submodule is used for determining an auxiliary action value according to the characteristic value and the random return value.
In a possible implementation manner, the feature value obtaining module includes:
the convolution processing submodule is used for carrying out convolution processing on the sample by utilizing a plurality of convolution kernels to obtain a plurality of convolution characteristics;
and the characteristic value acquisition submodule is used for splicing the obtained plurality of convolution characteristics to acquire the characteristic value.
In one possible implementation, the sample includes: a first state and an action of an environment, the first state comprising a state prior to execution of the action;
the counting model module comprises:
the state action extraction submodule is used for extracting a first state and an action of the sample according to the input characteristic value;
the state action pair determining submodule is used for corresponding the first state and the action of the sample and determining a state action pair;
the access frequency estimation submodule is used for searching the determined state action pairs in the state action pair set and updating the access estimation frequency of the determined state action pairs, wherein the state action pair set comprises a plurality of state action pairs and a set consisting of the access estimation frequency of each state action pair;
and the action count value determining submodule is used for determining the updated state action pair set as an action count value.
In one possible implementation, the decision action determining module includes:
the adjustment value determining submodule is used for determining an adjustment value of the action value according to the access estimation times in the action counting value, wherein the more the access estimation times are, the smaller the determined adjustment value is;
and the decision action determining submodule is used for determining a decision action according to the action adjusting value and the action value.
According to another aspect of the present disclosure, there is provided a model-free deep reinforcement learning exploration apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to perform the model-free deep reinforcement learning exploration method described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the model-free deep reinforcement learning exploration method described above.
According to the method and the device, the number of times each action has been executed in each state is recorded by a counting model, and in the process of determining the decision action, actions that have been executed fewer times are preferred. By selecting among actions with different execution counts, the exploration benefit of executing each action under the current conditions can be obtained more comprehensively during the exploration process of deep reinforcement learning, which further improves exploration efficiency.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 illustrates a flow diagram of a model-free deep reinforcement learning exploration method, according to an embodiment of the present disclosure;
FIG. 2 illustrates a flow diagram of a model-free deep reinforcement learning exploration method, according to an embodiment of the present disclosure;
FIG. 3 illustrates a flow diagram of a model-free deep reinforcement learning exploration method, according to an embodiment of the present disclosure;
FIG. 4 illustrates a flow diagram of a model-free deep reinforcement learning exploration method, according to an embodiment of the present disclosure;
FIG. 5 illustrates a flow diagram of a model-free deep reinforcement learning exploration method, according to an embodiment of the present disclosure;
FIG. 6 illustrates a flow diagram of a model-free deep reinforcement learning exploration method, according to an embodiment of the present disclosure;
FIG. 7 illustrates a flow diagram of a model-free deep reinforcement learning exploration method, according to an embodiment of the present disclosure;
FIG. 8 illustrates a flowchart of extracting sample feature values in a model-free deep reinforcement learning exploration method according to an embodiment of the present disclosure;
FIG. 9 shows a block diagram of a model-free deep reinforcement learning exploration apparatus, according to an embodiment of the present disclosure;
FIG. 10 shows a block diagram of a model-free deep reinforcement learning exploration apparatus, according to an embodiment of the present disclosure;
FIG. 11 shows a block diagram of a model-free deep reinforcement learning exploration apparatus, according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
In the field of artificial intelligence, an agent is generally used to represent an object with behavioral capabilities, such as a robot, an unmanned vehicle, or a person. The problem considered by reinforcement learning is the task of interaction between the Agent and the environment. For example, when a robotic arm needs to pick up a mobile phone, the environment is the objects around the arm; the arm senses the environment through an external camera and then needs to output an action to pick up the phone. Likewise, in a racing game, the player only sees the screen, which is the environment, and then outputs an action (a keyboard operation) to control the movement of the car. This interaction involves a series of Actions, Observations, and Reward values. The Reward means that after the Agent executes an action and interacts with the environment, the environment changes, and the quality of that change is represented by the Reward. In the examples above, if the robotic arm gets closer to the mobile phone, the Reward should be positive; in the racing game, if the car drifts further off the track, the Reward is negative. An Observation is used rather than the full environment state because the Agent does not necessarily obtain all the information of the environment; for example, the camera on the robotic arm can only capture a picture from a certain angle. Therefore, the perceptual information acquired by an Agent can only be represented by Observations. A DQN (Deep Q-Network) is typically used to determine the final decision action from recognized images. The DQN comprises a Q network and a target network; the target network is updated by training the Q network, and the final decision action is determined according to the Q value. In the process of determining the decision action, a method that randomly selects the next action in proportion to the estimated return values is called an exploration method, and a method that directly selects the action with the highest Q value is called an exploitation (utilization) method.
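For illustration only, the following Python sketch contrasts the two action-selection rules mentioned above. The Q values, the softmax form of the proportional sampling, and the temperature parameter are illustrative assumptions rather than part of the disclosed method.

```python
import numpy as np

def explore(q_values, temperature=1.0):
    """Exploration: sample an action with probability proportional to its estimated value."""
    # A softmax turns the Q values into a probability distribution (illustrative choice).
    prefs = np.exp((q_values - np.max(q_values)) / temperature)
    probs = prefs / prefs.sum()
    return int(np.random.choice(len(q_values), p=probs))

def exploit(q_values):
    """Exploitation ("utilization"): directly pick the action with the highest Q value."""
    return int(np.argmax(q_values))

q = np.array([0.2, 1.5, 0.7])      # hypothetical Q values for three actions
print(explore(q), exploit(q))
```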
Fig. 1 shows a flowchart of a model-free deep reinforcement learning exploration method according to an embodiment of the present disclosure. As shown in Fig. 1, the method includes the following steps:
in step S10, a feature value is obtained from the sample.
In a possible implementation manner, in the training process of the DQN network, the samples to be processed, such as game images, are first preprocessed, for example by graying and downsampling, and the preprocessed game images are then input into an image processing model for feature extraction; a convolutional neural network may be used as the image processing model. Fig. 8 is a flowchart illustrating the extraction of sample feature values in the model-free deep reinforcement learning exploration method according to an embodiment of the present disclosure. As shown in Fig. 8, the feature value is obtained after inputting the preprocessed four consecutive frames of game images P1, P2, P3, and P4 into the convolutional neural network model for feature extraction.
For example, taking a racing game again as an example, after a game screen of the racing game is preprocessed, the preprocessed screen is input into a convolutional neural network to extract a feature value. The game screen at each moment is a state, and the operation applied to the racing car is a decision action.
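For illustration only, the Python sketch below shows one way the preprocessing and frame stacking described above could look. The frame size, the 2x downsampling factor, and the use of four stacked frames as the network input are assumptions based on the example, not a prescribed implementation.

```python
import numpy as np

def preprocess(frame_rgb):
    """Gray and downsample one game frame (an RGB uint8 array is assumed)."""
    gray = frame_rgb.mean(axis=2)            # simple graying
    small = gray[::2, ::2]                   # naive 2x downsampling
    return small.astype(np.float32) / 255.0

def stack_frames(frames):
    """Stack four consecutive preprocessed frames P1..P4 into one network input."""
    return np.stack([preprocess(f) for f in frames], axis=0)   # shape (4, H, W)

# Hypothetical 210x160 game frames, as in common Atari-style setups.
frames = [np.random.randint(0, 256, (210, 160, 3), dtype=np.uint8) for _ in range(4)]
sample = stack_frames(frames)    # this is the sample fed to the feature extractor in step S10
print(sample.shape)
```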
And step S20, inputting the characteristic value into a deep reinforcement learning model for processing to obtain an action value.
In one possible implementation, the feature value is input to a DQN network, which extracts the state from the feature value and outputs an action value (Q value). The Q value is the value of an action in a certain state, i.e., a function of the state and the action, and the output includes a Q value for each action.
For example, after the deep reinforcement learning model processes the feature value, the values of the possible car-operating actions are obtained. According to state 1 of the game image in the feature value, action 1 (steering the car to the upper left) and action 2 (steering the car to the upper right) have the highest value in state 1 (the car stays on the track and drives in the correct direction, leading to a lead in the race); action 3 (steering the car to the lower left) and action 4 (steering the car to the lower right) have lower value (the car stays on the track but drives in the opposite direction, leading to a lag in the race); action 5 (steering the car to the left) and action 6 (steering the car to the right) have the lowest value (the car leaves the track, leading to a loss).
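As an illustration of step S20, the sketch below shows a deep reinforcement learning model head that maps a feature value to one action value per action. The feature dimension, layer sizes, and the six-action output are assumptions taken from the racing example, not specified by the disclosure.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Illustrative deep reinforcement learning model: feature value in, one action value (Q) per action out."""
    def __init__(self, feature_dim=512, num_actions=6):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feature_dim, 256), nn.ReLU(),
            nn.Linear(256, num_actions),    # one Q value per action (e.g. the six driving actions)
        )

    def forward(self, feature_value):
        return self.head(feature_value)

q_net = QNetwork()
feature_value = torch.randn(1, 512)        # placeholder feature value from step S10
action_values = q_net(feature_value)       # shape (1, 6): the Q value of each action in this state
```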
And step S30, inputting the characteristic value into a counting model to obtain an action counting value.
In a possible implementation manner, the counting model extracts and records states and actions in the feature values, and acquires the execution times of each action in each state in the training process.
For example, according to the record of the counting model, in the state 1, the number of times of execution of the action 1 is a, the number of times of execution of the action 2 is B, the number of times of execution of the action 3 is C, the number of times of execution of the action 4 is D, the number of times of execution of the action 5 is E, the number of times of execution of the action 6 is F, and a > B > C > D > E > F.
And step S40, determining a decision action according to the action value and the action count value.
In a possible implementation manner, in order to meet different requirements, the magnitude of the action value and the magnitude of the action count value are combined in different ways to determine different decision actions. This includes selecting an action with a high action value and a small action count value as the decision action, or selecting an action with a low action value and a large action count value as the decision action; it further includes setting separate threshold ranges for the action value and the action count value and preferably selecting an action that falls within both threshold ranges as the decision action, which is not limited by the present disclosure.
For example, in the above-described racing game, action 1 and action 2 are preferred according to the action value, and action 2 is preferred according to the action count value. Finally, action 2 is the final decision action.
In this embodiment, the number of times each action has been executed in each state is recorded by a counting model, and in the process of determining the decision action, actions that have been executed fewer times are preferred. By selecting among actions with different execution counts, the exploration benefit of executing each action under the current conditions can be obtained more comprehensively during the exploration process of deep reinforcement learning, which further improves exploration efficiency.
Fig. 2 is a flowchart of a model-free deep reinforcement learning exploration method according to an embodiment of the present disclosure, and as shown in fig. 2, on the basis of the above embodiment, the method further includes:
step S50, the decision action is executed.
And step S60, acquiring the return value returned by the environment.
In one possible implementation, after the decision-making action is executed, the state of the environment changes, and the environment gives a return value of the decision-making action.
For example, in a racing game, the decision-making action is action 2 of operating the car to the upper right, and after action 2 is performed, the game gives a positive return value: the player is awarded points.
In one possible implementation, the reward value given by the environment after the decision-making action is performed may also be a negative reward.
Step S70, determining an error value according to the return value and the decision action.
In a possible implementation manner, the actual action value of the decision-making action is obtained according to the return value given by the environment, and the error value can be determined by comparing the actual action value of the decision-making action with the action value of the decision-making action.
For example, in the racing game, the decision action is action 2, the calculated action value of action 2 is A, the environment return value after action 2 is executed is Z, and the difference between A and Z is the error value.
And step S80, adjusting the parameters of the deep reinforcement learning model and the counting model by using a back propagation algorithm according to the error value.
In one possible implementation, parameters of the deep reinforcement learning model and the counting model are adjusted using a back propagation algorithm based on the determined error value. And performing iterative computation of next model-free deep reinforcement learning exploration by using the adjusted deep reinforcement learning model and the counting model.
In this embodiment, after the decision action is executed and the return value from the environment is obtained, an error value is calculated and used to adjust the parameters of the deep reinforcement learning model and the counting model; the adjusted models are then used in the next iteration of model-free deep reinforcement learning exploration. Adjusting the parameters of the deep reinforcement learning model and the counting model according to the return value given by the environment provides more accurate parameters for the next iteration, so that the exploration process of deep reinforcement learning better matches the actual operating environment.
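A hedged sketch of steps S50 to S80, under the assumption that the error value is the squared difference between the predicted action value of the executed decision action and the environment return value; the QNetwork from the earlier sketch and the optimizer choice are assumptions, not specified by the disclosure.

```python
import torch
import torch.nn.functional as F

def update(q_net, optimizer, feature_value, decision_action, return_value):
    """One illustrative parameter update: compute the error value from the return value
    and the decision action, then back-propagate to adjust the model parameters."""
    predicted = q_net(feature_value)[0, decision_action]   # action value A of the executed decision action
    target = torch.tensor(float(return_value))             # return value Z given by the environment
    error = F.mse_loss(predicted, target)                   # error value derived from A and Z (assumed squared error)
    optimizer.zero_grad()
    error.backward()          # back propagation algorithm
    optimizer.step()          # adjusted parameters are used in the next exploration iteration
    return error.item()

# Usage, assuming q_net and feature_value come from the earlier sketch:
# optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
# loss = update(q_net, optimizer, feature_value, decision_action=2, return_value=1.0)
```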
Fig. 3 is a flowchart of a model-free deep reinforcement learning exploration method according to an embodiment of the present disclosure, where the method shown in fig. 3 further includes the following steps based on the above embodiment:
and step S50, inputting the characteristic value into an auxiliary decision model for processing to obtain an auxiliary action value.
Step S40, determining a decision action according to the action value and the action count value, further comprising: step S41, determining a decision action according to the action value, the action count value and the auxiliary action value.
In one possible implementation, to better describe the assistant decision model, the deep reinforcement learning model and the counting model in the above example are collectively referred to as the main network. The assistant decision model has the same structure as the main network but different parameters, so that, in order to ensure the stability of the action value, the assistant decision model can provide an auxiliary action value different from that of the main network. A fixed environment return value is provided in the assistant decision model, which ensures that its action value converges to a constant. In this embodiment, the difference between the auxiliary action value and the action value is calculated, and the calculated difference is incorporated into the decision-making process.
For example, in the racing game, the auxiliary action value of each action is obtained from the return value given by the assistant decision model, for example the auxiliary action value of action 1 is A', the auxiliary action value of action 2 is B', and so on. The action value, the auxiliary action value and the action count value of each action are combined to obtain the final decision action.
In this embodiment, in order to ensure the convergence of the action value, an auxiliary decision model is introduced. The auxiliary action value is obtained through the auxiliary decision model, the difference between the auxiliary action value and the action value is calculated, and the calculated difference is incorporated into the decision action determining process, which drives the exploration process of deep reinforcement learning and makes it converge better.
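A speculative sketch of how the auxiliary action value could be folded into the decision. The absolute-difference term and the weight are assumptions; the disclosure only states that the difference between the auxiliary and main action values is incorporated into the decision process.

```python
import torch

def decide_with_aux(main_action_values, aux_action_values, count_adjustment, weight=0.1):
    """Combine action value, auxiliary action value and count adjustment into one score
    (the combination rule shown here is an illustrative assumption)."""
    score = (main_action_values
             + weight * (aux_action_values - main_action_values).abs()
             + count_adjustment)
    return int(score.argmax(dim=1))

# Example with dummy tensors for a six-action state:
main_q = torch.tensor([[1.0, 1.2, 0.8, 0.5, 0.1, 0.0]])
aux_q = torch.tensor([[0.9, 1.5, 0.7, 0.6, 0.2, 0.1]])
bonus = torch.tensor([[0.01, 0.30, 0.05, 0.02, 0.10, 0.40]])
print(decide_with_aux(main_q, aux_q, bonus))
```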
Fig. 4 is a flowchart illustrating a model-free deep reinforcement learning exploration method according to an embodiment of the present disclosure, and as shown in fig. 4, on the basis of the above embodiment, step S50 includes:
and step S51, the assistant decision model determines an assistant action value according to the characteristic value and the random return value.
In a possible implementation manner, in the assistant decision model, a random return value is set for each action in each state instead of obtaining a return value from the environment, and the assistant decision model determines the auxiliary action value of each action in each state according to the random return value and the feature value extracted from the sample, wherein the expectation of the random return is a preset fixed value and the distribution of the random return can be chosen from various options.
In this embodiment, by setting a random return value for the assistant decision network, completely different error values are obtained, which further drives the exploration process of deep reinforcement learning.
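A minimal sketch of the random return described above, assuming a Gaussian distribution around a preset fixed expectation; the distribution and its scale are free choices, as the text notes, and are not fixed by the disclosure.

```python
import numpy as np

FIXED_EXPECTATION = 0.0   # preset fixed expectation of the random return (assumed value)

def random_return(scale=1.0):
    """Random return value used to train the assistant decision model; a Gaussian is
    assumed here purely for illustration."""
    return float(np.random.normal(loc=FIXED_EXPECTATION, scale=scale))
```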
Fig. 5 is a flowchart illustrating a model-free deep reinforcement learning exploration method according to an embodiment of the present disclosure, where, in the method illustrated in fig. 5, step S10 includes the following steps:
step S11, performing convolution processing on the sample using the plurality of convolution kernels to obtain a plurality of convolution characteristics.
And step S12, splicing the obtained plurality of convolution characteristics to obtain the characteristic value.
In one possible implementation, when the convolutional neural network is used to perform convolution processing on a sample, a plurality of convolution kernels are set to obtain a plurality of convolution characteristics. After the plurality of convolution characteristics are spliced, characteristic values are obtained, and therefore compression of a state space is achieved.
In this embodiment, a plurality of convolution kernels are used to perform convolution processing on the sample to obtain a plurality of convolution characteristics, and the convolution characteristics are spliced to obtain the characteristic value. The state space is thereby compressed while retaining the sample characteristics to the greatest extent, which improves the computational efficiency of the exploration process while guaranteeing the accuracy of the exploration result.
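An illustrative sketch of steps S11 and S12, assuming three convolution kernels of different sizes whose outputs are flattened and spliced into a single feature value; the kernel sizes, strides, and channel counts are assumptions, not specified by the disclosure.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Several convolution kernels produce several convolution characteristics,
    which are spliced (concatenated) into one feature value."""
    def __init__(self):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(4, 8, kernel_size=k, stride=4) for k in (3, 5, 7)   # assumed kernel sizes
        ])

    def forward(self, stacked_frames):                    # (batch, 4, H, W)
        feats = [b(stacked_frames).flatten(1) for b in self.branches]
        return torch.cat(feats, dim=1)                    # spliced feature value

extractor = FeatureExtractor()
stacked = torch.randn(1, 4, 105, 80)     # four preprocessed frames, as in Fig. 8
feature_value = extractor(stacked)
```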
Fig. 6 shows a flowchart of a model-free deep reinforcement learning exploration method according to an embodiment of the present disclosure, in the method shown in fig. 6, the samples include: a first state of an environment and an action, the first state comprising a state prior to execution of the action.
In a possible implementation manner, the sample includes a first state, a second state, an action, and a report value, where the action is a decision-making action to be executed, the first state is a state before the action is executed, the second state is a state after the action is executed, and the report value is a report value given by an environment after the action is executed.
On the basis of the above embodiment, step S30 includes:
in step S31, the counting model extracts a first state and an action of the sample according to the input feature value.
Step S32, the first state and the action of the sample are associated with each other, and a state action pair is determined.
Step S33, finding the determined state action pair from the state action pair set, and updating the access estimation times of the determined state action pair, where the state action pair set includes a plurality of state action pairs and a set of access estimation times of each state action pair.
In step S34, the updated state action pairs are collected and determined as an action count value.
In a possible implementation manner, the counting model includes a state action pair composed of each action in each state and the access estimation times of each state action pair, and the state in the state action pair is the state before the action in the state action pair is executed. And updating the access estimation times of the corresponding state action pair in the counting model according to the first state and action in the sample. The action count value given by the count model is a set of access estimation times of each action in each state.
For example, in the racing game described above, there are a plurality of states, such as state 1, state 2, state 3, etc., each of which includes a plurality of actions, such as action 1, action 2, action 3, etc. Then a plurality of state action pairs, such as state 1-action 1, state 1-action 2, state 1-action 3, state 2-action 1, state 2-action 2, state 2-action 3, state 3-action 1, state 3-action 2, and state 3-action 3, are recorded in the counting model, and the number of times of access estimation of each state action pair is recorded. And determining that the state action pair is the state 1-action 2 according to the characteristic values in the samples, and updating the access estimation times of the state action pair of the state 1-action 2.
In this embodiment, the access estimation times of the state action pairs in the counting model are updated according to the first state and action in the sample feature, and the updated state action pair set is determined as the action count value. In deep reinforcement learning exploration, the estimated access times of the state action pairs allow the decision action to be determined according to how many times each action has been executed, which improves exploration efficiency.
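An illustrative sketch of the counting model in steps S31 to S34. The coarse discretisation used here to turn a feature value into a state key is an assumption; the disclosure only requires that the first state and the action be recoverable from the input feature value.

```python
from collections import defaultdict
import numpy as np

class CountingModel:
    """Keeps the set of state-action pairs and the estimated number of accesses of each pair."""
    def __init__(self):
        self.visit_counts = defaultdict(int)    # (state_key, action) -> access estimation times

    def state_key(self, feature_value):
        # Coarse discretisation so that similar feature values share a key (assumed scheme).
        return tuple((np.asarray(feature_value) * 10).astype(int).flatten()[:16])

    def update(self, feature_value, action):
        pair = (self.state_key(feature_value), action)   # the determined state action pair
        self.visit_counts[pair] += 1                     # update its access estimation times
        return dict(self.visit_counts)                   # action count value: the updated pair set

    def count(self, feature_value, action):
        return self.visit_counts[(self.state_key(feature_value), action)]

counter = CountingModel()
feat = np.random.rand(512)                # placeholder feature value
counts = counter.update(feat, action=2)   # e.g. state 1 - action 2 visited once more
```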
Fig. 7 is a flowchart illustrating a model-free deep reinforcement learning exploration method according to an embodiment of the present disclosure, and as shown in fig. 7, on the basis of the foregoing embodiment, step S40 includes:
and step S41, determining an adjustment value of the action value according to the access estimation times in the action counting value, wherein the more the access estimation times, the smaller the determined adjustment value.
And step S42, determining a decision action according to the action adjustment value and the action value.
In a possible implementation manner, the action adjustment value is determined according to the access estimation times of the state action pairs in the counting model: after sorting the access estimation times of the state action pairs, actions with fewer estimated accesses are preferred, and actions with high action values are preferred as the decision action.
The action value of each action is then combined with the corresponding adjustment value of that action to determine the decision action.
In this embodiment, by estimating the number of times of access of the state action pair, in the determination process of the decision action, an action with a small number of times of access estimation is preferentially selected, and the exploration efficiency can be improved.
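A minimal sketch of steps S41 and S42, assuming a 1/sqrt(count) form for the adjustment value so that more estimated accesses yield a smaller adjustment; the bonus form and the coefficient beta are illustrative assumptions.

```python
import numpy as np

def decision_action(action_values, visit_counts, beta=1.0):
    """The adjustment value shrinks as the access estimation times grow, and the decision
    action maximises action value plus adjustment value."""
    counts = np.asarray(visit_counts, dtype=np.float64)
    adjustment = beta / np.sqrt(counts + 1.0)        # more estimated accesses -> smaller adjustment
    return int(np.argmax(np.asarray(action_values) + adjustment))

# Example: action 2 has a slightly lower value but far fewer visits, so it is selected.
print(decision_action([1.0, 0.9, 0.95], [50, 10, 3]))
```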
Fig. 9 is a block diagram of an apparatus for model-free deep reinforcement learning exploration according to an embodiment of the present disclosure, and as shown in fig. 9, the apparatus provided in this embodiment includes:
and a feature value obtaining module 41, configured to obtain a feature value according to the sample.
And the deep reinforcement learning module 42 is configured to input the feature value into a deep reinforcement learning model for processing, so as to obtain an action value.
And the counting module 43 is configured to input the feature value into a counting model to obtain an action count value.
A decision action determining module 44, configured to determine a decision action according to the action value and the action count value.
Fig. 10 is a block diagram of a model-free deep reinforcement learning exploration apparatus according to an embodiment of the present disclosure, and as shown in fig. 10, on the basis of the embodiment shown in fig. 9, the apparatus further includes:
and an action executing module 45, configured to execute the decision-making action.
And a return value acquiring module 46, configured to acquire a return value returned by the environment.
And an error value determining module 47, configured to determine an error value according to the return value and the decision action.
And a parameter adjusting module 48, configured to adjust parameters of the deep reinforcement learning model, the counting model, and the assistant decision model according to the error value by using a back propagation algorithm.
In one possible implementation, the apparatus further includes:
and the assistant decision module 49 is used for inputting the characteristic value into an assistant decision model for processing to obtain an assistant action value.
The decision action determining module 44 further includes:
and the assistant decision sub-module 443 is configured to determine a decision action according to the action value, the action count value and the assistant action value.
In a possible implementation manner, the assistant decision module 49 includes:
the auxiliary action value sub-module 491 is configured to determine an auxiliary action value according to the feature value and the random return value.
In a possible implementation manner, the feature value obtaining module 41 includes:
the convolution processing submodule 411 is configured to perform convolution processing on the sample by using a plurality of convolution kernels to obtain a plurality of convolution characteristics;
and the eigenvalue obtaining submodule 412 is configured to splice the obtained multiple convolution characteristics to obtain the eigenvalues.
In one possible implementation, the sample includes: a first state and an action of an environment, the first state comprising a state prior to execution of the action;
the count model module 43 includes:
a state action extraction submodule 431, configured to extract a first state and an action of the sample according to the input feature value;
a state action pair determining submodule 432, configured to correspond the first state and the action of the sample, and determine a state action pair;
the access frequency estimation submodule 433 is configured to search the determined state action pairs in a state action pair set, and update the access estimation frequency of the determined state action pairs, where the state action pair set includes a plurality of state action pairs and a set of access estimation frequencies of each state action pair;
the action count value determination submodule 434 is configured to determine the updated state action pair set as an action count value.
In one possible implementation, the decision-making action determining module 44 includes:
an adjustment value determining submodule 441, configured to determine an adjustment value of the action value according to the access estimation times in the action count value, where the greater the access estimation times, the smaller the determined adjustment value is;
the decision-making action determining sub-module 442 is configured to determine a decision-making action according to the action adjustment value and the action value.
FIG. 11 is a block diagram illustrating an apparatus 800 for model-free deep reinforcement learning exploration, according to an exemplary embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 11, the apparatus 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the apparatus 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power components 806 provide power to the various components of device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed status of the device 800 and the relative positioning of components, such as the display and keypad of the device 800; the sensor assembly 814 may also detect a change in the position of the device 800 or a component of the device 800, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and a change in the temperature of the device 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communications between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the device 800 to perform the above-described methods.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen in order to best explain the principles of the embodiments, the practical application, or technical improvements to the techniques in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (14)

1. A model-free deep reinforcement learning exploration method, the method comprising:
obtaining a characteristic value according to the sample;
inputting the characteristic value into a deep reinforcement learning model for processing to obtain an action value;
inputting the characteristic value into a counting model to obtain an action counting value;
determining a decision action based on the action value and the action count value,
wherein the method further comprises: inputting the characteristic value into an auxiliary decision model for processing to obtain an auxiliary action value;
determining a decision action according to the action value and the action count value, further comprising: and determining a decision action according to the action value, the action count value and the auxiliary action value.
2. The method of claim 1, further comprising:
performing the decision-making action;
acquiring a return value returned by the environment;
determining an error value according to the return value and the decision action;
and adjusting parameters of the deep reinforcement learning model and the counting model by utilizing a back propagation algorithm according to the error value.
3. The method of claim 1, wherein inputting the feature values into an assistant decision model for processing to obtain an assistant action value comprises:
and the assistant decision-making model determines an assistant action value according to the characteristic value and the random return value.
4. The method of claim 1, wherein obtaining feature values from the samples comprises:
carrying out convolution processing on the sample by utilizing a plurality of convolution kernels to obtain a plurality of convolution characteristics;
and splicing the obtained plurality of convolution characteristics to obtain the characteristic value.
5. The method of claim 1, wherein the samples comprise: a first state and an action of an environment, the first state comprising a state prior to execution of the action;
inputting the characteristic value into a counting model to obtain an action counting value comprises:
the counting model extracts a first state and action of the sample according to the input characteristic value;
corresponding the first state and the action of the sample, and determining a state action pair;
searching the determined state action pairs in a state action pair set, and updating the access estimation times of the determined state action pairs, wherein the state action pair set comprises a plurality of state action pairs and a set formed by the access estimation times of each state action pair;
and determining the updated state action pair set as an action count value.
6. The method of claim 5, wherein determining a decision action based on the action value and the action count value comprises:
determining an adjustment value of the action value according to the access estimation times in the action counting value, wherein the more the access estimation times are, the smaller the determined adjustment value is;
and determining a decision action according to the adjustment value of the action value and the action value.
7. A model-free deep reinforcement learning exploration device, comprising:
the characteristic value acquisition module is used for acquiring a characteristic value according to the sample;
the deep reinforcement learning module is used for inputting the characteristic value into a deep reinforcement learning model for processing to obtain an action value;
the counting module is used for inputting the characteristic value into a counting model to obtain an action counting value;
a decision action determination module for determining a decision action based on the action value and the action count value,
wherein the apparatus further comprises: the assistant decision module is used for inputting the characteristic value into an assistant decision model for processing to obtain an assistant action value;
the decision action determining module further comprises: and the assistant decision sub-module is used for determining a decision action according to the action value, the action count value and the assistant action value.
8. The apparatus of claim 7, further comprising:
an action execution module for executing the decision action;
the return value acquisition module is used for acquiring a return value returned by the environment;
an error value determining module, configured to determine an error value according to the return value and the decision action;
and the parameter adjusting module is used for adjusting the parameters of the deep reinforcement learning model, the counting model and the assistant decision model by utilizing a back propagation algorithm according to the error value.
9. The apparatus of claim 7, wherein the aid decision module comprises:
and the auxiliary action value submodule is used for determining an auxiliary action value according to the characteristic value and the random return value.
10. The apparatus of claim 7, wherein the eigenvalue acquisition module comprises:
the convolution processing submodule is used for carrying out convolution processing on the sample by utilizing a plurality of convolution kernels to obtain a plurality of convolution characteristics;
and the characteristic value acquisition submodule is used for splicing the obtained plurality of convolution characteristics to acquire the characteristic value.
11. The apparatus of claim 7, wherein the sample comprises: a first state and an action of an environment, the first state comprising a state prior to execution of the action;
the counting model module comprises:
the state action extraction submodule is used for extracting a first state and an action of the sample according to the input characteristic value;
the state action pair determining submodule is used for corresponding the first state and the action of the sample and determining a state action pair;
the access frequency estimation submodule is used for searching the determined state action pairs in the state action pair set and updating the access estimation frequency of the determined state action pairs, wherein the state action pair set comprises a plurality of state action pairs and a set consisting of the access estimation frequency of each state action pair;
and the action count value determining submodule is used for determining the updated state action pair set as an action count value.
12. The apparatus of claim 11, wherein the decision action determining module comprises:
the adjustment value determining submodule is used for determining an adjustment value of the action value according to the access estimation times in the action counting value, wherein the more the access estimation times are, the smaller the determined adjustment value is;
and the decision action determining submodule is used for determining a decision action according to the adjustment value of the action value and the action value.
13. A model-free deep reinforcement learning exploration device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the method of any one of claims 1 to 6.
14. A non-transitory computer-readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method of any one of claims 1 to 6.
CN201711205687.2A 2017-11-27 2017-11-27 Model-free deep reinforcement learning exploration method and device Active CN107832836B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711205687.2A CN107832836B (en) 2017-11-27 2017-11-27 Model-free deep reinforcement learning exploration method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711205687.2A CN107832836B (en) 2017-11-27 2017-11-27 Model-free deep reinforcement learning exploration method and device

Publications (2)

Publication Number Publication Date
CN107832836A (en) 2018-03-23
CN107832836B (en) 2020-04-21

Family

ID=61645783

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711205687.2A Active CN107832836B (en) 2017-11-27 2017-11-27 Model-free deep reinforcement learning exploration method and device

Country Status (1)

Country Link
CN (1) CN107832836B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110314379B (en) * 2018-03-29 2022-07-26 腾讯科技(深圳)有限公司 Learning method of action output deep training model and related equipment
CN108635861B (en) * 2018-05-18 2022-04-22 腾讯科技(深圳)有限公司 Method, device and equipment for controlling vehicle in application and storage medium
CN108920805B (en) * 2018-06-25 2022-04-05 大连大学 Driver behavior modeling system with state feature extraction function
CN109195135B (en) * 2018-08-06 2021-03-26 同济大学 Base station selection method based on deep reinforcement learning in LTE-V
CN109107161B (en) * 2018-08-17 2019-12-27 深圳市腾讯网络信息技术有限公司 Game object control method, device, medium and equipment
CN109116854B (en) * 2018-09-16 2021-03-12 南京大学 Multi-group robot cooperation control method and system based on reinforcement learning
CN109529352B (en) * 2018-11-27 2023-03-28 腾讯科技(深圳)有限公司 Method, device and equipment for evaluating scheduling policy in virtual environment
CN109621431B (en) * 2018-11-30 2022-06-14 网易(杭州)网络有限公司 Game action processing method and device
CN111459151B (en) * 2019-01-02 2023-10-17 北京地平线信息技术有限公司 Method, device, electronic equipment and medium for adjusting exploration rate of decision network
CN110110847B (en) * 2019-04-30 2020-02-07 吉林大学 Target positioning method for deep accelerated reinforcement learning based on attention
CN111746728B (en) * 2020-06-17 2022-06-24 重庆大学 Novel overwater cleaning robot based on reinforcement learning and control method
CN113400307B (en) * 2021-06-16 2022-10-18 清华大学 Control method of space robot mechanical arm

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106600000A (en) * 2016-12-05 2017-04-26 中国科学院计算技术研究所 Method and system for human-robot motion data mapping
CN106842925A (en) * 2017-01-20 2017-06-13 清华大学 Intelligent locomotive driving method and system based on deep reinforcement learning
CN106910351A (en) * 2017-04-19 2017-06-30 大连理工大学 Traffic signal adaptive control method based on deep reinforcement learning
CN107168303A (en) * 2017-03-16 2017-09-15 中国科学院深圳先进技术研究院 Automatic driving method and device for an automobile
CN107357757A (en) * 2017-06-29 2017-11-17 成都考拉悠然科技有限公司 Automatic algebra word problem solver based on deep reinforcement learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3872715A1 (en) * 2015-11-12 2021-09-01 Deepmind Technologies Limited Asynchronous deep reinforcement learning

Also Published As

Publication number Publication date
CN107832836A (en) 2018-03-23

Similar Documents

Publication Publication Date Title
CN107832836B (en) Model-free deep reinforcement learning exploration method and device
CN110516745B (en) Training method and device of image recognition model and electronic equipment
CN110084775B (en) Image processing method and device, electronic equipment and storage medium
CN109670397B (en) Method and device for detecting key points of human skeleton, electronic equipment and storage medium
CN109658352B (en) Image information optimization method and device, electronic equipment and storage medium
CN109871896B (en) Data classification method and device, electronic equipment and storage medium
CN109829863B (en) Image processing method and device, electronic equipment and storage medium
CN110782468B (en) Training method and device of image segmentation model and image segmentation method and device
CN110837761B (en) Multi-model knowledge distillation method and device, electronic equipment and storage medium
CN109889724B (en) Image blurring method and device, electronic equipment and readable storage medium
CN110688951A (en) Image processing method and device, electronic equipment and storage medium
CN109934275B (en) Image processing method and device, electronic equipment and storage medium
CN110674719A (en) Target object matching method and device, electronic equipment and storage medium
CN110399841B (en) Video classification method and device and electronic equipment
CN109165738B (en) Neural network model optimization method and device, electronic device and storage medium
CN109543537B (en) Re-recognition model increment training method and device, electronic equipment and storage medium
CN107341509B (en) Convolutional neural network training method and device and readable storage medium
CN111553864A (en) Image restoration method and device, electronic equipment and storage medium
CN111160448B (en) Training method and device for image classification model
CN108171222B (en) Real-time video classification method and device based on multi-stream neural network
CN109447258B (en) Neural network model optimization method and device, electronic device and storage medium
CN111435422B (en) Action recognition method, control method and device, electronic equipment and storage medium
CN112259122A (en) Audio type identification method and device and storage medium
CN113642551A (en) Nail key point detection method and device, electronic equipment and storage medium
CN111507131B (en) Living body detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant