CN113077052A - Reinforcement learning method, device, equipment and medium for sparse reward environment - Google Patents

Reinforcement learning method, device, equipment and medium for sparse reward environment Download PDF

Info

Publication number
CN113077052A
CN113077052A (application CN202110466716.0A)
Authority
CN
China
Prior art keywords
environment
state
similarity
current
states
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110466716.0A
Other languages
Chinese (zh)
Other versions
CN113077052B (en)
Inventor
吴天博
王健宗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202110466716.0A priority Critical patent/CN113077052B/en
Publication of CN113077052A publication Critical patent/CN113077052A/en
Application granted granted Critical
Publication of CN113077052B publication Critical patent/CN113077052B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a reinforcement learning method, device, equipment and medium for a sparse reward environment. The reinforcement learning method comprises: interacting an action with a plurality of copies of the current environment state to obtain a plurality of next-moment environment states; calculating the similarity between the next-moment environment states to obtain a similarity matrix; judging from the similarity matrix whether the current environment state is affected by random noise; if the current environment state is affected by random noise, calculating an intrinsic reward value through a preset environment familiarity model; and performing policy learning according to experience data generated by interaction with the environment and the calculated intrinsic reward value. With the reinforcement learning method provided by the embodiments of the disclosure, a policy can be learned quickly and effectively even when the external reward is sparse or absent.

Description

Reinforcement learning method, device, equipment and medium for sparse reward environment
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a reinforcement learning method, device, equipment and medium for a sparse reward environment.
Background
In reinforcement learning, everything the agent interacts with is called the environment. The environment provides a state to the agent, the agent makes a decision based on that state, and the environment feeds a reward back to the agent. In practical reinforcement learning tasks, however, rewards are often sparse: the environment cannot feed back a reward value for every decision the agent makes, and in some tasks a reward is obtained only in the final state, as in games such as Go (weiqi) or Montezuma's Revenge.
The sparse reward problem can make a reinforcement learning algorithm iterate slowly or even fail to converge. Current approaches to sparse-reward tasks include reward shaping, experience replay, and exploration-exploitation methods. However, reward shaping reconstructs the reward value and lacks generality, and experience replay is only applicable to offline algorithms. Exploration typically relies on curiosity-based methods, which build a prediction model of the state and measure curiosity about the environment by the difference between the predicted next state and the actual next state, using that difference as an intrinsic reward. Under the influence of random noise, however, such a model loses its meaning, because the next state becomes unpredictable and unrelated to the agent's decision.
Disclosure of Invention
The embodiment of the disclosure provides a reinforcement learning method, a reinforcement learning device, reinforcement learning equipment and reinforcement learning media for a sparse reward environment. The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview and is intended to neither identify key/critical elements nor delineate the scope of such embodiments. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that is presented later.
In a first aspect, the disclosed embodiments provide a reinforcement learning method for a sparse reward environment, including:
respectively interacting the actions with a plurality of current environment states to obtain a plurality of environment states at the next moment;
calculating the similarity of the environment state at the next moment to obtain a similarity matrix;
judging whether the current environment state is influenced by random noise according to the similarity matrix;
if the current environment state is affected by random noise, calculating an intrinsic reward value through a preset environment familiarity model;
and performing strategy learning according to empirical data generated by the interaction with the environment and the calculated intrinsic reward value.
In an optional embodiment, before the action is interacted with the plurality of current environment states respectively, the method further includes:
and copying the acquired current environment state to obtain a plurality of current environment states.
In an optional embodiment, determining whether the current environment state is affected by random noise according to the similarity matrix includes:
calculating the sum of the similarity matrixes;
and if the sum of the similarity matrixes is smaller than a preset threshold value, determining that the current environment state is influenced by random noise.
In an optional embodiment, if the current environment state is not affected by random noise, the method further includes:
the intrinsic reward value is calculated based on a curiosity method.
In an alternative embodiment, the intrinsic reward value is calculated based on a curiosity method, including:
adopting a neural network to construct an environment model;
inputting the current environment state and the current action into an environment model, and outputting a predicted value of the next state;
and calculating a prediction error according to the predicted value of the next state, and taking the prediction error as an internal reward value.
In an optional embodiment, if the current environmental status is affected by random noise, calculating the intrinsic reward value through a preset environmental familiarity model includes:
randomly acquiring a preset number of historical environment states from the historical environment states in a random sampling mode;
constructing a random similarity matrix according to the similarity of the randomly acquired historical environment state and the current environment state;
and constructing an environment familiarity model according to the random similarity matrix.
In an alternative embodiment, the intrinsic reward value is calculated by an environmental familiarity model, comprising:
f(s) = 1 / Σsim
where f(s) denotes the intrinsic reward value and Σsim denotes the sum of the random similarity matrix.
In a second aspect, the disclosed embodiments provide a reinforcement learning apparatus for sparse reward environments, including:
the interaction module is used for respectively interacting the actions with the current environment states to obtain a plurality of environment states at the next moment;
the first calculation module is used for calculating the similarity of the environment state at the next moment to obtain a similarity matrix;
the judging module is used for judging whether the current environment state is influenced by random noise or not according to the similarity matrix;
the second calculation module is used for calculating the intrinsic reward value through a preset environment familiarity model when the current environment state is influenced by random noise;
and the strategy learning module is used for learning the strategy according to empirical data generated by the interaction with the environment and the calculated intrinsic reward value.
In a third aspect, the disclosed embodiments provide a computer device, including a memory and a processor, where the memory stores computer readable instructions, and the computer readable instructions, when executed by the processor, cause the processor to execute the steps of the reinforcement learning method for sparse reward environments provided in the above embodiments.
In a fourth aspect, the present disclosure provides a storage medium storing computer readable instructions, which when executed by one or more processors, cause the one or more processors to perform the steps of the reinforcement learning method for sparse reward environments provided by the above embodiments.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
the reinforcement learning method for the sparse reward environment provided by the embodiment of the disclosure can accurately judge whether the current environment state can be influenced by random noise according to the similarity matrix of the environment state, when the influence of the random noise is detected, the environment familiarity is calculated by using the environment familiarity model, and the value of the environment familiarity is used as an internal reward value; when the condition that the influence of random noise cannot be received is detected, calculating a prediction error value by adopting a curiosity model, and taking the prediction error value as an internal reward value; and carrying out algorithm training according to the obtained reward value. The method has strong adaptability and universality, can be used for algorithms based on value functions or strategies, and is not influenced by online or offline methods; the problem of state prediction model failure caused by environmental noise is solved, a new complementary calculation of the internal reward value is provided, and the strategy can be rapidly and effectively learned under the condition that external rewards are sparse or do not exist.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is a diagram illustrating an implementation environment of a reinforcement learning method for a sparse reward environment, according to an exemplary embodiment;
FIG. 2 is a diagram illustrating an internal structure of a computer device in accordance with one illustrative embodiment;
FIG. 3 is a flow diagram illustrating a reinforcement learning method for a sparse reward environment, according to an exemplary embodiment;
FIG. 4 is a flow diagram illustrating a reinforcement learning method for a sparse reward environment, according to an exemplary embodiment;
FIG. 5 is a schematic diagram illustrating a fully-connected neural network model in accordance with an exemplary embodiment;
FIG. 6 is a schematic diagram illustrating a reinforcement learning apparatus for a sparse reward environment, according to an exemplary embodiment;
FIG. 7 is a schematic diagram illustrating the Montezuma's Revenge game interface, according to an example embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
It will be understood that, as used herein, the terms "first," "second," and the like may be used herein to describe various elements, but these elements are not limited by these terms. These terms are only used to distinguish one element from another. For example, a first field and algorithm determination module may be referred to as a second field and algorithm determination module, and similarly, a second field and algorithm determination module may be referred to as a first field and algorithm determination module, without departing from the scope of the present application.
Fig. 1 is a diagram illustrating an implementation environment of a reinforcement learning method for a sparse reward environment according to an exemplary embodiment, as shown in fig. 1, in the implementation environment, including a server 110 and a terminal 120.
The server 110 is a reinforcement learning device for a sparse reward environment, for example a computer device such as a computer used by a technician, and is provided with an intelligent customer-service tool. The terminal 120 is installed with an application that requires reinforcement learning. When reinforcement learning needs to be provided, a technician may send a request at the computer device 110; the request carries a request identifier, and the computer device 110 receives the request and obtains the reinforcement learning method for a sparse reward environment stored on the computer device 110, which is then used to complete the reinforcement learning of the agent.
It should be noted that the terminal 120 and the computer device 110 may be, but are not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like. The computer device 110 and the terminal 120 may be connected through bluetooth, USB (Universal Serial Bus), or other communication connection methods, which is not limited herein.
FIG. 2 is a diagram illustrating an internal structure of a computer device according to an exemplary embodiment. As shown in fig. 2, the computer device includes a processor, a non-volatile storage medium, a memory, and a network interface connected through a system bus. Wherein the non-volatile storage medium of the computer device stores an operating system, a database, and computer readable instructions, the database may store a sequence of control information, and the computer readable instructions, when executed by the processor, may cause the processor to implement a reinforcement learning method for a sparse reward environment. The processor of the computer device is used for providing calculation and control capability and supporting the operation of the whole computer device. The memory of the computer device may have stored therein computer readable instructions that, when executed by the processor, may cause the processor to perform a reinforcement learning method for a sparse reward environment. The network interface of the computer device is used for connecting and communicating with the terminal. Those skilled in the art will appreciate that the architecture shown in fig. 2 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
Montezuma's Revenge is important for game-AI research. Unlike most games in the Arcade Learning Environment, which are now easily solved by deep reinforcement learning agents at beyond-human levels, Montezuma's Revenge has not been solved by deep reinforcement learning methods, because its rewards are extremely sparse: the agent receives a reward value only after a specific, long sequence of actions has been performed.
FIG. 7 is a schematic diagram of the Montezuma's Revenge game interface according to an exemplary embodiment. As shown in FIG. 7, in the first room of Montezuma's Revenge the agent has to climb down a ladder, swing across an open gap on a rope, climb down another ladder, jump over a moving enemy, and finally climb up another ladder, all just to collect the first key in the first room. In the first pass of the game there are 23 such rooms in which the agent has to collect all the keys. Worse, the conditions that end a game are quite harsh: the agent can die from many possible events, such as falling from a height. The environmental reward for the agent in this game is therefore very sparse, which greatly hinders policy learning.
The embodiments of the disclosure address the extremely sparse external reward of Montezuma's Revenge by increasing the agent's exploration of the environment: the agent is highly curious about unfamiliar environments and wants to explore them more. The embodiments of the disclosure use the familiarity of the environment as the intrinsic reward, so that an intrinsic reward value is obtained after every decision, which alleviates the sparse reward problem.
The reinforcement learning method for a sparse reward environment provided by the embodiments of the present application is described in detail below with reference to fig. 3 to 5. The method may be implemented by a computer program running on a data transmission device based on the von Neumann architecture. The computer program may be integrated into an application or may run as a stand-alone tool application.
Referring to fig. 3, a flow chart of a reinforcement learning method for a sparse reward environment is provided for an embodiment of the present application, and as shown in fig. 3, the method of the embodiment of the present application may include the following steps:
s301, the action is interacted with a plurality of current environment states respectively to obtain a plurality of environment states at the next moment.
In a possible implementation manner, before performing step S301, the method further includes acquiring a current environment state, and copying the acquired current environment state to obtain a plurality of current environment states.
Specifically, if the current environment state s_t is affected by environmental randomness, the next state is unrelated to the decision and unpredictable. The embodiments of the disclosure therefore copy the environment in the current state m times and interact the action a_t with each copy respectively, obtaining m next states [s_{t+1,1}, s_{t+1,2}, s_{t+1,3}, ..., s_{t+1,m}], where m is a positive integer greater than 2 whose value can be set by a person skilled in the art.
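As an illustration of this copy-and-step procedure, the following sketch assumes a gym-style environment object that can be deep-copied and exposes a step(action) interface returning the classic 4-tuple; both the interface and the copyability are assumptions for illustration, not requirements stated in the patent.

```python
import copy

def rollout_copies(env, action, m=4):
    """Interact the same action with m deep copies of the current environment.

    Returns the list of next-moment states [s_{t+1,1}, ..., s_{t+1,m}].
    """
    next_states = []
    for _ in range(m):
        env_copy = copy.deepcopy(env)                 # copy of the current environment state
        next_state, _, _, _ = env_copy.step(action)   # older gym 4-tuple step API, assumed
        next_states.append(next_state)
    return next_states
```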
S302, calculating the similarity of the environment state at the next moment to obtain a similarity matrix.
Further, the similarity among the obtained next-moment environment states is analysed. In an optional embodiment, for an agent in a game, the state of the agent is the interface image of the game, so the similarity of two states is the similarity of two images, which can be computed from whether the pixel values at the same positions on the three channels of the colour images are identical. For example, the percentage of identical pixels out of the total number of pixels is the similarity of the two states. The similarity between two environment states can be calculated as follows:
sim(s_i, s_j) = (1/M) * Σ_{k=1..M} 1(s_{i,k} = s_{j,k})
where M is the number of pixels in a state, s_i denotes the i-th state, s_j denotes the j-th state, s_{i,k} denotes the k-th pixel of the i-th state, s_{j,k} denotes the k-th pixel of the j-th state, and 1(·) is the indicator that equals 1 when the two pixels are identical and 0 otherwise.
According to this formula, the state similarity between every pair of states can be calculated. For agents that are not in a game, a person skilled in the art may use other methods to calculate the similarity of the environment states; the embodiments of the disclosure are not specifically limited in this respect.
Further, the pairwise similarities between the m states are arranged into an m × m matrix, the similarity matrix, whose entries represent the similarity between each pair of states, as in the sketch below.
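A minimal sketch of this pairwise similarity computation, assuming each state is an H × W × 3 colour image stored as a NumPy array (the image format is an assumption for illustration):

```python
import numpy as np

def state_similarity(s_i, s_j):
    """Fraction of pixel positions whose values match on all three channels."""
    same_pixels = np.all(s_i == s_j, axis=-1)   # True where the full RGB triple is identical
    return float(same_pixels.mean())

def similarity_matrix(states):
    """Arrange the pairwise similarities of the m next states into an m x m matrix."""
    m = len(states)
    sim = np.ones((m, m))                        # a state is fully similar to itself
    for i in range(m):
        for j in range(i + 1, m):
            sim[i, j] = sim[j, i] = state_similarity(states[i], states[j])
    return sim
```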
S303, judging whether the current environment state is influenced by random noise according to the similarity matrix.
In an optional embodiment, judging whether the current environment state is affected by the random noise according to the similarity matrix includes calculating a sum of the similarity matrices, and determining that the current environment state is affected by the random noise if the sum of the similarity matrices is smaller than a preset threshold.
In one possible implementation, the same action a_t is interacted with each of the identical copied environment states. If the environment is not affected by random noise, the resulting next-moment environment states should be similar to one another, so whether the current environment state is affected by random noise can be judged from the similarity matrix: the closer an element of the matrix is to 1, the more similar the two states it relates.
Specifically, the sum of the elements of the similarity matrix is compared against a threshold, which a person skilled in the art can set according to the actual situation; the embodiments of the disclosure are not specifically limited. If the sum of the similarity matrix is greater than or equal to the preset threshold, the similarity between the states is relatively high and the current environment state is not affected by random noise. If the sum of the similarity matrix is smaller than the preset threshold, the state is affected by random noise, and using a next-state prediction model on this state is meaningless.
Further, noisy next-state prediction data are removed and the next-state prediction model is not updated with such states, so that the training of the model is not disturbed.
With this step, whether the current environment state is affected by random noise can be determined unambiguously from the sum of the similarity matrix, as in the sketch below.
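The noise test itself then reduces to a single comparison against the preset threshold; the threshold value is task-specific and not fixed by the patent:

```python
def is_noisy(sim_matrix, threshold):
    """Return True if the current state is judged to be affected by random noise.

    A small matrix sum means the copied transitions disagree, i.e. the next
    state does not depend on the action in a predictable way.
    """
    return sim_matrix.sum() < threshold
```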
S304, if the current environment state is influenced by random noise, calculating the intrinsic reward value according to a preset environment familiarity model.
In one possible implementation, for states where the next state model loses meaning, the disclosed embodiments provide a model based on environmental familiarity to calculate intrinsic reward values.
Specifically, if the current environmental state is affected by random noise, the intrinsic reward value is calculated through a preset environmental familiarity model, including: randomly acquiring a preset number of historical environment states from the historical environment states in a random sampling mode; constructing a random similarity matrix according to the similarity of the randomly acquired historical environment state and the current environment state; and constructing an environment familiarity model according to the random similarity matrix.
The embodiments of the disclosure provide a complementary way of calculating the intrinsic reward value: when the current environment state is determined to be affected by random noise, the intrinsic reward value is calculated with the environment familiarity model, which is based on state similarity. Specifically, the similarity between the current state and historical environment states is calculated. Computing it against all previously seen states would be too expensive, so the embodiments of the disclosure use random sampling: a preset number of environment states are randomly drawn from the historical environment states, and the state similarity between the current environment state s_t and the randomly sampled historical environment states s_{t-1}, s_{t-2}, ... is calculated. A random similarity matrix is then constructed from these similarities.
Further, the environment familiarity model is constructed from the random similarity matrix. First, the entries of the random similarity matrix are summed; the larger the sum, the more similar the current state is to the history, the more familiar the environment, and the less it is worth exploring. A function is therefore designed such that the smaller the state similarity, the larger the intrinsic reward value, encouraging the agent to explore the environment.
In an alternative embodiment, the intrinsic reward value is calculated by an environmental familiarity model, and the intrinsic reward value may be calculated according to the following formula:
f(s) = 1 / Σsim
where f(s) denotes the intrinsic reward value and Σsim denotes the sum of the random similarity matrix.
With this formula, the smaller the state similarity, the larger the intrinsic reward value, encouraging the agent to explore the environment.
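A sketch of the environment-familiarity reward, reusing the state_similarity helper from the earlier sketch. The plain reciprocal form, the sample size, and the eps guard are assumptions consistent with the inverse relationship described here; the exact published formula may differ in detail.

```python
import random

def familiarity_reward(current_state, history, n_samples=32, eps=1e-8):
    """Intrinsic reward from the environment-familiarity model.

    Randomly samples a preset number of historical states, computes their
    similarities to the current state, and returns the reciprocal of the
    sum so that unfamiliar (dissimilar) states earn a larger reward.
    """
    sampled = random.sample(history, min(n_samples, len(history)))
    sim_sum = sum(state_similarity(current_state, h) for h in sampled)
    return 1.0 / (sim_sum + eps)                 # eps guards against a zero sum
```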
In an optional embodiment, if the current environment state is not affected by random noise, the method further includes: the intrinsic reward value is calculated based on a curiosity method. The method comprises the steps of adopting a neural network to construct an environment model, inputting a current environment state and a current action into the environment model, outputting a predicted value of a next state, calculating a prediction error according to the predicted value of the next state, and taking the prediction error as an internal reward value.
Specifically, FIG. 5 is a diagram illustrating a fully connected neural network model according to an exemplary embodiment. During training, the novelty of a state is evaluated with a constructed environment model M, whose input is the state s_t and the action a_t and whose output is a prediction of the state s_{t+1}. The specific method is as follows: first, the environment model M is constructed, for example with a fully connected neural network, whose inputs are the current state s_t and the current action a_t and whose output is the predicted value M(s_t, a_t) of the next state; the prediction error of the network is then calculated by:
e(s_t, a_t) = || φ(s_{t+1}) − M(s_t, a_t) ||^2
where φ(·) encodes the state. The larger the prediction error, the more novel the environment and the greater the curiosity; the prediction error e(s_t, a_t) is therefore taken as the intrinsic reward value.
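A sketch of this curiosity branch in PyTorch. The layer sizes, the learned linear encoder standing in for the state encoding φ, and the flat state / one-hot action inputs are illustrative assumptions, not details fixed by the patent.

```python
import torch
import torch.nn as nn

class EnvModel(nn.Module):
    """Fully connected environment model M(s_t, a_t) predicting an encoding of s_{t+1}."""

    def __init__(self, state_dim, action_dim, enc_dim=64, hidden=128):
        super().__init__()
        self.phi = nn.Linear(state_dim, enc_dim)          # state encoder (assumed linear)
        self.model = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, enc_dim),
        )

    def intrinsic_reward(self, s_t, a_t, s_next):
        """Prediction error e(s_t, a_t) used as the intrinsic reward value."""
        pred = self.model(torch.cat([s_t, a_t], dim=-1))  # M(s_t, a_t)
        target = self.phi(s_next)                          # encoded actual next state
        return ((target - pred) ** 2).sum(dim=-1)          # squared prediction error
```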
With this step, when the influence of random noise is detected, the environment familiarity is calculated with the environment familiarity model and its value is used as the intrinsic reward value; when no influence of random noise is detected, the prediction error is calculated with the curiosity model and used as the intrinsic reward value.
S305, strategy learning is carried out according to experience data generated by interaction with the environment and the calculated intrinsic reward value.
Further, the global reward is obtained by adding the calculated intrinsic reward value to the extrinsic reward value; since the extrinsic reward value is zero at almost every step, the agent is still driven by the intrinsic reward during policy learning. The agent applies the intrinsic reward value calculated in step S304 to its learning function and iteratively updates that function until a better policy is obtained, as in the sketch below.
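For illustration, combining the two reward terms for policy learning might look as follows; the weighting coefficient beta is an added assumption (set it to 1 to match the plain sum described above):

```python
def global_reward(extrinsic, intrinsic, beta=1.0):
    """Global reward = extrinsic reward + (weighted) intrinsic reward.

    In sparse tasks the extrinsic term is zero at almost every step, so the
    intrinsic term is what actually drives exploration and policy learning.
    """
    return extrinsic + beta * intrinsic
```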
Thus, the agent can compute an intrinsic reward even when the external reward is sparse, and thereby explore the environment quickly and learn the policy effectively.
For a more detailed understanding of the reinforcement learning method for sparse reward environment provided by the embodiments of the present disclosure, the following description is made in conjunction with fig. 4.
As shown in FIG. 4, the current environment state s_t is first copied m times, and the action a_t is interacted with each copied environment state to obtain m next-moment states [s_{t+1,1}, s_{t+1,2}, s_{t+1,3}, ..., s_{t+1,m}]. The pairwise state similarities between the m states are then calculated and arranged into a matrix. If the sum of the state-similarity matrix is greater than or equal to the preset threshold, the similarity between the states is high, the environment is not affected by random noise, and the curiosity model can continue to be used to calculate the intrinsic reward value; if the sum is smaller than the preset threshold, the similarity between the states is low, the environment is affected by random noise, and the environment familiarity model provided by the embodiments of the disclosure is used to calculate the intrinsic reward value.
Further, when the current environment state is determined to be affected by random noise, the intrinsic reward value is calculated with the environment familiarity model, which is based on state similarity. The similarity between the current state and historical environment states is calculated; to avoid the excessive cost of comparing against all previously seen states, a preset number of environment states are randomly sampled from the historical environment states, the state similarity between the current environment state s_t and the sampled historical states s_{t-1}, s_{t-2}, ... is calculated, and a random similarity matrix is then constructed.
Further, the environment familiarity model is constructed from the random similarity matrix. The entries of the random similarity matrix are summed; the larger the sum, the more similar the current state is to the history, the more familiar the environment, and the less it is worth exploring. An inverse-proportion function can therefore be designed so that the smaller the state similarity, the larger the intrinsic reward value, encouraging the agent to explore the environment.
In one possible implementation, the intrinsic reward value is calculated by the environmental familiarity model, and may be calculated according to the following formula:
f(s) = 1 / Σsim
where f(s) denotes the intrinsic reward value and Σsim denotes the sum of the random similarity matrix.
If it is determined that the current environment is not affected by random noise, the curiosity model is used to calculate the intrinsic reward value. First, an environment model M is constructed, for example with a fully connected neural network, whose inputs are the current state s_t and the current action a_t and whose output is the predicted value M(s_t, a_t) of the next state; the prediction error of the network is then calculated by:
e(s_t, a_t) = || φ(s_{t+1}) − M(s_t, a_t) ||^2
where φ(·) encodes the state. The larger the prediction error, the more novel the environment and the greater the curiosity; the prediction error e(s_t, a_t) is therefore taken as the intrinsic reward value.
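Putting the two branches of FIG. 4 together, a hypothetical dispatch routine could look like the sketch below, reusing the helpers sketched earlier. The default threshold of half the maximum matrix sum, the gym-style action_space accessor, and the one-hot action encoding are assumptions introduced only for illustration.

```python
import torch

def compute_intrinsic_reward(env, env_model, s_t, action, history, m=4, threshold=None):
    """Choose between the familiarity model and the curiosity model (FIG. 4)."""
    if threshold is None:
        threshold = 0.5 * m * m                          # placeholder threshold (assumption)
    next_states = rollout_copies(env, action, m)          # m copies, same action
    sim = similarity_matrix(next_states)
    if sim.sum() < threshold:                             # noisy transition: familiarity reward
        return familiarity_reward(s_t, history)
    # predictable transition: curiosity reward from the EnvModel prediction error
    s = torch.as_tensor(s_t, dtype=torch.float32).flatten()
    s_next = torch.as_tensor(next_states[0], dtype=torch.float32).flatten()
    a = torch.zeros(env.action_space.n)                   # one-hot action (gym-style space, assumed)
    a[action] = 1.0
    return float(env_model.intrinsic_reward(s, a, s_next))
```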
The reinforcement learning method for a sparse reward environment provided by the embodiments of the disclosure is highly adaptable and general: it can be used with value-function-based or policy-based algorithms and is not tied to online or offline methods. It overcomes the failure of the state prediction model caused by environmental noise and provides a new complementary way of computing the intrinsic reward value, so that a policy can be learned quickly and effectively when the external reward is sparse or absent.
The following are embodiments of the apparatus of the present application that may be used to perform embodiments of the method of the present invention. For details which are not disclosed in the embodiments of the apparatus of the present invention, reference is made to the embodiments of the method of the present invention.
Referring to fig. 6, a schematic structural diagram of a reinforcement learning apparatus for a sparse reward environment according to an exemplary embodiment of the present invention is shown. As shown in fig. 6, the reinforcement learning apparatus for a sparse reward environment may be integrated in the computer device 110, and specifically may include an interaction module 601, a first calculation module 602, a determination module 603, a second calculation module 604, and a policy learning module 605.
The interaction module 601 is configured to interact the action with a plurality of current environment states respectively to obtain a plurality of environment states at a next moment;
a first calculating module 602, configured to calculate a similarity of the environmental state at the next time to obtain a similarity matrix;
a judging module 603, configured to judge whether the current environment state is affected by random noise according to the similarity matrix;
a second calculating module 604, configured to calculate an intrinsic reward value through a preset environment familiarity model when the current environment state may be affected by random noise;
and a policy learning module 605 for performing policy learning according to the experience data generated by the interaction with the environment and the calculated intrinsic reward value.
In an optional embodiment, the system further includes a copying module, configured to copy the acquired current environment state to obtain multiple current environment states before the action interacts with the multiple current environment states, respectively.
In an optional embodiment, the determining module 603 is specifically configured to calculate a sum of similarity matrices; and if the sum of the similarity matrixes is smaller than a preset threshold value, determining that the current environment state is influenced by random noise.
In an optional embodiment, the system further comprises a third calculation module, which is used for calculating the intrinsic reward value based on a curiosity method when the current environment state is not influenced by random noise.
In an optional embodiment, the third computing module is specifically configured to construct the environment model using a neural network; inputting the current environment state and the current action into an environment model, and outputting a predicted value of the next state; and calculating a prediction error according to the predicted value of the next state, and taking the prediction error as an internal reward value.
In an optional embodiment, the second calculation module is specifically configured to randomly acquire a preset number of historical environment states from the historical environment states in a random sampling manner; constructing a random similarity matrix according to the similarity of the randomly acquired historical environment state and the current environment state; and constructing an environment familiarity model according to the random similarity matrix.
In an alternative embodiment, the second calculation module is specifically configured to calculate the intrinsic reward value according to the following formula,
f(s) = 1 / Σsim
where f(s) denotes the intrinsic reward value and Σsim denotes the sum of the random similarity matrix.
It should be noted that, when the reinforcement learning apparatus for a sparse reward environment provided in the foregoing embodiment executes the reinforcement learning method for a sparse reward environment, the division of the functional modules is merely used as an example, and in practical applications, the function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the reinforcement learning device for the sparse rewarding environment and the reinforcement learning method for the sparse rewarding environment provided by the above embodiments belong to the same concept, and details of implementation processes are shown in the method embodiments and are not described herein again.
In one embodiment, a computer device is proposed. The computer device comprises a memory, a processor and a computer program stored on the memory and executable on the processor, and the processor implements the following steps when executing the computer program: interacting an action with a plurality of current environment states respectively to obtain a plurality of next-moment environment states; calculating the similarity of the next-moment environment states to obtain a similarity matrix; judging whether the current environment state is affected by random noise according to the similarity matrix; if the current environment state is affected by random noise, calculating an intrinsic reward value through a preset environment familiarity model; and performing policy learning according to experience data generated by interaction with the environment and the calculated intrinsic reward value.
In one embodiment, a storage medium is provided that stores computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps: interacting an action with a plurality of current environment states respectively to obtain a plurality of next-moment environment states; calculating the similarity of the next-moment environment states to obtain a similarity matrix; judging whether the current environment state is affected by random noise according to the similarity matrix; if the current environment state is affected by random noise, calculating an intrinsic reward value through a preset environment familiarity model; and performing policy learning according to experience data generated by interaction with the environment and the calculated intrinsic reward value.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the computer program is executed. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
The technical features of the above embodiments can be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the above embodiments are not described, but should be considered as the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above examples only show some embodiments of the present invention, and the description thereof is more specific and detailed, but not construed as limiting the scope of the present invention. It should be noted that, for a person skilled in the art, several variations and modifications can be made without departing from the inventive concept, which falls within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (10)

1. A reinforcement learning method for sparse reward environments, comprising:
respectively interacting the actions with a plurality of current environment states to obtain a plurality of environment states at the next moment;
calculating the similarity of the environmental state at the next moment to obtain a similarity matrix;
judging whether the current environment state is influenced by random noise or not according to the similarity matrix;
if the current environment state is affected by random noise, calculating an intrinsic reward value through a preset environment familiarity model;
and performing strategy learning according to empirical data generated by the interaction with the environment and the calculated intrinsic reward value.
2. The method of claim 1, wherein prior to interacting the actions with the plurality of current environmental states, respectively, further comprising:
and copying the acquired current environment state to obtain a plurality of current environment states.
3. The method of claim 1, wherein determining whether the current environmental state is affected by random noise according to the similarity matrix comprises:
calculating the sum of the similarity matrixes;
and if the sum of the similarity matrixes is smaller than a preset threshold value, determining that the current environment state is influenced by random noise.
4. The method of claim 3, wherein if the current environmental state is not affected by random noise, further comprising:
the intrinsic reward value is calculated based on a curiosity method.
5. The method of claim 4, wherein calculating the intrinsic reward value based on a curiosity method comprises:
adopting a neural network to construct an environment model;
inputting the current environment state and the current action into the environment model, and outputting a predicted value of the next state;
and calculating a prediction error according to the predicted value of the next state, and taking the prediction error as an internal reward value.
6. The method of claim 1, wherein calculating the intrinsic reward value through a predetermined environmental familiarity model if the current environmental status is affected by random noise comprises:
randomly acquiring a preset number of historical environment states from the historical environment states in a random sampling mode;
constructing a random similarity matrix according to the similarity between the randomly acquired historical environment state and the current environment state;
and constructing the environment familiarity model according to the random similarity matrix.
7. The method of claim 6, wherein calculating an intrinsic reward value via the environmental familiarity model comprises:
f(s) = 1 / Σsim
where f(s) denotes the intrinsic reward value and Σsim denotes the sum of the random similarity matrix.
8. A reinforcement learning apparatus for sparse reward environments, comprising:
the interaction module is used for respectively interacting the actions with the current environment states to obtain a plurality of environment states at the next moment;
the first calculation module is used for calculating the similarity of the environment state at the next moment to obtain a similarity matrix;
the judging module is used for judging whether the current environment state is influenced by random noise or not according to the similarity matrix;
the second calculation module is used for calculating the intrinsic reward value through a preset environment familiarity model when the current environment state is influenced by random noise;
and the strategy learning module is used for learning the strategy according to empirical data generated by the interaction with the environment and the calculated intrinsic reward value.
9. A computer device comprising a memory and a processor, the memory having stored therein computer readable instructions which, when executed by the processor, cause the processor to perform the steps of the reinforcement learning method for a sparse reward environment of any one of claims 1 to 7.
10. A storage medium storing computer readable instructions which, when executed by one or more processors, cause the one or more processors to perform the steps of the reinforcement learning method for sparse reward environments of any of claims 1 to 7.
CN202110466716.0A 2021-04-28 2021-04-28 Reinforcement learning method, device, equipment and medium for sparse rewarding environment Active CN113077052B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110466716.0A CN113077052B (en) 2021-04-28 2021-04-28 Reinforcement learning method, device, equipment and medium for sparse rewarding environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110466716.0A CN113077052B (en) 2021-04-28 2021-04-28 Reinforcement learning method, device, equipment and medium for sparse rewarding environment

Publications (2)

Publication Number Publication Date
CN113077052A true CN113077052A (en) 2021-07-06
CN113077052B CN113077052B (en) 2023-10-24

Family

ID=76619125

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110466716.0A Active CN113077052B (en) 2021-04-28 2021-04-28 Reinforcement learning method, device, equipment and medium for sparse rewarding environment

Country Status (1)

Country Link
CN (1) CN113077052B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113671834A (en) * 2021-08-24 2021-11-19 郑州大学 Robot flexible behavior decision method and device
CN113704979A (en) * 2021-08-07 2021-11-26 中国航空工业集团公司沈阳飞机设计研究所 Air countermeasure maneuver control method based on random neural network
CN114492845A (en) * 2022-04-01 2022-05-13 中国科学技术大学 Method for improving reinforcement learning exploration efficiency under resource-limited condition
WO2023123838A1 (en) * 2021-12-31 2023-07-06 上海商汤智能科技有限公司 Network training method and apparatus, robot control method and apparatus, device, storage medium, and program
CN116679615A (en) * 2023-08-03 2023-09-01 中科航迈数控软件(深圳)有限公司 Optimization method and device of numerical control machining process, terminal equipment and storage medium
CN117593095A (en) * 2024-01-17 2024-02-23 苏州元脑智能科技有限公司 Method, device, computer equipment and storage medium for self-adaptive parameter adjustment

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9754221B1 (en) * 2017-03-09 2017-09-05 Alphaics Corporation Processor for implementing reinforcement learning operations
CN110399920A (en) * 2019-07-25 2019-11-01 哈尔滨工业大学(深圳) A kind of non-perfect information game method, apparatus, system and storage medium based on deeply study
CN110533192A (en) * 2019-08-30 2019-12-03 京东城市(北京)数字科技有限公司 Intensified learning method, apparatus, computer-readable medium and electronic equipment
CN111461325A (en) * 2020-03-30 2020-07-28 华南理工大学 Multi-target layered reinforcement learning algorithm for sparse rewarding environment problem
US20210089910A1 (en) * 2019-09-25 2021-03-25 Deepmind Technologies Limited Reinforcement learning using meta-learned intrinsic rewards

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9754221B1 (en) * 2017-03-09 2017-09-05 Alphaics Corporation Processor for implementing reinforcement learning operations
CN110399920A (en) * 2019-07-25 2019-11-01 哈尔滨工业大学(深圳) A kind of non-perfect information game method, apparatus, system and storage medium based on deeply study
CN110533192A (en) * 2019-08-30 2019-12-03 京东城市(北京)数字科技有限公司 Intensified learning method, apparatus, computer-readable medium and electronic equipment
US20210089910A1 (en) * 2019-09-25 2021-03-25 Deepmind Technologies Limited Reinforcement learning using meta-learned intrinsic rewards
CN111461325A (en) * 2020-03-30 2020-07-28 华南理工大学 Multi-target layered reinforcement learning algorithm for sparse rewarding environment problem

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113704979A (en) * 2021-08-07 2021-11-26 中国航空工业集团公司沈阳飞机设计研究所 Air countermeasure maneuver control method based on random neural network
CN113704979B (en) * 2021-08-07 2024-05-10 中国航空工业集团公司沈阳飞机设计研究所 Air countermeasure maneuvering control method based on random neural network
CN113671834A (en) * 2021-08-24 2021-11-19 郑州大学 Robot flexible behavior decision method and device
CN113671834B (en) * 2021-08-24 2023-09-01 郑州大学 Robot flexible behavior decision method and equipment
WO2023123838A1 (en) * 2021-12-31 2023-07-06 上海商汤智能科技有限公司 Network training method and apparatus, robot control method and apparatus, device, storage medium, and program
CN114492845A (en) * 2022-04-01 2022-05-13 中国科学技术大学 Method for improving reinforcement learning exploration efficiency under resource-limited condition
CN114492845B (en) * 2022-04-01 2022-07-15 中国科学技术大学 Method for improving reinforcement learning exploration efficiency under resource-limited condition
CN116679615A (en) * 2023-08-03 2023-09-01 中科航迈数控软件(深圳)有限公司 Optimization method and device of numerical control machining process, terminal equipment and storage medium
CN116679615B (en) * 2023-08-03 2023-10-20 中科航迈数控软件(深圳)有限公司 Optimization method and device of numerical control machining process, terminal equipment and storage medium
CN117593095A (en) * 2024-01-17 2024-02-23 苏州元脑智能科技有限公司 Method, device, computer equipment and storage medium for self-adaptive parameter adjustment
CN117593095B (en) * 2024-01-17 2024-03-22 苏州元脑智能科技有限公司 Method, device, computer equipment and storage medium for self-adaptive parameter adjustment

Also Published As

Publication number Publication date
CN113077052B (en) 2023-10-24

Similar Documents

Publication Publication Date Title
CN113077052A (en) Reinforced learning method, device, equipment and medium for sparse reward environment
KR102523888B1 (en) Method, Apparatus and Device for Scheduling Virtual Objects in a Virtual Environment
CN109690576A (en) The training machine learning model in multiple machine learning tasks
EP3734519A1 (en) Method for generating universal learned model
CN110481536B (en) Control method and device applied to hybrid electric vehicle
CN112329948A (en) Multi-agent strategy prediction method and device
CN111209215B (en) Application program testing method and device, computer equipment and storage medium
CN113361680A (en) Neural network architecture searching method, device, equipment and medium
CN114139637B (en) Multi-agent information fusion method and device, electronic equipment and readable storage medium
CN111125519B (en) User behavior prediction method, device, electronic equipment and storage medium
CN113031983B (en) Intelligent software upgrading method and device based on deep reinforcement learning
CN114297934A (en) Model parameter parallel simulation optimization method and device based on proxy model
CN117493066A (en) Fault prediction method, device, equipment and medium of server
CN111144243B (en) Household pattern recognition method and device based on counterstudy
CN104537224B (en) Multi-state System Reliability analysis method and system based on adaptive learning algorithm
US20220343216A1 (en) Information processing apparatus and information processing method
CN113822441B (en) Decision model training method, device, terminal equipment and storage medium
CN115409217A (en) Multitask predictive maintenance method based on multi-expert hybrid network
CN113887708A (en) Multi-agent learning method based on mean field, storage medium and electronic device
CN114124784B (en) Intelligent routing decision protection method and system based on vertical federation
CN117395164B (en) Network attribute prediction method and system for industrial Internet of things
CN115470894B (en) Unmanned aerial vehicle knowledge model time-sharing calling method and device based on reinforcement learning
CN112766490B (en) Feature variable learning method, device, equipment and computer readable storage medium
CN115577617A (en) Multi-agent system strategy evaluation method for small sampling in noise environment
CN117952272A (en) Digital service network-oriented flow prediction model training method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant