CN114841276A - Data processing method and device, electronic equipment and computer readable medium - Google Patents

Data processing method and device, electronic equipment and computer readable medium

Info

Publication number
CN114841276A
Authority
CN
China
Prior art keywords
state
feature
environment
selected state
environmental
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210522438.0A
Other languages
Chinese (zh)
Inventor
徐鑫
张亮亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Jingdong Qianshi Technology Co Ltd
Original Assignee
Beijing Jingdong Qianshi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Jingdong Qianshi Technology Co Ltd filed Critical Beijing Jingdong Qianshi Technology Co Ltd
Priority to CN202210522438.0A
Publication of CN114841276A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The disclosure provides a data processing method and device, relating to technical fields such as artificial intelligence and automatic driving. One embodiment of the method comprises: taking the environmental state of the environment where the agent is located at the current moment as a selected state; determining a set number of environmental states in sequence before the selected state; determining a characterization formula for estimating a local intrinsic dimension of the selected state based on the selected state and the set number of environmental states, the local intrinsic dimension being a mathematical quantity for measuring the dimension of a state space; and calculating, based on the characterization formula, an intrinsic reward that guides the agent's actions. The embodiment improves the performance of reinforcement learning.

Description

Data processing method and device, electronic equipment and computer readable medium
Technical Field
The present disclosure relates to the field of computer technologies, in particular to the fields of artificial intelligence and automatic driving, and more particularly to a data processing method and apparatus, an electronic device, a computer-readable medium, and a computer program product.
Background
At present, mainstream exploration methods based on intrinsic rewards in sparse-reward environments mainly rely on self-supervised approaches. A self-supervised approach usually uses a trained deep learning model to record the state space the agent has already explored; when the agent chooses again, the state spaces it has experienced many times are distinguished according to the model's record, so that repeated exploration is avoided. However, current self-supervised approaches have various defects: quantitative analysis cannot be carried out, the characteristics of reinforcement learning are not considered, and so on.
Disclosure of Invention
Embodiments of the present disclosure propose data processing methods and apparatuses, electronic devices, computer-readable media, and computer program products.
In a first aspect, an embodiment of the present disclosure provides a data processing method, where the method includes: taking the environmental state of the environment where the agent is located at the current moment as a selected state; determining a set number of environmental states in sequence before the selected state; determining a characterization formula for estimating a local intrinsic dimension of the selected state based on the selected state and the set number of environmental states, wherein the local intrinsic dimension is a mathematical quantity for measuring the dimension of a state space; and calculating, based on the characterization formula, an intrinsic reward for directing the agent's actions.
In some embodiments, calculating the intrinsic reward for directing the agent's actions based on the characterization formula includes: estimating the number of times the agent has visited the selected state to obtain a pseudo count value of the selected state; calculating a local intrinsic dimension estimate of the selected state based on the pseudo count value and the characterization formula; and using the local intrinsic dimension estimate as the intrinsic reward for directing the agent's actions.
In some embodiments, estimating the number of times the agent has visited the selected state to obtain the pseudo count value of the selected state includes: estimating the probability of the selected state occurring within a set time period with a trained first density model to obtain a first density value, the first density model being trained based on the set number of environmental states; estimating the probability of the selected state occurring within the set time period with a trained second density model to obtain a second density value, the second density model being trained based on the selected state and the set number of environmental states; and obtaining the pseudo count value of the selected state based on the first density value and the second density value.
In some embodiments, obtaining the pseudo count value of the selected state based on the first density value and the second density value includes: calculating an information gain of the first density value and the second density value; and calculating the pseudo count value of the selected state based on the information gain.
In some embodiments, calculating the local intrinsic dimension estimate of the selected state based on the pseudo count value and the characterization formula includes: splitting the environmental states corresponding to the characterization formula into a pseudo-count-value number of copies of the selected state and a difference-value number of environmental states, the difference value being the set number minus the pseudo count value; splitting the characterization formula into a first expression and a second expression, the first expression corresponding to the copies of the selected state and the second expression corresponding to the difference-value number of environmental states; and calculating the second expression based on the difference-value number of environmental states to obtain the local intrinsic dimension estimate of the selected state.
In some embodiments, the above characterization formula is:

$$\widehat{\mathrm{LID}}(s_{t+1}) = -\left(\frac{1}{k}\sum_{i=1}^{k}\log\frac{r_i(s_{t+1}, S_B)}{r_{\max}(s_{t+1}, S_B)}\right)^{-1}$$

where $s_{t+1}$ is the selected state; $S_B = [s_{t-k+1}, s_{t-k+2}, \ldots, s_{t-1}, s_t]$ is the environmental state sequence formed by the k environmental states in sequence before the selected state; $r_i(s_{t+1}, S_B)$ is the distance from the i-th nearest of the k environmental states in $S_B$ closest to the selected state $s_{t+1}$ to $s_{t+1}$; $r_{\max}(s_{t+1}, S_B)$ is the distance from the environmental state farthest from $s_{t+1}$ among those k environmental states to $s_{t+1}$; and k and i are natural numbers with i ≤ k.
In some embodiments, the above characterization formula is:

$$\widehat{\mathrm{LID}}(s_{t+1}) = -\left(\frac{1}{k}\sum_{i=1}^{k}\log\frac{r_i\big(\psi(s_{t+1}), \psi(S_B)\big)}{r_{\max}\big(\psi(s_{t+1}), \psi(S_B)\big)}\right)^{-1}$$

where $s_{t+1}$ is the selected state; $S_B = [s_{t-k+1}, s_{t-k+2}, \ldots, s_{t-1}, s_t]$ is the environmental state sequence formed by the k environmental states in sequence before the selected state; $\psi(s_{t+1})$ is the selected feature of the selected state; $r_i(\psi(s_{t+1}), \psi(S_B))$ is the distance from the i-th nearest of the k environmental features in $\psi(S_B)$ closest to the selected feature to the selected feature, an environmental feature being the feature of an environmental state; $r_{\max}(\psi(s_{t+1}), \psi(S_B))$ is the distance from the environmental feature farthest from the selected feature among those k environmental features to the selected feature; and k and i are natural numbers with i ≤ k.
In some embodiments, the selected feature and the environmental features are obtained through a trained feature transformation model, and the feature transformation model is trained through the following steps: performing feature transformation on the selected state and the environmental state adjacent to the selected state in the environmental state sequence to obtain a selected feature and an adjacent feature; sending the environmental state adjacent to the selected state to the agent to obtain the behavior state output by the agent; obtaining a predicted value of the selected feature based on the behavior state and the adjacent feature; obtaining a predicted value of the behavior state based on the adjacent feature and the selected feature; and adjusting parameters of the feature transformation network based on the predicted value of the selected feature, the predicted value of the behavior state, and the behavior state to obtain the feature transformation model.
In some embodiments, the agent comprises an autonomous vehicle, and the environmental state comprises an operating state of the autonomous vehicle; the method further comprises: determining a behavioral state of the autonomous vehicle based on the selected state and the intrinsic reward.
In a second aspect, an embodiment of the present disclosure provides a data processing apparatus, including: the intelligent agent selecting unit is configured to select the environment state of the environment where the intelligent agent is located at the current moment as a selected state; a determination unit configured to determine a set number of environmental states in order before a selected state; a dimension characterization unit configured to determine a characterization formula for estimating a local intrinsic dimension of the selected state based on the selected state and a set number of environmental states, the local intrinsic dimension being a mathematical quantity that measures a dimension of a state space; and the computing unit is configured to compute the intrinsic reward for guiding the action of the intelligent agent based on the characterization formula.
In some embodiments, the computing unit includes: an estimation subunit configured to estimate the number of times the agent has visited the selected state to obtain a pseudo count value of the selected state; a calculation subunit configured to calculate a local intrinsic dimension estimate of the selected state based on the pseudo count value and the characterization formula; and a subunit configured to use the local intrinsic dimension estimate as the intrinsic reward for directing the agent's actions.
In some embodiments, the estimation subunit includes: a first estimation module configured to estimate the probability of the selected state occurring within a set time period with a trained first density model to obtain a first density value, the first density model being trained based on the set number of environmental states; a second estimation module configured to estimate the probability of the selected state occurring within the set time period with a trained second density model to obtain a second density value, the second density model being trained based on the selected state and the set number of environmental states; and an obtaining module configured to obtain a pseudo count value of the selected state based on the first density value and the second density value.
In some embodiments, the obtaining module includes: a gain sub-module configured to calculate an information gain of the first density value and the second density value; and a calculation sub-module configured to calculate a pseudo count value of the selected state based on the information gain.
In some embodiments, the calculation subunit includes: a state splitting module configured to split the environmental states corresponding to the characterization formula into a pseudo-count-value number of copies of the selected state and a difference-value number of environmental states, the difference value being the set number minus the pseudo count value; a formula splitting module configured to split the characterization formula into a first expression and a second expression, the first expression corresponding to the copies of the selected state and the second expression corresponding to the difference-value number of environmental states; and a calculation module configured to calculate the second expression based on the difference-value number of environmental states to obtain the local intrinsic dimension estimate of the selected state.
In some embodiments, the above characterization formula is:

$$\widehat{\mathrm{LID}}(s_{t+1}) = -\left(\frac{1}{k}\sum_{i=1}^{k}\log\frac{r_i(s_{t+1}, S_B)}{r_{\max}(s_{t+1}, S_B)}\right)^{-1}$$

where $s_{t+1}$ is the selected state; $S_B = [s_{t-k+1}, s_{t-k+2}, \ldots, s_{t-1}, s_t]$ is the environmental state sequence formed by the k environmental states in sequence before the selected state; $r_i(s_{t+1}, S_B)$ is the distance from the i-th nearest of the k environmental states in $S_B$ closest to the selected state $s_{t+1}$ to $s_{t+1}$; $r_{\max}(s_{t+1}, S_B)$ is the distance from the environmental state farthest from $s_{t+1}$ among those k environmental states to $s_{t+1}$; and k and i are natural numbers with i ≤ k.
In some embodiments, the above characterization formula is:

$$\widehat{\mathrm{LID}}(s_{t+1}) = -\left(\frac{1}{k}\sum_{i=1}^{k}\log\frac{r_i\big(\psi(s_{t+1}), \psi(S_B)\big)}{r_{\max}\big(\psi(s_{t+1}), \psi(S_B)\big)}\right)^{-1}$$

where $s_{t+1}$ is the selected state; $S_B = [s_{t-k+1}, s_{t-k+2}, \ldots, s_{t-1}, s_t]$ is the environmental state sequence formed by the k environmental states in sequence before the selected state; $\psi(s_{t+1})$ is the selected feature of the selected state; $r_i(\psi(s_{t+1}), \psi(S_B))$ is the distance from the i-th nearest of the k environmental features in $\psi(S_B)$ closest to the selected feature to the selected feature, an environmental feature being the feature of an environmental state; $r_{\max}(\psi(s_{t+1}), \psi(S_B))$ is the distance from the environmental feature farthest from the selected feature among those k environmental features to the selected feature; and k and i are natural numbers with i ≤ k.
In some embodiments, the selected feature and the environmental features are obtained through a trained feature transformation model, and the feature transformation model is trained through the following steps: performing feature transformation on the selected state and the environmental state adjacent to the selected state in the environmental state sequence to obtain a selected feature and an adjacent feature; sending the environmental state adjacent to the selected state to the agent to obtain the behavior state output by the agent; obtaining a predicted value of the selected feature based on the behavior state and the adjacent feature; obtaining a predicted value of the behavior state based on the adjacent feature and the selected feature; and adjusting parameters of the feature transformation network based on the predicted value of the selected feature, the predicted value of the behavior state, and the behavior state to obtain the feature transformation model.
In some embodiments, the agent comprises an autonomous vehicle, and the environmental state comprises: an operating state of the autonomous vehicle; the apparatus is further configured to determine a behavioral state of the autonomous vehicle based on the selected state and the intrinsic reward.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: one or more processors; a storage device having one or more programs stored thereon; when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method as described in any implementation of the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable medium on which a computer program is stored, which when executed by a processor implements the method as described in any of the implementations of the first aspect.
In a fifth aspect, embodiments of the present disclosure provide a computer program product comprising a computer program that, when executed by a processor, implements a method as described in any of the implementations of the first aspect.
According to the data processing method and device provided by the embodiments of the present disclosure, first, the environmental state of the environment where the agent is located at the current moment is taken as the selected state; second, a set number of environmental states in sequence before the selected state are determined; third, a characterization formula for estimating the local intrinsic dimension of the selected state is determined based on the selected state and the set number of environmental states, the local intrinsic dimension being a mathematical quantity for measuring the dimension of a state space; finally, based on the characterization formula, an intrinsic reward for directing the agent's actions is calculated. In this way, when the agent interacts with its environment, the dimension of the state space it currently occupies is estimated, and the intrinsic reward of the agent is determined based on the estimated dimension, which encourages the agent to explore higher-dimensional state spaces as much as possible, helps the agent explore the environment better, and improves the efficiency of reinforcement learning.
Drawings
Other features, objects and advantages of the disclosure will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
FIG. 1 is an exemplary system architecture diagram in which one embodiment of the present disclosure may be applied;
FIG. 2 is a flow diagram for one embodiment of a data processing method according to the present disclosure;
FIG. 3 is a schematic diagram of a feature model training framework according to an embodiment of the present disclosure;
FIG. 4a is a schematic diagram of an external true reward according to an embodiment of the present disclosure;
FIG. 4b is a schematic illustration of an intrinsic reward according to an embodiment of the present disclosure;
FIG. 5 is a schematic block diagram of an embodiment of a data processing apparatus according to the present disclosure;
FIG. 6 is a schematic block diagram of an electronic device suitable for use in implementing embodiments of the present disclosure.
Detailed Description
The present disclosure is described in further detail below with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that, in the present disclosure, the embodiments and features of the embodiments may be combined with each other without conflict. The present disclosure will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the data processing method of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, and typically may include wireless communication links and the like.
The terminal devices 101, 102, 103 interact with a server 105 via a network 104 to receive or send messages or the like. Various communication client applications, such as an instant messaging tool, a mailbox client, etc., can be installed on the terminal devices 101, 102, 103.
The terminal apparatuses 101, 102, and 103 may be hardware or software. When the terminal devices 101, 102, 103 are hardware, they may be acquisition devices that acquire environmental states having communication and control functions. The acquisition devices may be in communication with the server 105. When the terminal apparatus 101, 102, 103 is software, it can be installed in the above-described terminal. It may be implemented as a plurality of software or software modules (e.g., software or software modules for collecting environmental conditions) or as a single software or software module. And is not particularly limited herein.
The server 105 may be a server providing various services, such as a bonus server providing support for agents on the terminal devices 101, 102, 103. The reward server can analyze and process the relevant information of each terminal in the network and feed back the processing result (such as the intrinsic reward) to the terminal equipment.
The server may be hardware or software. When the server is hardware, it may be implemented as a distributed server cluster formed by multiple servers, or may be implemented as a single server. When the server is software, it may be implemented as multiple pieces of software or software modules (e.g., software or software modules used to provide distributed services), or as a single piece of software or software module. And is not particularly limited herein.
It should be noted that the data processing method provided by the embodiment of the present disclosure is generally executed by the server 105.
Reinforcement Learning (RL), also known as evaluative learning, is one of the paradigms and methodologies of machine learning. It is used to describe and solve the problem of an agent learning a strategy, in the process of interacting with its environment, to maximize its return or achieve specific goals. For the exploration problem in automatic-driving reinforcement learning, environments can generally be divided into sparse-reward environments and non-sparse-reward environments. In a sparse-reward environment, the environmental reward signal acting on the agent is usually rare: only in rare cases, by completing a specific series of actions over many steps, can the agent contact a reward signal in the environment. For long stretches the agent cannot obtain any feedback from the environment to guide it in adjusting its actions, so it cannot effectively adjust its action strategy, its chance of obtaining a reward signal does not improve, and it is trapped in a predicament. In a non-sparse-reward environment, the environment gives the agent effective reward signals as feedback more densely, so that each time the agent is trained it can effectively adjust its own behavior strategy according to the reward signals collected so far; the adjusted strategy helps the agent obtain denser reward signals, and the further reward signals help the agent adjust its strategy better, establishing a virtuous cycle.
For complex robot control tasks, existing reinforcement learning methods usually require, in order to achieve satisfactory performance, that a great deal of domain knowledge be added for the specific task and that the reward function and training method be carefully designed; when the agent comes to a new environment, the related design usually has to be carried out again for the new task.
In view of the above drawbacks, the data processing method provided by the present disclosure proposes, for sparse-reward environments, an intrinsic reward based on the dimension of the state space: at each moment, the local intrinsic dimension of the state space around the agent is estimated, and the intrinsic reward of the agent in the process of exploring the environment is determined based on the estimated dimension, which helps the agent explore the environment better and improves the efficiency of reinforcement learning. Referring to FIG. 2, a flow 200 of one embodiment of a data processing method according to the present disclosure is shown; the data processing method comprises the following steps:
step 201, taking the environmental state of the environment where the agent is located at the current moment as the selected state.
In this embodiment, the agent is an agent that runs a reinforcement learning method; after reinforcement learning, the agent performs actions on a controlled object (for example, the terminal devices 101, 102, and 103 in FIG. 1), so that the controlled object can be controlled as effectively as possible. In practical application scenarios, reinforcement learning of an agent faces an important problem: the agent may not receive sufficient and effective rewards, or may receive only sparse rewards, which causes the agent to learn slowly and ineffectively. It should be noted that the execution subject on which the data processing method runs may be the above-mentioned agent, or may be a subject other than the agent; the execution subject calculates the intrinsic reward to provide effective support for the agent's actions. For example, under the sparse-reward condition of automatic driving, the agent is provided with an effective intrinsic reward, helping it execute effective actions on the controlled object.
In this embodiment, the selected state may be the environmental state of the environment where the agent is located when performing reinforcement learning. The agent may execute different tasks, and the environmental states corresponding to those tasks differ. For example, for an automatic driving task, the environmental state of the agent may include: the vehicle operating conditions of the autonomous vehicle (such as vehicle speed and fuel consumption) and environmental information around the autonomous vehicle (such as a congestion factor); for virtual game control, the environmental state of the agent may include the running state of the virtual game subject (such as the speed of a virtual vehicle). The current moment can be any moment of the agent in the current environment; by calculating the intrinsic reward of the environmental state at each moment, the intrinsic reward of the state space around the agent can be continuously estimated at each moment, helping the agent explore the environment better.
In step 202, a set number of environmental states in sequence before the selected state is determined.
For most reinforcement learning environments, the environment is typically episodic, that is, organized in rounds: similar tasks are completed within a round, and when the round ends, the agent in the reinforcement learning method terminates its actions as the task of the current round is finished. Many prior attempts to solve the conventional exploration problem are generally inter-round exploration methods; that is, whether the agent needs to explore an environmental state further is measured by comparing the data of all rounds that the agent has used for learning and training during the training process.
When the data processing method provided by the present disclosure estimates the local intrinsic dimension, it can calculate the intrinsic reward using only the environmental states within the current round as samples; it therefore belongs to within-round exploration, and within-round exploration brings generalization to the intrinsic reward estimation.
Optionally, when estimating the local intrinsic dimension, the intrinsic reward may be calculated using both the environmental states across multiple round periods and the environmental states within the round period. This takes into account not only within-round exploration but also inter-round exploration; in a fixed specific environment, inter-round exploration allows over-fitting to that environment, so the local intrinsic dimension achieves better performance.
In this embodiment, the set number of environmental states before the selected state are the environmental states collected in order of time before the selected state. The collected environmental states may be related to the collection period of the execution subject, and the set number may be determined based on the required accuracy of the intrinsic reward calculation; for example, the set number may be 5000.
In this embodiment, the set number may also be the number of environmental states collected in a set time period, where the set time period may correspond to a task running round period of reinforcement learning of the agent, for example, the set time period is one round period of reinforcement learning, or the set time period includes multiple round periods of reinforcement learning.
Step 203, determining a representation for estimating local intrinsic dimensions of the selected state based on the selected state and a set number of environmental states.
In this embodiment, the local intrinsic dimension is a mathematical quantity for measuring the dimension of a state space. When the state spaces of the selected state and the environmental states differ, the characterization formulas that can be chosen to characterize the local intrinsic dimension differ. For example, in a low-dimensional state space, the characterization formula may use a linear relation to characterize the correspondence between the selected state, the environmental states, and the local intrinsic dimension; in a high-dimensional state space, a characterization formula such as formula (2) may be adopted.
In this embodiment, the local intrinsic dimension is introduced into the agent's reinforcement learning process. When the agent interacts with its environment, the dimension of the state space of that environment is estimated, and the intrinsic reward of the agent is determined based on the obtained local intrinsic dimension of the selected state. This encourages the agent, while interacting with the environment, to explore higher-dimensional state spaces as much as possible: a higher-dimensional space is likely to be more worth exploring than the state spaces of other environments and more important in the overall environmental state space. State spaces that lack variation and have no more features than the surrounding environmental states usually receive low local intrinsic dimension estimates; they are of lower importance and carry less information. Such an approach can also help the agent explore the environment better in the absence of external real rewards, rather than consistently failing to learn a meaningful action strategy when no real external reward is available.
In Euclidean space, a two-dimensional sphere has a volume of $\pi r^2$ and a three-dimensional sphere has a volume of $\frac{4}{3}\pi r^3$; similarly, the volume of a D-dimensional sphere is proportional to $r^D$. The dimension of the sphere can therefore be estimated from the changes in radius and volume, as follows:

$$D = \frac{\log\left(V_2 / V_1\right)}{\log\left(r_2 / r_1\right)} \qquad (1)$$

In formula (1), $V_1$ and $r_1$ are a volume value of the sphere and the radius corresponding to that volume; $V_2$ and $r_2$ are another volume value after the sphere's volume changes and the radius corresponding to that volume.
This estimation method for the intrinsic dimension of a sphere is transferred to probability density. The formal definition of the local intrinsic dimension is as follows: for a given data sample x ∈ X, where X is the whole data set, let r > 0 be a random variable representing the distance from x to the other sample data in the whole data set, and let the cumulative probability density function F(r) be positive, continuous, and differentiable. For the sample x at distance r, the local intrinsic dimension is given by formula (2):

$$\mathrm{LID}_F(r) = \lim_{\varepsilon \to 0^+} \frac{\log F\big((1+\varepsilon)\, r\big) - \log F(r)}{\log(1+\varepsilon)} \qquad (2)$$

If the limit exists, the local intrinsic dimension at the point x can be defined as:

$$\mathrm{LID}_F = \lim_{r \to 0^+} \mathrm{LID}_F(r) \qquad (3)$$

where ε is the amount of change in the distance from x to the other sample data in formula (2).
When estimating the local intrinsic dimension, it is not practical to operate on all data. Alternatively, a partial sample $X_B \subseteq X$ may be sampled from the entire data set, and the local intrinsic dimension is estimated from the partial sample $X_B$. In the estimator below, $g(x)$ represents a feature transformation of the sample, $r_i(g(x), g(X_B))$ represents the distance from the i-th nearest of the k (k > 0, k a natural number) samples in $X_B$ closest to the sample x to x, and $r_{\max}(g(x), g(X_B))$ represents the distance from the sample farthest from x among those k samples to x:

$$\widehat{\mathrm{LID}}(x) = -\left(\frac{1}{k}\sum_{i=1}^{k}\log\frac{r_i\big(g(x), g(X_B)\big)}{r_{\max}\big(g(x), g(X_B)\big)}\right)^{-1}$$
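As a concrete illustration of this estimator, the following Python sketch (an illustrative sketch for this description, not part of the patent text) computes the estimate from a sample and a partial data set; the function and parameter names, and the use of NumPy and Euclidean distances, are assumptions.

```python
import numpy as np

def lid_estimate(x, batch, k=10, feature_fn=None, eps=1e-12):
    """Maximum-likelihood estimate of the local intrinsic dimension of sample x
    against a partial sample X_B, in the spirit of the estimator above."""
    g = feature_fn if feature_fn is not None else (lambda v: v)
    gx = np.asarray(g(x), dtype=float)
    gB = np.asarray([g(b) for b in batch], dtype=float)

    # Distances from x to every sample in the partial sample; keep the k nearest.
    dists = np.linalg.norm(gB - gx, axis=1)
    r = np.sort(dists)[:k]
    r_max = max(r[-1], eps)

    # LID(x) = -( (1/k) * sum_i log(r_i / r_max) )^(-1)
    mean_log = np.log(np.maximum(r, eps) / r_max).mean()
    return -1.0 / mean_log if mean_log < 0 else float("inf")

# Illustrative usage: points spread over a 3-D region should yield an estimate close to 3.
rng = np.random.default_rng(0)
X_B = rng.normal(size=(500, 3))
print(lid_estimate(rng.normal(size=3), X_B, k=20))
```

A sample whose neighbourhood varies in many directions keeps the log-ratios close to zero and receives a large estimate, while a sample in a flat, repetitive neighbourhood receives a small estimate, which matches the intended exploration signal.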
Step 204, based on the characterization formula, an intrinsic reward for directing the agent's actions is calculated.
In this embodiment, in the reinforcement learning method, the rewards that guide the agent's actions include an external real reward and an intrinsic reward. The external real reward is used to measure the performance of the agent and generally cannot be changed. The intrinsic reward is internal to the agent and is used to train the agent's action strategy within one life cycle of the agent; the parameters of the agent can be updated through the intrinsic reward. The intrinsic reward obtained through the data processing method of this embodiment can help the agent explore effectively in the current environmental state and determine an effective action strategy.
In the reinforcement learning task of the agent, the local intrinsic dimension estimated from a plurality of consecutive environmental states can be used as the intrinsic reward that guides the agent's actions; the agent can obtain a satisfactory control effect without any external real reward, so that a meaningful action strategy can be learned even when the agent obtains no effective external real reward.
In this embodiment, based on different characterization expressions, the selected state and a set number of environmental states may be input into the characterization expressions, a local intrinsic dimension is obtained through calculation, and the local intrinsic dimension is used as an internal reward corresponding to the selected state. Optionally, feature transformation may be performed on the selected state and a set number of environment states, a characterization expression is calculated through transformed features to obtain a local intrinsic dimension, and the local intrinsic dimension is used as an internal reward corresponding to the selected state.
Optionally, after the local intrinsic dimension is obtained, the local intrinsic dimension may be weighted or subjected to coefficient addition to obtain the intrinsic reward corresponding to the selected state.
In this embodiment, for the agent at a time t before the current time t+1, the observed environmental state is $s_t$. According to the agent's strategy π(a|s) (the π model shown in FIG. 4), the agent obtains the behavior state $a_t$ corresponding to the environmental state $s_t$ and receives the external reward $r^e_t$. At the same time, the environment transitions to the environmental state $s_{t+1}$ at the current time t+1. At this point, the environmental state at the current time can be used as the selected state, and its local intrinsic dimension is estimated according to the characterization formula of the local intrinsic dimension, such as formula (2) or formula (3); the local intrinsic dimension estimate is taken as the intrinsic reward $r^i_{t+1}$ of the selected state.
In some optional implementations of this embodiment, the local intrinsic dimension estimation may use the characterization formula shown in formula (4). A partial sample $X_B \subseteq X$ is sampled from the entire data set, x represents the sample to be estimated, $r_i(x, X_B)$ represents the distance from the i-th nearest of the k samples in $X_B$ closest to the sample x to x, and $r_{\max}(x, X_B)$ represents the distance from the sample farthest from x among those k samples to x. The local intrinsic dimension estimate for the sample x is:

$$\widehat{\mathrm{LID}}(x) = -\left(\frac{1}{k}\sum_{i=1}^{k}\log\frac{r_i(x, X_B)}{r_{\max}(x, X_B)}\right)^{-1} \qquad (4)$$
for the
Figure BDA0003642229950000125
For the calculation of (1), all the past environmental states experienced by the agent are all data, from which a partial sample is to be sampled and the state s is selected t+1 Estimating local intrinsic dimensions with k samples nearest to the local intrinsic dimensions, the k past environment states of the agent in the current round can be selected to represent and select the state s t+1 The most recent k samples, i.e. the sequence of environmental states S B =[s t-k+1 ,s t-k+2 ,...,s t-1 ,s t ]Thereby obtaining information about the selected state s t+1 Local eigen-dimension estimation of, i.e.
Figure BDA0003642229950000131
Figure BDA0003642229950000132
Where k is a hyperparameter.
In formula (5), $r_i(s_{t+1}, S_B)$ represents the distance from the i-th nearest of the k environmental states in $S_B$ closest to the selected state $s_{t+1}$ to $s_{t+1}$; $r_{\max}(s_{t+1}, S_B)$ represents the distance from the environmental state farthest from $s_{t+1}$ among those k environmental states to $s_{t+1}$, where k and i are natural numbers and i ≤ k.
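To make the within-round computation concrete, the sketch below (illustrative only; the buffer handling, function names, and the zero reward returned before the buffer fills are assumptions) maintains the k most recent in-round environmental states as S_B and evaluates formula (5) for each new state.

```python
import numpy as np
from collections import deque

K = 32                                    # the hyperparameter k of formula (5)
S_B = deque(maxlen=K)                     # k most recent in-round environmental states

def intrinsic_reward(s_next, eps=1e-12):
    """Return the formula-(5) estimate for the new state s_{t+1}, then add it to S_B."""
    s_next = np.asarray(s_next, dtype=float)
    if len(S_B) < K:                      # not enough in-round history yet
        reward = 0.0
    else:
        dists = np.sort(np.linalg.norm(np.asarray(S_B) - s_next, axis=1))
        mean_log = np.log(np.maximum(dists, eps) / max(dists[-1], eps)).mean()
        reward = -1.0 / mean_log if mean_log < 0 else 0.0
    S_B.append(s_next)
    return reward

def end_of_round():
    """Clear the buffer so the estimate stays a within-round quantity."""
    S_B.clear()
```

Keeping states from earlier rounds in the buffer instead of clearing it would correspond to the inter-round variant mentioned below.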
In an optional implementation of this embodiment, formula (5) is used as the characterization formula for estimating the local intrinsic dimension of the selected state, and the local intrinsic dimension obtained by calculating the characterization formula can be directly used as the intrinsic reward corresponding to the selected state, providing reliable guidance for the agent's actions.
Optionally, the environmental states in the environmental state sequence $S_B$ may also be environmental states of the agent between rounds.
When the intrinsic reward is calculated in a high-dimensional state space, it is not appropriate to estimate it directly from the environmental states; some method is needed to transform the features of the original environmental state input. In another optional implementation of this embodiment, the characterization formula may be:

$$\widehat{\mathrm{LID}}(s_{t+1}) = -\left(\frac{1}{k}\sum_{i=1}^{k}\log\frac{r_i\big(\psi(s_{t+1}), \psi(S_B)\big)}{r_{\max}\big(\psi(s_{t+1}), \psi(S_B)\big)}\right)^{-1} \qquad (6)$$

In formula (6), $s_{t+1}$ is the selected state; $S_B = [s_{t-k+1}, s_{t-k+2}, \ldots, s_{t-1}, s_t]$ is the environmental state sequence formed by the k (k > 0) environmental states in sequence before the selected state; $\psi(s_{t+1})$ is the selected feature of the selected state; $r_i(\psi(s_{t+1}), \psi(S_B))$ represents the distance from the i-th nearest of the k environmental features in $\psi(S_B)$ closest to the selected feature to the selected feature, an environmental feature being the feature of an environmental state; $r_{\max}(\psi(s_{t+1}), \psi(S_B))$ represents the distance from the environmental feature farthest from the selected feature among those k environmental features to the selected feature; and k and i are natural numbers with i ≤ k.

According to the selected state $s_{t+1}$ and the k samples nearest to it, the local intrinsic dimension is estimated again, where the k past states of the agent in the current round or rounds may be selected to represent the k samples nearest to $s_{t+1}$, i.e., $S_B = [s_{t-k+1}, s_{t-k+2}, \ldots, s_{t-1}, s_t]$, thereby obtaining the local intrinsic dimension estimate of the selected state $s_{t+1}$. The difference between formula (6) and formula (5) is that the selected state $s_{t+1}$ and $S_B = [s_{t-k+1}, s_{t-k+2}, \ldots, s_{t-1}, s_t]$ become $\psi(s_{t+1})$ and $\psi(S_B) = [\psi(s_{t-k+1}), \psi(s_{t-k+2}), \ldots, \psi(s_{t-1}), \psi(s_t)]$, respectively.
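A minimal sketch of the feature-space variant of formula (6), assuming a feature transformation `psi` (for instance, the feature transformation model T described below) that maps a raw state to a feature vector; all names are illustrative.

```python
import numpy as np

def lid_reward_in_feature_space(s_next, S_B, psi, k, eps=1e-12):
    """Formula (6): estimate the local intrinsic dimension in the feature space psi(.)."""
    f_next = np.asarray(psi(s_next), dtype=float)
    f_B = np.asarray([psi(s) for s in S_B], dtype=float)
    dists = np.sort(np.linalg.norm(f_B - f_next, axis=1))[:k]
    mean_log = np.log(np.maximum(dists, eps) / max(dists[-1], eps)).mean()
    return -1.0 / mean_log if mean_log < 0 else 0.0
```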
In this optional implementation, the selected characteristics and the environmental characteristics may be directly obtained through characteristic engineering.
Optionally, the feature extraction may be directly performed on the selected state and the environmental states in the environmental state sequence through a pre-trained feature transformation model, so as to obtain the selected feature and the environmental features of each environmental state in the environmental state sequence.
In an optional implementation of this embodiment, formula (6) is used as the characterization formula for estimating the local intrinsic dimension of the selected state, and the local intrinsic dimension obtained by calculating the characterization formula can be directly used as the intrinsic reward corresponding to the selected state, providing a reliable way to guide the agent's behavior in the case of a high-dimensional environmental state space.
In another optional implementation manner of this embodiment, the selected features and the environmental features are obtained by a trained feature transformation model, and the feature transformation model is obtained by training through the following steps:
performing characteristic transformation on the selected state and an environmental state adjacent to the selected state in the environmental state sequence by adopting a characteristic transformation network to obtain a selected characteristic and an adjacent characteristic; sending the environmental state adjacent to the selected state to the intelligent agent to obtain the behavior state output by the intelligent agent; obtaining a predicted value of the selected feature based on the behavior state and the adjacent feature; obtaining a predicted value of the behavior state based on the adjacent features and the selected features; and adjusting parameters of the feature transformation network based on the predicted value of the selected feature, the predicted value of the behavior state and the behavior state to obtain a feature transformation model.
Specifically, as shown in FIG. 3, for an environmental state $s_t$ and the selected state $s_{t+1}$, the feature transformation network T is used to perform feature transformation, obtaining the adjacent feature $\psi(s_t)$ and the selected feature $\psi(s_{t+1})$ respectively. A forward network F and a reverse network I are defined at the same time. The forward network F makes a prediction from the adjacent feature $\psi(s_t)$ and the behavior state $a_t$, obtaining the predicted value $\hat{\psi}(s_{t+1})$ of the selected feature $\psi(s_{t+1})$, and the forward loss $L_{fw}$ of the forward network is defined over the predicted value $\hat{\psi}(s_{t+1})$ and the selected feature $\psi(s_{t+1})$. The reverse network makes a prediction from the adjacent feature $\psi(s_t)$ and the selected feature $\psi(s_{t+1})$, obtaining the predicted value $\hat{a}_t$ of the corresponding behavior state $a_t$, and the reverse loss $L_{inv}$ of the reverse network is defined over the predicted value $\hat{a}_t$ and the behavior state $a_t$. Based on the predicted value of the selected feature, the predicted value of the behavior state, and the behavior state, the forward loss $L_{fw}$ and the reverse loss $L_{inv}$ are calculated, and the parameters of the feature transformation network are adjusted until the forward loss $L_{fw}$ and the reverse loss $L_{inv}$ reach a balance; at that point the parameters of the current feature transformation network are no longer changed, and the current feature transformation network T is the required feature transformation model. The feature transformation model obtained in this optional implementation can effectively prevent the parts of the environment that are irrelevant to the agent and the task from influencing the agent.
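The following PyTorch sketch outlines one way to train the networks of FIG. 3. It is an illustrative reading of the description above, not the patent's implementation: the layer sizes, the squared-error form of the forward loss, the cross-entropy form of the reverse loss for discrete behavior states, and the equal weighting of the two losses are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

STATE_DIM, FEAT_DIM, N_ACTIONS = 16, 32, 4        # illustrative sizes

# Feature transformation network T: s -> psi(s)
transform_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(),
                              nn.Linear(64, FEAT_DIM))
# Forward network F: (psi(s_t), a_t) -> predicted psi(s_{t+1})
forward_net = nn.Sequential(nn.Linear(FEAT_DIM + N_ACTIONS, 64), nn.ReLU(),
                            nn.Linear(64, FEAT_DIM))
# Reverse network I: (psi(s_t), psi(s_{t+1})) -> predicted a_t
reverse_net = nn.Sequential(nn.Linear(2 * FEAT_DIM, 64), nn.ReLU(),
                            nn.Linear(64, N_ACTIONS))

optimizer = torch.optim.Adam(list(transform_net.parameters())
                             + list(forward_net.parameters())
                             + list(reverse_net.parameters()), lr=1e-3)

def train_step(s_t, a_t, s_next):
    """One update on a batch of transitions (s_t, a_t, s_{t+1}); a_t holds action indices."""
    feat_t, feat_next = transform_net(s_t), transform_net(s_next)
    a_onehot = F.one_hot(a_t, N_ACTIONS).float()

    pred_feat_next = forward_net(torch.cat([feat_t, a_onehot], dim=-1))
    loss_fw = F.mse_loss(pred_feat_next, feat_next)          # forward loss L_fw

    pred_action_logits = reverse_net(torch.cat([feat_t, feat_next], dim=-1))
    loss_inv = F.cross_entropy(pred_action_logits, a_t)      # reverse loss L_inv

    # The patent trains until L_fw and L_inv balance; minimizing an equally
    # weighted sum is used here as an illustrative stand-in.
    loss = 0.5 * loss_fw + 0.5 * loss_inv
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss_fw.item(), loss_inv.item()
```

After training, `transform_net` plays the role of ψ in formula (6).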
The data processing method provided by this embodiment can be used in all reinforcement learning scenarios, such as games, human-computer interaction, and virtual control. The environment where the agent is located includes a virtual environment, and the environmental state includes the running state of the virtual environment; the agent determines the running strategy of the current game based on the running state of the virtual environment and the intrinsic reward.
The data processing method provided by this embodiment may be applied to an automatic driving environment. In some optional implementations of this embodiment, the agent comprises an autonomous vehicle, and the environmental state comprises an operating state of the autonomous vehicle, where the operating state of the autonomous vehicle includes, for example, vehicle speed, fuel consumption, and vehicle temperature. The data processing method further comprises: determining a behavioral state of the autonomous vehicle based on the selected state and the intrinsic reward, where the behavioral state of the autonomous vehicle includes: turning left, turning right, going straight, stopping, and the like.
Optionally, the environmental state may further include in-vehicle and out-of-vehicle states of the autonomous vehicle. The in-vehicle states include, for example, the number of passengers and the load weight; the out-of-vehicle states may include congested roads, highways, rainy weather, and the like.
In this optional implementation, the data processing method is applied to an automatic driving scenario, providing the agent in the autonomous vehicle with a reliable means of obtaining intrinsic rewards and ensuring the reliability of the control of the autonomous vehicle.
The data processing method provided by this embodiment gives a lower local intrinsic dimension estimate (lower intrinsic reward) for state spaces that lack variation compared to the surrounding state spaces, and gives a higher dimension estimate (greater intrinsic reward) for state spaces that vary more significantly and more particularly compared to the surrounding state spaces. In this way, the agent is encouraged to try to explore a more particular state space than the surrounding state space. And the estimated dimensions give some explanation of the behavior of the agent.
According to the data processing method provided by the embodiments of the present disclosure, first, the environmental state of the environment where the agent is located at the current moment is taken as the selected state; second, a set number of environmental states in sequence before the selected state are determined; third, a characterization formula for estimating the local intrinsic dimension of the selected state is determined based on the selected state and the set number of environmental states, the local intrinsic dimension being a mathematical quantity for measuring the dimension of a state space; finally, based on the characterization formula, an intrinsic reward for directing the agent's actions is calculated. Therefore, when the agent interacts with its environment, the dimension of the state space it currently occupies is estimated, and the intrinsic reward of the agent is determined based on the estimated dimension, encouraging the agent to explore higher-dimensional state spaces as much as possible, achieving better exploration of the environment by the agent, and improving the efficiency of reinforcement learning.
Because a reinforcement learning environment is usually round-based — similar tasks are completed within a round, and the agent terminates its actions when the round ends — in another embodiment of the present disclosure, a pseudo-count method can be combined to sample both the environmental states within a round and the environmental states between rounds, perform local intrinsic dimension estimation on the selected state, and obtain the intrinsic reward. For different tasks this takes into account not only inter-round exploration but also within-round exploration: within-round exploration brings better generalization performance, while inter-round exploration can be used to over-fit the environment, so better performance can be achieved.
In a continuous space, directly counting state-action pairs is infeasible; a pseudo-count-based exploration algorithm evaluates the frequency of occurrence of states by designing a density model, so that the calculated pseudo count value can replace the real count.
In another embodiment of the present disclosure, the calculating the intrinsic reward for directing the action of the agent based on the characterization formula includes: estimating the access times of the intelligent agent to access the selected state to obtain a pseudo count value of the selected state; calculating a local intrinsic dimension estimated value of the selected state based on the pseudo count value and the representation formula; the local intrinsic dimension estimate is used as an intrinsic reward for directing the agent's actions.
In this optional implementation, the number of visits to the selected state may be estimated within a reinforcement learning round of the same task or across reinforcement learning rounds of a plurality of different tasks. Estimating the number of times the agent has visited the selected state refers to estimating how many times the agent has experienced the selected state; the number of times the selected state has been experienced in the sample set can be determined from this visit-count estimate.
In this optional implementation, estimating the number of times the agent has visited the selected state to obtain the pseudo count value of the selected state includes: for the current reinforcement learning task, determining a trained density model corresponding to the current task; and estimating, with the density model, the probability of the selected state occurring within the set time period to obtain the pseudo count value of the selected state.
In this embodiment, when the characterization formula is used to estimate the local intrinsic dimension of the selected state, in order to encourage the agent to avoid state spaces it has repeatedly experienced before, the predicted value for the state space consisting only of the selected state needs to be zero; that is, when the characterization formula is calculated, the state space corresponding to each copy of the selected state needs to be discarded.
The method for calculating the intrinsic reward that guides the agent's actions provided by this embodiment of the present disclosure estimates the number of times the agent has visited the selected state to obtain a pseudo count value, calculates the local intrinsic dimension estimate based on the pseudo count value and the characterization formula, and uses the local intrinsic dimension estimate as the intrinsic reward that guides the agent's actions; this can reduce the amount of computation needed to calculate the intrinsic reward and improve the efficiency of obtaining it.
In some optional implementation manners of this embodiment, the estimating the number of times that the agent accesses the selection state to obtain the pseudo count value of the selection state includes:
estimating the probability of the occurrence of the selected state in the set time period by adopting a trained first density model to obtain a first density value, wherein the first density model is obtained by training based on a set number of environmental states; estimating the probability of the occurrence of the selected state in the set time period by adopting a trained second density model to obtain a second density value, wherein the second density model is obtained by training based on the selected state and the set number of environmental states; and obtaining a pseudo count value of the selected state based on the first density value and the second density value.
In this optional implementation manner, the obtaining the pseudo count value of the selected state based on the first density value and the second density value includes: and calculating the average value of the first density value and the second density value, and taking the average value as a pseudo count value of the selection state.
Optionally, the obtaining the pseudo count value of the selected state based on the first density value and the second density value includes: and taking the maximum value of the first density value and the second density value as a pseudo count value of the selection state.
The method for obtaining the pseudo count value of the selected state provided by the optional implementation mode estimates the probability of the selected state in the set time period by adopting the first density model and the second density model respectively, obtains the pseudo count value of the selected state based on the first density value and the second density value, and improves the reliability of obtaining the pseudo count value.
In some optional implementations of this embodiment, the obtaining the pseudo count value of the selected state based on the first density value and the second density value includes: calculating information gain of the first density value and the second density value; and calculating to obtain a pseudo count value of the selected state based on the information gain.
In this alternative implementation, the information gain is asymmetric and is used to measure the difference between the two probability distributions given by the first density value and the second density value; it describes the difference between encoding with the first density value and encoding with the second density value.
In this alternative implementation, the information gain is $PG_t(s_{t+1}) = \log \rho'_t(s_{t+1}) - \log \rho_t(s_{t+1})$, where the first density value is $\rho_t(s_{t+1})$ and the second density value is $\rho'_t(s_{t+1})$.
In this alternative implementation, the pseudo count value is
$$\hat{N}_t(s_{t+1}) = \frac{\rho_t(s_{t+1})\bigl(1-\rho'_t(s_{t+1})\bigr)}{\rho'_t(s_{t+1})-\rho_t(s_{t+1})} \approx \bigl(e^{PG_t(s_{t+1})}-1\bigr)^{-1},$$
where $PG_t(s_{t+1})$ is the information gain defined above, thereby obtaining the pseudo count value.
In this optional implementation, the pseudo count value is obtained by calculating the difference between the probability distributions given by the first density value and the second density value, which ensures the reliability of the obtained pseudo count value.
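A minimal sketch of the information-gain route, assuming the first and second density values are the probabilities $\rho_t(s_{t+1})$ and $\rho'_t(s_{t+1})$ defined above; the closed-form conversion from the gain to a pseudo count is the standard pseudo-count relation and is offered here as a plausible reading of the formulas, not a verbatim reproduction of them.

```python
import math


def information_gain(rho_first: float, rho_second: float) -> float:
    """PG_t(s_{t+1}) = log rho'_t(s_{t+1}) - log rho_t(s_{t+1})."""
    return math.log(rho_second) - math.log(rho_first)


def pseudo_count_from_gain(rho_first: float, rho_second: float) -> float:
    """Pseudo count derived from the information gain, roughly 1 / (exp(PG) - 1)."""
    pg = information_gain(rho_first, rho_second)
    # The second density model has seen the selected state once more, so the gain
    # is positive; clamp it away from zero to avoid division-by-zero issues.
    return 1.0 / max(math.expm1(pg), 1e-12)
```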
In some optional implementations of this embodiment, calculating the local intrinsic dimension estimate of the selected state based on the pseudo count value and the characterization formula includes: splitting the environment states corresponding to the characterization formula into a pseudo-count-value number of selected states and a difference-value number of environment states, where the difference value is the difference between the set number and the pseudo count value; splitting the characterization formula into an expression comprising a first expression and a second expression, where the first expression corresponds to the pseudo-count-value number of selected states and the second expression corresponds to the difference-value number of environment states; and calculating the second expression based on the difference-value number of environment states to obtain the local intrinsic dimension estimate of the selected state.
In this optional implementation, the characterization formula is split, based on the pseudo count value, into an expression comprising a first expression and a second expression, where the first expression corresponds to the copies of the selected state and the second expression corresponds to the remaining environment states. The numerator of the characterization formula is the distance from an environment state to the selected state; in the first expression, which is evaluated over copies of the selected state, that distance is zero, so the value obtained from the first expression is necessarily zero.
In this alternative implementation, for the selected state $s_{t+1}$, the $k$ environment states used by the characterization formula change from the original sequence $S_k=[s_{t-k+1},s_{t-k+2},\ldots,s_{t-1},s_t]$ into $\hat{N}_t(s_{t+1})$ copies of the selected state $s_{t+1}$ together with $k-\hat{N}_t(s_{t+1})$ environment states in sequence; that is, the original $k$ environment states in sequence before the selected state $s_{t+1}$ become the $k-\hat{N}_t(s_{t+1})$ most recent environment states in sequence together with the $\hat{N}_t(s_{t+1})$ copies of the selected state $s_{t+1}$ given by the density-model estimate. Thus, when calculating the local intrinsic dimension of the selected state $s_{t+1}$, i.e. the intrinsic reward, the characterization formula becomes (writing the modified state sequence briefly as $\hat{S}_B$ and $\hat{N}_t(s_{t+1})$ briefly as $\hat{N}$):
$$\widehat{\mathrm{LID}}(s_{t+1})=-\left(\frac{1}{k}\left[\sum_{i=1}^{\hat{N}}\log\frac{r_i\bigl(s_{t+1},\hat{S}_B\bigr)}{r_{\max}\bigl(s_{t+1},\hat{S}_B\bigr)}+\sum_{i=\hat{N}+1}^{k}\log\frac{r_i\bigl(s_{t+1},\hat{S}_B\bigr)}{r_{\max}\bigl(s_{t+1},\hat{S}_B\bigr)}\right]\right)^{-1}.$$
Since the first sum runs only over the $\hat{N}$ copies of the sample $s_{t+1}$, whose distance to $s_{t+1}$ is zero, it is known that the first expression contributes zero; as a result,
$$\widehat{\mathrm{LID}}(s_{t+1})=-\left(\frac{1}{k}\sum_{i=\hat{N}+1}^{k}\log\frac{r_i\bigl(s_{t+1},\hat{S}_B\bigr)}{r_{\max}\bigl(s_{t+1},\hat{S}_B\bigr)}\right)^{-1}.$$
In this alternative implementation, the first expression may be $\sum_{i=1}^{\hat{N}}\log\frac{r_i(s_{t+1},\hat{S}_B)}{r_{\max}(s_{t+1},\hat{S}_B)}$ and the second expression may be $\sum_{i=\hat{N}+1}^{k}\log\frac{r_i(s_{t+1},\hat{S}_B)}{r_{\max}(s_{t+1},\hat{S}_B)}$, where $k$ is a hyperparameter.
In this optional implementation, the split characterization formula encourages the agent to avoid state space it has repeatedly experienced while still referring to the local intrinsic dimension of the selected state: environment states the agent has experienced repeatedly yield a smaller intrinsic reward when the local intrinsic dimension is calculated, while environment states experienced less often yield a relatively large intrinsic reward. This embodiment therefore takes into account both novelty within a reinforcement learning round and novelty across rounds, makes better use of the underlying mathematical concepts, and provides better interpretability for the agent's intrinsic reward.
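To make the split concrete, the sketch below computes the intrinsic reward from the $k$ environment states in sequence before the selected state and a pseudo count value. The Euclidean distance, the rounding of the pseudo count to an integer, and the numerical guards are illustrative assumptions; only the $k-\hat{N}$ environment-state terms (the second expression) enter the sum, while the discarded copy terms contribute zero.

```python
import numpy as np


def intrinsic_reward_lid(selected_state: np.ndarray,
                         past_states: np.ndarray,
                         pseudo_count: float) -> float:
    """Split local intrinsic dimension estimate of the selected state.

    `past_states` holds the k environment states in sequence before the selected
    state (shape [k, state_dim], oldest first). The oldest n_hat of them are
    treated as copies of the selected state and their terms are discarded.
    """
    k = past_states.shape[0]
    n_hat = int(np.clip(round(pseudo_count), 0, k - 1))
    recent = past_states[n_hat:]                       # the k - n_hat most recent states
    distances = np.linalg.norm(recent - selected_state, axis=1)
    r_max = distances.max()
    if r_max <= 0.0:
        return 0.0                                     # all remaining states coincide with s_{t+1}
    log_ratios = np.log(np.maximum(distances, 1e-12) / r_max)
    mean_log = log_ratios.sum() / k                    # still normalised by k, as in the formula
    return 0.0 if mean_log == 0.0 else -1.0 / mean_log
```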
In a practical example, the bottom end of a stick is attached to a black object below it, and the agent keeps the stick above as upright as possible by moving the object left and right. The longer the stick remains upright in a round, the more external real reward is obtained. As shown in fig. 4(a) and fig. 4(b), fig. 4(a) uses only the external real reward, while fig. 4(b) uses only the intrinsic reward generated by the local intrinsic dimension estimate. In both images the abscissa is the number of rounds in the training process and the ordinate is the true cumulative reward per round. It can be seen from the images that for the result in fig. 4(a), the agent goes through a relatively long early phase with almost no reward and only grasps an effective strategy in the later phase, whereas for the result in fig. 4(b), guided by the intrinsic reward, the agent forms meaningful strategies earlier and achieves a higher cumulative round reward in the earlier phase.
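For context, here is a minimal interaction-loop sketch in the spirit of fig. 4(b), where the environment's real reward is only logged and the agent learns solely from the LID-based intrinsic reward. `env`, `agent`, `feature_fn`, `pseudo_count_fn`, and `lid_reward_fn` are hypothetical stand-ins (the environment is assumed to follow a classic reset/step interface); this is not the experimental code behind the figure.

```python
from collections import deque

import numpy as np

K = 20  # assumed hyperparameter: number of previous environment states kept in sequence


def run_round(env, agent, feature_fn, pseudo_count_fn, lid_reward_fn):
    """One training round driven purely by the intrinsic (LID-based) reward."""
    state = env.reset()
    history = deque(maxlen=K)          # the k environment states before the selected state
    done, cumulative_external = False, 0.0
    while not done:
        action = agent.act(feature_fn(state))
        next_state, external_reward, done, _ = env.step(action)
        cumulative_external += external_reward         # plotted in fig. 4, never used for learning
        history.append(state)                          # history now ends with s_t
        if len(history) == K:
            n_hat = pseudo_count_fn(next_state, list(history))
            intrinsic = lid_reward_fn(np.asarray(feature_fn(next_state)),
                                      np.stack([feature_fn(s) for s in history]),
                                      n_hat)
            agent.learn(state, action, intrinsic, next_state, done)
        state = next_state
    return cumulative_external
```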
With further reference to fig. 5, as an implementation of the method shown in the above figures, the present disclosure provides an embodiment of a data processing apparatus, which corresponds to the embodiment of the method shown in fig. 2, and which may be applied in various electronic devices in particular.
As shown in fig. 5, an embodiment of the present disclosure provides a data processing apparatus 500, the apparatus 500 including: the device comprises a selecting unit 501, a determining unit 502, a dimension representing unit 503 and a calculating unit 504. The selecting unit 501 may be configured to use an environment state of an environment where the agent is located at the current time as the selecting state. The determining unit 502 may be configured to determine a set number of environmental states in sequence before the selected state. The dimension characterization unit 503 may be configured to determine a characterization formula for estimating a local intrinsic dimension of the selected state based on the selected state and the set number of environmental states, where the local intrinsic dimension is a mathematical quantity for measuring a dimension of the state space. The computing unit 504 may be configured to compute an intrinsic reward for directing the agent to act based on the characterization formula.
In this embodiment, in the data processing apparatus 500, the specific processing of the selecting unit 501, the determining unit 502, the dimension characterizing unit 503 and the calculating unit 504 and the technical effects thereof can refer to step 201, step 202, step 203 and step 204 in the corresponding embodiment of fig. 2, respectively.
In some embodiments, the dimension characterization unit 503 includes: an estimation subunit (not shown), a calculation subunit (not shown), and a further subunit (not shown). The estimation subunit may be configured to estimate the number of times that the agent has visited the selected state, so as to obtain a pseudo count value of the selected state. The calculation subunit may be configured to calculate the local intrinsic dimension estimate of the selected state based on the pseudo count value and the characterization formula. The further subunit may be configured to use the local intrinsic dimension estimate as the intrinsic reward for directing the agent's actions.
In some embodiments, the estimating subunit includes: a first estimation module (not shown), a second estimation module (not shown), and a derivation module (not shown). The first estimation module may be configured to estimate, by using a trained first density model, a probability of occurrence of a selected state in a set time period to obtain a first density value, where the first density model is obtained by training based on a set number of environmental states. The second estimation module may be configured to estimate, using a trained second density model, a probability of occurrence of the selected state in the set time period to obtain a second density value, where the second density model is trained based on the selected state and the set number of environmental states. The obtaining module may be configured to obtain a pseudo count value of the selected state based on the first density value and the second density value.
In some embodiments, the obtaining module includes: a gain sub-module (not shown), and a calculation sub-module (not shown). Wherein the gain sub-module may be configured to calculate an information gain of the first density value and the second density value. The calculating sub-module may be configured to calculate a pseudo-count value of the selected state based on the information gain.
In some embodiments, the calculation subunit includes: a state splitting module (not shown), a formula splitting module (not shown), and a calculating module (not shown). The state splitting module may be configured to split the environment states corresponding to the characterization formula into a pseudo-count-value number of selected states and a difference-value number of environment states, where the difference value is the difference between the set number and the pseudo count value. The formula splitting module may be configured to split the characterization formula into an expression comprising a first expression and a second expression, where the first expression corresponds to the pseudo-count-value number of selected states and the second expression corresponds to the difference-value number of environment states. The calculating module may be configured to calculate the second expression based on the difference-value number of environment states to obtain the local intrinsic dimension estimate of the selected state.
In some embodiments, the characterization formula is:
$$-\left(\frac{1}{k}\sum_{i=1}^{k}\log\frac{r_i(s_{t+1},S_B)}{r_{\max}(s_{t+1},S_B)}\right)^{-1}$$
wherein $s_{t+1}$ is the selected state, $S_B=[s_{t-k+1},s_{t-k+2},\ldots,s_{t-1},s_t]$ is the environment state sequence formed by the $k$ environment states in sequence before the selected state, $r_i(s_{t+1},S_B)$ denotes, among the $k$ environment states in the sequence $S_B$ closest to the selected state $s_{t+1}$, the distance from the $i$-th closest environment state to the selected state $s_{t+1}$, and $r_{\max}(s_{t+1},S_B)$ denotes, among those $k$ environment states, the distance from the environment state farthest from the selected state $s_{t+1}$ to the selected state $s_{t+1}$, where $k$ and $i$ are natural numbers and $i \le k$.
In some embodiments, the characterization formula is:
$$-\left(\frac{1}{k}\sum_{i=1}^{k}\log\frac{r_i\bigl(\psi(s_{t+1}),\psi(S_B)\bigr)}{r_{\max}\bigl(\psi(s_{t+1}),\psi(S_B)\bigr)}\right)^{-1}$$
wherein $s_{t+1}$ is the selected state, $S_B=[s_{t-k+1},s_{t-k+2},\ldots,s_{t-1},s_t]$ is the environment state sequence formed by the $k$ environment states in sequence before the selected state, $\psi(s_{t+1})$ is the selected feature of the selected state, $r_i(\psi(s_{t+1}),\psi(S_B))$ denotes, among the $k$ environment features in $\psi(S_B)$ closest to the selected feature, the distance from the $i$-th closest environment feature to the selected feature, where an environment feature is the feature of an environment state, and $r_{\max}(\psi(s_{t+1}),\psi(S_B))$ denotes, among those $k$ environment features, the distance from the environment feature farthest from the selected feature to the selected feature, where $k$ and $i$ are natural numbers and $i \le k$.
In some embodiments, the selected feature and the environmental features are obtained through a trained feature transformation model, and the feature transformation model is obtained by training through the following steps: performing feature transformation on the selected state and the environment state adjacent to the selected state in the environment state sequence to obtain a selected feature and an adjacent feature; sending the environment state adjacent to the selected state to the agent to obtain the behavior state output by the agent; obtaining a predicted value of the selected feature based on the behavior state and the adjacent feature; obtaining a predicted value of the behavior state based on the adjacent feature and the selected feature; and adjusting parameters of the feature transformation network based on the predicted value of the selected feature, the predicted value of the behavior state, and the behavior state to obtain the feature transformation model.
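A sketch of one way such a feature transformation model could be trained, following the forward/inverse-prediction steps listed above. The use of PyTorch, a discrete action space, the network sizes, and the equal weighting of the two losses are illustrative assumptions rather than details taken from the disclosure.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureTransform(nn.Module):
    """psi(s): maps an environment state to its feature."""
    def __init__(self, state_dim: int, feat_dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(),
                                 nn.Linear(64, feat_dim))

    def forward(self, state):
        return self.net(state)


class ForwardInverseModel(nn.Module):
    """Predicts the selected feature from (adjacent feature, action) and the
    action (behaviour) from (adjacent feature, selected feature)."""
    def __init__(self, feat_dim: int, n_actions: int):
        super().__init__()
        self.forward_head = nn.Sequential(nn.Linear(feat_dim + n_actions, 64), nn.ReLU(),
                                          nn.Linear(64, feat_dim))
        self.inverse_head = nn.Sequential(nn.Linear(2 * feat_dim, 64), nn.ReLU(),
                                          nn.Linear(64, n_actions))
        self.n_actions = n_actions

    def forward(self, adjacent_feat, selected_feat, action):
        # action: LongTensor of action indices, shape [batch]
        action_onehot = F.one_hot(action, self.n_actions).float()
        pred_selected = self.forward_head(torch.cat([adjacent_feat, action_onehot], dim=-1))
        pred_action_logits = self.inverse_head(torch.cat([adjacent_feat, selected_feat], dim=-1))
        return pred_selected, pred_action_logits


def training_step(psi, dyn, optimizer, adjacent_state, selected_state, action):
    """One update: feature-transform both states, predict the selected feature and
    the behaviour, and adjust the feature transformation network and heads."""
    adjacent_feat = psi(adjacent_state)
    selected_feat = psi(selected_state)
    pred_selected, pred_action_logits = dyn(adjacent_feat, selected_feat, action)
    loss = F.mse_loss(pred_selected, selected_feat.detach()) \
         + F.cross_entropy(pred_action_logits, action)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

A joint optimizer over both modules, for example `torch.optim.Adam(list(psi.parameters()) + list(dyn.parameters()), lr=1e-3)`, would be passed as `optimizer`; detaching the target feature in the forward loss is a design choice borrowed from common curiosity-module implementations, not something the disclosure prescribes.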
In some embodiments, the agent comprises an autonomous vehicle, and the environmental state comprises: an operating state of the autonomous vehicle; the apparatus is further configured to determine a behavioral state of the autonomous vehicle based on the selected state and the intrinsic reward.
In the data processing apparatus provided by this embodiment of the disclosure, first, the selection unit 501 takes the environment state of the environment where the agent is located at the current moment as the selected state; next, the determining unit 502 determines a set number of environment states in sequence before the selected state; then, the dimension representation unit 503 determines, based on the selected state and the set number of environment states, a characterization formula for estimating the local intrinsic dimension of the selected state, where the local intrinsic dimension is a mathematical quantity that measures the dimension of the state space; finally, the calculation unit 504 calculates the intrinsic reward that guides the agent's actions based on the characterization formula. In this way, while the agent interacts with the environment, the dimension of the state space it currently occupies is estimated, and the agent's intrinsic reward is determined from the estimated dimension, encouraging the agent to explore higher-dimensional state space as much as possible. This helps the agent explore the environment better and improves the efficiency of reinforcement learning.
Referring now to FIG. 6, shown is a schematic diagram of an electronic device 600 suitable for use in implementing embodiments of the present disclosure.
As shown in fig. 6, electronic device 600 may include a processing means (e.g., central processing unit, graphics processor, etc.) 601 that may perform various appropriate actions and processes in accordance with a program stored in a Read Only Memory (ROM) 602 or a program loaded from a storage means 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data necessary for the operation of the electronic apparatus 600 are also stored. The processing device 601, the ROM 602, and the RAM 603 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to the bus 604.
Generally, the following devices may be connected to the I/O interface 605: an input device 606 including, for example, a touch screen, touch pad, keyboard, mouse, etc.; an output device 607 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage 608 including, for example, tape, hard disk, etc.; and a communication device 609. The communication means 609 may allow the electronic device 600 to communicate with other devices wirelessly or by wire to exchange data. While fig. 6 illustrates an electronic device 600 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided. Each block shown in fig. 6 may represent one device or may represent multiple devices as desired.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication means 609, or may be installed from the storage means 608, or may be installed from the ROM 602. The computer program, when executed by the processing device 601, performs the above-described functions defined in the methods of embodiments of the present disclosure.
It should be noted that the computer readable medium of the embodiments of the present disclosure may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In embodiments of the disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In embodiments of the present disclosure, however, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (Radio Frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the server; or may exist separately and not be assembled into the server. The computer readable medium carries one or more programs which, when executed by the server, cause the server to: taking the environmental state of the environment where the intelligent agent is located at the current moment as a selected state; determining a set number of environmental states in sequence before selecting a state; determining a representation formula for estimating a local intrinsic dimension of the selected state based on the selected state and a set number of environmental states, wherein the local intrinsic dimension is a mathematical quantity for measuring the dimension of a state space; based on the characterization, an intrinsic reward for directing the agent's actions is calculated.
Computer program code for carrying out operations for embodiments of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C + +, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. The described units may also be provided in a processor, which may be described as: a processor comprises a selecting unit, a determining unit, a dimension representing unit and a calculating unit. Where the names of these units do not in some cases constitute a limitation on the unit itself, for example, a selection unit may also be described as a unit configured to take as a selection state an environmental state of an environment in which the agent is located at the present time.
The foregoing description is only exemplary of the preferred embodiments of the disclosure and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention in the embodiments of the present disclosure is not limited to the specific combination of the above-mentioned features, but also encompasses other technical solutions formed by any combination of the above-mentioned features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the embodiments of the present disclosure.

Claims (13)

1. A method of data processing, the method comprising:
taking the environmental state of the environment where the intelligent agent is located at the current moment as a selected state;
determining a set number of environmental states in sequence before the selected state;
determining a representation for estimating a local intrinsic dimension of the selected state based on the selected state and a set number of environmental states, the local intrinsic dimension being a mathematical quantity for measuring a dimension of a state space;
based on the characterization, an intrinsic reward is calculated that directs the agent to act.
2. The method of claim 1, wherein calculating an intrinsic reward that directs the agent to act based on the characterization formula comprises:
estimating the access times of the agent to the selected state to obtain a pseudo count value of the selected state;
calculating a local intrinsic dimension estimation value of the selected state based on the pseudo count value and the characterization formula;
using the local intrinsic dimension estimate as an intrinsic reward for directing the agent's actions.
3. The method of claim 2, wherein said estimating a number of accesses by said agent to said selected state resulting in a pseudo-count value for said selected state comprises:
estimating the probability of the selected state in the set time period by adopting a trained first density model to obtain a first density value, wherein the first density model is obtained by training based on the set number of environmental states;
estimating the probability of the selected state in the set time period by using a trained second density model to obtain a second density value, wherein the second density model is obtained by training based on the selected state and the set number of environmental states;
and obtaining a pseudo count value of the selected state based on the first density value and the second density value.
4. The method of claim 3, wherein the obtaining a pseudo count value of the selected state based on the first density value and the second density value comprises:
calculating an information gain of the first density value and the second density value;
and calculating to obtain a pseudo count value of the selected state based on the information gain.
5. The method of claim 2, wherein said calculating a local intrinsic dimension estimate for the selected state based on the pseudo-count value and the characterization equation comprises:
splitting the environment state corresponding to the characterization formula into the pseudo count value number of the selection states and a difference value number of environment states, wherein the difference value is the difference between the set number and the pseudo count value;
splitting the representation formula into an expression comprising a first representation formula and a second representation formula, wherein the first representation formula corresponds to the selected states of the pseudo count value, and the second representation formula corresponds to the difference environmental states;
and calculating the second representation based on the difference environmental states to obtain a local intrinsic dimension estimation value of the selected state.
6. The method of any one of claims 1-5, wherein the characterization formula is:
$$-\left(\frac{1}{k}\sum_{i=1}^{k}\log\frac{r_i(s_{t+1},S_B)}{r_{\max}(s_{t+1},S_B)}\right)^{-1}$$
wherein $s_{t+1}$ is the selected state, $S_B=[s_{t-k+1},s_{t-k+2},\ldots,s_{t-1},s_t]$ is the environment state sequence formed by the $k$ environment states in sequence before the selected state, $r_i(s_{t+1},S_B)$ denotes, among the $k$ environment states in the sequence $S_B$ closest to the selected state $s_{t+1}$, the distance from the $i$-th closest environment state to the selected state $s_{t+1}$; $r_{\max}(s_{t+1},S_B)$ denotes, among those $k$ environment states, the distance from the environment state farthest from the selected state $s_{t+1}$ to the selected state $s_{t+1}$, wherein $k$ and $i$ are natural numbers, and $i \le k$.
7. The method of any one of claims 1-5, wherein the characterization formula is:
$$-\left(\frac{1}{k}\sum_{i=1}^{k}\log\frac{r_i\bigl(\psi(s_{t+1}),\psi(S_B)\bigr)}{r_{\max}\bigl(\psi(s_{t+1}),\psi(S_B)\bigr)}\right)^{-1}$$
wherein $s_{t+1}$ is the selected state, $S_B=[s_{t-k+1},s_{t-k+2},\ldots,s_{t-1},s_t]$ is the environment state sequence formed by the $k$ environment states in sequence before the selected state, $\psi(s_{t+1})$ is the selected feature of the selected state, $r_i(\psi(s_{t+1}),\psi(S_B))$ denotes, among the $k$ environment features in $\psi(S_B)$ closest to the selected feature, the distance from the $i$-th closest environment feature to the selected feature, wherein an environment feature is the feature of an environment state; $r_{\max}(\psi(s_{t+1}),\psi(S_B))$ denotes, among those $k$ environment features, the distance from the environment feature farthest from the selected feature to the selected feature, wherein $k$ and $i$ are natural numbers, and $i \le k$.
8. The method of claim 7, wherein the selected features and the environmental features are obtained by training a feature transformation model, and the feature transformation model is obtained by training the following steps:
performing feature transformation on the selected state and an environment state adjacent to the selected state in the environment state sequence to obtain a selected feature and an adjacent feature;
sending the environmental state adjacent to the selected state to the intelligent agent to obtain the behavior state output by the intelligent agent;
obtaining a predicted value of the selected feature based on the behavior state and the adjacent feature;
obtaining a predicted value of the behavior state based on the adjacent feature and the selected feature;
and adjusting parameters of the feature transformation network based on the predicted value of the selected feature, the predicted value of the behavior state and the behavior state to obtain a feature transformation model.
9. The method of claim 1, wherein the agent comprises an autonomous vehicle, the environmental state comprises an operational state of the autonomous vehicle;
the method further comprises the following steps: determining a behavioral state of the autonomous vehicle based on the selected state and the intrinsic reward.
10. A data processing apparatus, the apparatus comprising:
the intelligent agent selecting unit is configured to select the environment state of the environment where the intelligent agent is located at the current moment as a selected state;
a determining unit configured to determine a set number of environmental states in order before the selected state;
a dimension characterization unit configured to determine a characterization formula for estimating a local intrinsic dimension of the selected state based on the selected state and a set number of environmental states, the local intrinsic dimension being a mathematical quantity that measures a dimension of a state space;
a computing unit configured to compute an intrinsic reward for directing the agent to act based on the characterization formula.
11. An electronic device, comprising:
one or more processors;
a storage device having one or more programs stored thereon;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any one of claims 1-9.
12. A computer-readable medium, on which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1-9.
13. A computer program product comprising a computer program which, when executed by a processor, implements the method of any one of claims 1-9.
