CN115542901A - Deformable robot obstacle avoidance method based on near-end strategy training - Google Patents

Deformable robot obstacle avoidance method based on near-end strategy training

Info

Publication number
CN115542901A
CN115542901A (application CN202211154605.7A)
Authority
CN
China
Prior art keywords
target
derivative
reward
strategy
obstacle avoidance
Prior art date
Legal status
Pending
Application number
CN202211154605.7A
Other languages
Chinese (zh)
Inventor
单光存
丁则剑
谭昊易
Current Assignee
Beihang University
Original Assignee
Beihang University
Priority date
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN202211154605.7A priority Critical patent/CN115542901A/en
Publication of CN115542901A publication Critical patent/CN115542901A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G05CONTROLLING; REGULATING
    • G05DSYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
    • G05D1/00Control of position, course or altitude of land, water, air, or space vehicles, e.g. automatic pilot
    • G05D1/02Control of position or course in two dimensions
    • G05D1/021Control of position or course in two dimensions specially adapted to land vehicles
    • G05D1/0255Control of position or course in two dimensions specially adapted to land vehicles using acoustic signals, e.g. ultra-sonic signals
    • G05D1/0212Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory
    • G05D1/0221Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
    • G05D1/0276Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle

Abstract

The invention provides a deformable robot obstacle avoidance method based on near-end strategy training, applied to an obstacle avoidance model of a mimicry deformable robot. The method for training the obstacle avoidance model comprises the following steps: for each training sample, generating a plurality of strategies according to the environment information; for each strategy, processing the trajectory with a reward and punishment function to obtain an accumulated reward; determining a desired reward based on the accumulated rewards and the corresponding probabilities; differentiating the desired reward to obtain a first derivative; updating the initial parameter according to the first derivative based on a strategy gradient algorithm to obtain a target parameter; controlling the mimicry deformable robot to walk in the environment information by using the trained obstacle avoidance model and, when the mimicry deformable robot collides with the obstacle information, iteratively training the obstacle avoidance model with other training samples to obtain new target parameters of the trained obstacle avoidance model; and determining the trained obstacle avoidance model as the target obstacle avoidance model when the mimicry deformable robot no longer collides with the obstacle information.

Description

Deformable robot obstacle avoidance method based on near-end strategy training
Technical Field
The present disclosure relates to the field of robotics, and more particularly, to a method for training an obstacle avoidance model by near-end strategy optimization based on a strategy gradient, an obstacle avoidance method and apparatus for a mimicry deformable robot, an electronic device, a computer-readable storage medium, and a computer program product.
Background
The mimicry deformable robot needs to avoid collisions with obstacles, other robots, and grasping targets during operation. When facing different environments and tasks it must perform complex nonlinear deformation to switch between different mimicry movement modes, and when the obstacle avoidance scene of the intelligent robot contains dynamic obstacles, stricter requirements are placed on the flexibility and effectiveness of the obstacle avoidance algorithm.
In the course of implementing the disclosed concept, the inventors found that there are at least the following problems in the related art: the traditional linear and nonlinear control theory based on a rigid body model can not achieve a satisfactory obstacle avoidance effect.
Disclosure of Invention
In view of this, the embodiments of the present disclosure provide a method for training an obstacle avoidance model based on a near-end strategy of a strategy gradient, an obstacle avoidance method for a mimic deformable robot, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
One aspect of the embodiments of the present disclosure provides a method for training an obstacle avoidance model based on a near-end strategy of a strategy gradient, where the obstacle avoidance model is applied to a mimic deformable robot, and the method includes:
generating a plurality of strategies from the environment information for each training sample of the set of training samples for each walking phase, wherein the training samples comprise environment information including obstacle information acquired with the mimicry deformable robot, each strategy comprises a trajectory of the mimicry deformable robot walking in the external environment, the trajectory comprising a plurality of discrete actions and a state corresponding to each action, the actions being generated with a strategy function comprising initial parameters, wherein the actions comprise at least one of: snake-shaped deformation action, spherical deformation action and square or rectangular deformation action;
for each strategy, processing the track by utilizing a reward and punishment function to obtain the accumulated reward of the track, wherein the reward and punishment function is determined according to the reward corresponding to the initial parameter, and the reward and punishment functions used in walking stages of different mimicry states differ, as do the strategy functions;
determining expected rewards of the strategies according to a plurality of accumulated rewards corresponding to a plurality of strategies and the probability corresponding to each track, wherein the probability represents the probability of selecting the track of the current strategy from the tracks corresponding to the plurality of strategies according to the current state of the mimicry deformable robot;
carrying out derivation processing on the expected reward to obtain a first derivative;
updating the initial parameter according to the first derivative based on a strategy gradient algorithm to obtain a target parameter of the trained obstacle avoidance model;
controlling the mimicry deformable robot to walk in the environment information of the walking stage by using the trained obstacle avoidance model, and iteratively using other training samples to train the obstacle avoidance model under the condition that the mimicry deformable robot collides with the obstacle information to obtain a new target parameter of the trained obstacle avoidance model;
and under the condition that the mimicry deformable robot does not collide with the obstacle information, determining the trained obstacle avoidance model as the target obstacle avoidance model in the walking stage.
Another aspect of the embodiments of the present disclosure provides an obstacle avoidance method for a mimic deformable robot, including:
acquiring, for each target walking stage, target environment information including target obstacle information collected by a plurality of ultrasonic sensors of the mimicry deformable robot;
processing the target environment information by using a trained target obstacle avoidance model, and outputting a target track of the target walking stage, wherein the target track comprises a plurality of discrete target actions walking in the target environment and a target state corresponding to each target action, and the target actions comprise at least one of the following: snake-shaped deformation action, spherical deformation action and rectangular deformation action;
the mimic deformable robot executes a walking operation in the target walking stage according to the target trajectory, wherein the walking operation can avoid collision between the mimic deformable robot and the target obstacle information.
Another aspect of the embodiments of the present disclosure provides an apparatus for training an obstacle avoidance model based on a near-end strategy of a strategy gradient, including:
a generating module, configured to generate a plurality of strategies according to the environment information for each training sample in the training sample set of each walking stage, wherein the training samples include environment information including obstacle information acquired by the mimicry deformable robot, each strategy includes a track traveled by the mimicry deformable robot in an external environment, the track includes a plurality of discrete actions and a state corresponding to each action, the actions are generated by a strategy function including initial parameters, wherein the actions include at least one of: snake-shaped deformation action, spherical deformation action and square or rectangular deformation action;
the first obtaining module is used for processing the track by utilizing a reward and punishment function for each strategy to obtain the accumulated reward of the track, wherein the reward and punishment function is determined according to the reward corresponding to the initial parameter, and the reward and punishment functions used in walking stages of different mimicry states differ, as do the strategy functions;
the first determination module is used for determining the expected reward of the strategies according to a plurality of accumulated rewards corresponding to a plurality of strategies and the probability corresponding to each track, wherein the probability represents the probability of selecting the track of the current strategy from the tracks corresponding to the strategies according to the current state by the mimicry deformable robot;
the second obtaining module is used for carrying out derivation processing on the expected reward to obtain a first derivative;
the third obtaining module is used for updating the initial parameters according to the first derivative based on a strategy gradient algorithm to obtain target parameters of the trained obstacle avoidance model;
the simulation module is used for controlling the mimicry deformable robot to walk in the environment information of the walking stage by using the trained obstacle avoidance model, and iteratively using other training samples to train the obstacle avoidance model under the condition that the mimicry deformable robot collides with the obstacle information to obtain a new target parameter of the trained obstacle avoidance model;
and the second determination module is used for determining the trained obstacle avoidance model as the target obstacle avoidance model in the walking stage under the condition that the mimicry deformable robot does not collide with the obstacle information.
Another aspect of the embodiments of the present disclosure provides an obstacle avoidance apparatus for a mimic deformable robot, including:
the acquisition module is used for acquiring, for each target walking stage, target environment information including target obstacle information collected by a plurality of ultrasonic sensors of the mimicry deformable robot, and for establishing a state space by determining the distance between the mimicry deformable robot and the obstacle;
an output module, configured to process target environment information by using a target obstacle avoidance model, and output a target track in the target walking stage, where the target track includes a plurality of discrete target actions walking in a target environment and a target state corresponding to each target action, and the target actions include at least one of: snake-shaped deformation action, spherical deformation action and square or rectangular deformation action;
and the execution module is used for executing the walking operation of the mimicry deformable robot in the target walking stage according to the target track, wherein the walking operation can avoid the collision between the mimicry deformable robot and the target obstacle information.
Another aspect of an embodiment of the present disclosure provides an electronic device including: one or more processors; memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method as described above.
Another aspect of embodiments of the present disclosure provides a computer-readable storage medium storing computer-executable instructions for implementing the method as described above when executed.
Another aspect of an embodiment of the present disclosure provides a computer program product comprising computer executable instructions for implementing the method as described above when executed.
According to the embodiment of the disclosure, the corresponding strategy is determined for the training sample, the desired reward of the corresponding trajectory is determined based on the strategy, the initial parameter is updated by using the first derivative obtained through differentiation, and the target obstacle avoidance model is obtained.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:
fig. 1 schematically illustrates an exemplary system architecture to which a method of training an obstacle avoidance model may be applied, according to an embodiment of the present disclosure;
fig. 2 schematically illustrates a flow chart of a method of training an obstacle avoidance model according to an embodiment of the present disclosure;
FIG. 3 schematically shows a schematic diagram of a near-end policy generating a plurality of policies according to an embodiment of the present disclosure;
FIG. 4 schematically shows a trajectory sequence diagram according to an embodiment of the disclosure;
FIG. 5 schematically illustrates a PPO-Clip algorithm according to an embodiment of the present disclosure;
fig. 6 schematically illustrates an obstacle avoidance principle schematic diagram of a mimic deformable robot according to an embodiment of the present disclosure;
fig. 7 schematically illustrates an obstacle avoidance scene diagram of a mimic deformable robot according to an embodiment of the present disclosure;
FIG. 8 schematically illustrates an obstacle avoidance success rate diagram for a mimic deformable robot according to an embodiment of the present disclosure;
fig. 9 schematically shows a flowchart of an obstacle avoidance method of the mimicry deformable robot according to an embodiment of the present disclosure;
fig. 10 schematically illustrates a block diagram of an apparatus for training an obstacle avoidance model according to an embodiment of the present disclosure;
fig. 11 schematically shows a block diagram of an obstacle avoidance apparatus of a mimicry deformable robot according to an embodiment of the present disclosure;
fig. 12 schematically shows a block diagram of an electronic device adapted to implement the above described method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "may include," "comprises," "including," and the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (which can include technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art unless otherwise defined. It is noted that the terms used herein should be interpreted as having a meaning that is consistent with the context of this specification and should not be interpreted in an idealized or overly formal sense.
Where a convention analogous to "at least one of A, B, and C, etc." is used, in general such a construction is intended in the sense one having skill in the art would understand the convention (e.g., "a system having at least one of A, B, and C" would include but not be limited to systems that have A alone, B alone, C alone, A and B together, A and C together, B and C together, and/or A, B, and C together, etc.).
Embodiments of the present disclosure provide a method for training an obstacle avoidance model based on a near-end strategy of a strategy gradient, an obstacle avoidance method for a mimic deformable robot, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product. The method for training the obstacle avoidance model can comprise the following steps: for each training sample, the strategy function generates a strategy comprising a plurality of actions according to the environment information; aiming at each strategy, processing the track by utilizing a reward and punishment function to obtain accumulated rewards; determining an expected reward according to the plurality of accumulated rewards and the probability; carrying out derivation processing on the expected reward to obtain a first derivative; updating the initial parameter according to the first derivative based on a strategy gradient algorithm to obtain a target parameter; controlling the mimicry deformable robot to walk in the environment information by using the trained obstacle avoidance model, and iteratively using other training samples to train the obstacle avoidance model under the condition that the mimicry deformable robot collides with the obstacle information to obtain a new target parameter of the trained obstacle avoidance model; and determining the trained obstacle avoidance model as a target obstacle avoidance model under the condition that the mimicry deformable robot does not collide with the obstacle information.
Fig. 1 schematically illustrates an exemplary system architecture 100 to which the method of training an obstacle avoidance model may be applied, according to an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present disclosure may be applied to help those skilled in the art understand the technical content of the present disclosure, and does not mean that the embodiments of the present disclosure may not be applied to other devices, systems, environments or scenarios.
As shown in fig. 1, a system architecture 100 according to this embodiment may include a mimic deformable robot 101, 102, 103, a network 104, and a server 105. Network 104 is the medium used to provide a communication link between the mimicry deformable robots 101, 102, 103 and server 105. Network 104 may include various connection types, such as wired and/or wireless communication links, and so forth.
The user can interact with the server 105 over the network 104 using the mimicry deformable robots 101, 102, 103 to receive or send messages and the like. Various communication client applications, such as training applications and environment information processing applications, can be installed on the mimicry deformable robots 101, 102, 103.

The mimicry deformable robots 101, 102, 103 may be various robots that have robot arms and support walking and deformation.

The server 105 may be a server providing various services, for example a background management server (for example only) that provides walking trajectories for the environment in which the user operates the mimicry deformable robots 101, 102, 103. The background management server can analyze the received data, such as the environment information, and feed back a processing result (for example, a target obstacle avoidance model generated according to the environment information) to the mimicry deformable robot.
It should be noted that the method for training an obstacle avoidance model based on a near-end strategy of a strategy gradient provided by the embodiment of the present disclosure may be generally performed by the server 105. Accordingly, the apparatus for training the obstacle avoidance model based on the near-end strategy of the strategy gradient provided by the embodiment of the present disclosure may be generally disposed in the server 105. The method for training obstacle avoidance models based on the near-end strategy of the strategy gradient provided by the embodiment of the present disclosure can also be executed by a server or a server cluster which is different from the server 105 and can communicate with the mimic deformable robots 101, 102, 103 and/or the server 105. Accordingly, the apparatus for training the obstacle avoidance model based on the strategy gradient provided by the embodiment of the present disclosure may also be disposed in a server or a server cluster different from the server 105 and capable of communicating with the mimicry deformable robots 101, 102, 103 and/or the server 105. Alternatively, the method for training the obstacle avoidance model based on the strategy gradient of the near-end strategy provided by the embodiment of the disclosure can be executed by the mimic deformable robot 101, 102 or 103, or can be executed by other mimic deformable robots different from the mimic deformable robot 101, 102 or 103. Accordingly, the apparatus for training obstacle avoidance model based on strategy gradient near-end strategy provided by the embodiment of the present disclosure can also be disposed in the mimic deformable robot 101, 102, or 103, or disposed in other mimic deformable robots different from the mimic deformable robot 101, 102, or 103.
It should be understood that the number of mimicry deformable robots, networks and servers in fig. 1 is merely illustrative. There may be any number of mimicry deformable robots, networks, and servers, as desired for implementation.
Fig. 2 schematically shows a flowchart of a method of training an obstacle avoidance model according to an embodiment of the present disclosure.
As shown in fig. 2, a method for training an obstacle avoidance model based on a near-end strategy of a strategy gradient, where the obstacle avoidance model is applied to a mimic deformable robot, may include operations S201 to S207.
In operation S201, for each training sample in the training sample set of each walking phase, a plurality of strategies is generated according to the environment information, wherein the training samples may include environment information acquired by the mimic deformable robot, which may include obstacle information, each strategy may include a trajectory traveled by the mimic deformable robot in the external environment, the trajectory may include a plurality of discrete actions and a state corresponding to each action, the actions being generated by a strategy function, which may include initial parameters, wherein the actions include at least one of: serpentine deformation action, spherical deformation action, and square or rectangular deformation action.
In operation S202, for each policy, the trajectory is processed by using a reward and punishment function to obtain an accumulated reward of the trajectory, where the reward and punishment function is determined according to the reward corresponding to the initial parameter, and the reward and punishment functions used in walking stages of different mimicry states differ, as do the policy functions.
In operation S203, a desired reward for the strategy is determined according to a plurality of accumulated rewards corresponding to the plurality of strategies and a probability corresponding to each trajectory, wherein the probability represents a probability of the mimicry deformable robot selecting a trajectory of a current strategy from among the trajectories corresponding to the plurality of strategies according to a current state.
In operation S204, a derivation process is performed on the desired award to obtain a first derivative.
In operation S205, the initial parameter is updated according to the first derivative based on the policy gradient algorithm, so as to obtain a target parameter of the trained obstacle avoidance model.
In operation S206, the mimicry deformable robot is controlled to walk in the environment information of the walking stage by using the trained obstacle avoidance model, and when the mimicry deformable robot collides with the obstacle information, the obstacle avoidance model is iteratively trained by using other training samples to obtain new target parameters of the trained obstacle avoidance model.
In operation S207, the trained obstacle avoidance model is determined as the target obstacle avoidance model in the walking stage under the condition that the mimicry deformable robot does not collide with the obstacle information.
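To make the flow of operations S201 to S207 concrete, the following self-contained Python sketch mirrors the loop structure with toy stand-ins: the softmax policy, the stub dynamics, the reward values, and the collision test are illustrative assumptions and not part of the disclosed method.

```python
import numpy as np

ACTIONS = ["serpentine", "spherical", "rectangular"]     # discrete deformation actions

def rollout(theta, rng, steps=20):
    """S201: sample one trajectory (list of (state, action)) under a softmax policy."""
    state, traj = rng.normal(size=4), []
    for _ in range(steps):
        logits = theta @ state
        p = np.exp(logits - logits.max()); p /= p.sum()
        a = rng.choice(len(ACTIONS), p=p)
        traj.append((state.copy(), a))
        state = state + rng.normal(scale=0.1, size=4)    # stub dynamics, not the real robot
    return traj

def cumulative_reward(traj):
    """S202: stub reward/punishment -- reward keeping the 'distance' feature above a threshold."""
    return sum(1.0 if abs(s[0]) > 0.5 else -1.0 for s, _ in traj)

def collides(traj):
    return any(abs(s[0]) < 0.05 for s, _ in traj)        # stub collision test

def train_stage(sample_seeds, eta=0.05, iters=50):
    theta = np.random.default_rng(0).normal(size=(len(ACTIONS), 4))   # initial parameters
    for seed in sample_seeds:                            # S206: iterate over training samples
        rng = np.random.default_rng(seed)
        for _ in range(iters):
            trajs = [rollout(theta, rng) for _ in range(8)]           # several strategies
            rewards = np.array([cumulative_reward(t) for t in trajs])
            baseline = rewards.mean()
            grad = np.zeros_like(theta)
            for traj, R in zip(trajs, rewards):                       # S203-S204: desired-reward gradient
                for s, a in traj:
                    logits = theta @ s
                    p = np.exp(logits - logits.max()); p /= p.sum()
                    g = -np.outer(p, s); g[a] += s                    # d log pi(a|s) / d theta
                    grad += (R - baseline) * g
            theta += eta * grad / len(trajs)                          # S205: gradient-ascent update
            if not any(collides(t) for t in trajs):                   # S206-S207: stop when no collision
                return theta
    return theta

theta_star = train_stage(sample_seeds=[1, 2, 3])
print(theta_star.shape)
```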
According to the embodiment of the disclosure, the mimicry deformable robot is a robot whose structure can deform freely; under the control of an intelligent feedback control algorithm it realizes deformations similar to those of living creatures, and it is therefore called a mimicry deformable robot.
According to the embodiment of the present disclosure, the deformable shapes (actions) of the mimicry deformable robot, i.e., its action space, may include, but are not limited to, a serpentine deformation action, a spherical deformation action, and a square or rectangular deformation action; the mimicry deformable robot performs different discrete actions according to the acquired environment information, which includes the obstacle information, so as to avoid collision. According to embodiments of the present disclosure, a strategy may refer to a walking path in the environment. For example, when walking in an environment with one obstacle, a path passing the left side of the obstacle to the destination is one strategy and a path passing the right side of the obstacle to the destination is another strategy. A walking stage may refer to the walking path of the next step taken by the mimicry deformable robot at a given position.
According to the embodiment of the disclosure, when the obstacle avoidance model is trained, a plurality of strategies by which the mimicry deformable robot bypasses the obstacle can be generated according to the environment information in the training sample. For each strategy, the trajectory τ of the strategy is processed by the reward and punishment function to obtain the accumulated reward of the trajectory. From the accumulated rewards of the plurality of strategies and the probability p_θ(τ) corresponding to each trajectory τ, the desired reward R̄_θ of the strategy is determined, and the desired reward is differentiated to obtain the first derivative ∇R̄_θ shown in equation (3).

The trajectory τ can be written as τ = {s_1, a_1, s_2, a_2, ..., s_t, a_t}, where a_t represents the action of step t and s_t represents the state of step t. The probability p_θ(τ) corresponding to the trajectory τ can be expressed by equation (1), and the desired reward R̄_θ can be expressed by equation (2):

p_\theta(\tau) = p(s_1)\prod_{t=1}^{T} p_\theta(a_t \mid s_t)\, p(s_{t+1} \mid s_t, a_t)    (1)

\bar{R}_\theta = \sum_{\tau} R(\tau)\, p_\theta(\tau) = E_{\tau \sim p_\theta(\tau)}[R(\tau)]    (2)

\nabla \bar{R}_\theta = \sum_{\tau} R(\tau)\, \nabla p_\theta(\tau)    (3)

where T represents the total number of actions, and p and p_θ each represent the probability of selecting the current action.
According to an embodiment of the present disclosure, based on the policy gradient algorithm, the initial parameter θ is updated according to the first derivative ∇R̄_θ to obtain the target parameter θ° of the trained obstacle avoidance model. The trained obstacle avoidance model is used to control the mimicry deformable robot to walk, in simulation, in the environment information of the training sample. If the mimicry deformable robot collides with the obstacle information, the obstacle avoidance model is iteratively trained with other training samples to obtain new target parameters, until the trained obstacle avoidance model controls the mimicry deformable robot to walk in the environment information without collision, at which point the trained obstacle avoidance model is determined as the target obstacle avoidance model of the walking stage.
According to the embodiment of the disclosure, the corresponding strategy is determined for the training sample, the desired reward of the corresponding trajectory is determined based on the strategy, the initial parameter is updated by using the first derivative obtained through differentiation, and the target obstacle avoidance model is obtained.
Fig. 3 schematically shows a schematic diagram of a near-end policy generation of a plurality of policies according to an embodiment of the present disclosure. Fig. 4 schematically shows a trajectory sequence diagram according to an embodiment of the disclosure.
As shown in fig. 3 and 4, generating a plurality of policies according to the environment information may include the following operations:
A plurality of current actions is generated from the environment information using the policy function. For each current action, the current action and its state are processed by the value function to obtain the current value corresponding to the current action. The current action and the current value are then processed by the advantage function to generate the next action.
According to the embodiment of the present disclosure, a plurality of current actions a_t (i.e., A_t in FIG. 3) are generated from the environment information by the policy function (i.e., the Actor network in FIG. 3). The state s_t (i.e., S_t in FIG. 3) corresponding to the action A_t is extracted from the experience replay pool, and the current action and state are processed by the value function (i.e., the Critic network in FIG. 3) to obtain the current value V(S_t) corresponding to the current action. The advantage function A^{θ'}(S_t, A_t) then processes the current action a_t and the current value V(S_t) to generate the next action A_{t+1}, so that the trajectory sequence shown in FIG. 4 can be obtained. The experience replay pool stores at least the state corresponding to each historical action. In FIG. 4, G represents the accumulated reward, and each row of actions and states is one trajectory, that is, one strategy.
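A minimal PyTorch sketch of the Actor-Critic interaction described above. The two small networks, the three-way discrete action space, and the one-step advantage estimate r + γV(S_{t+1}) − V(S_t) are assumptions made for illustration; the patent does not specify these details.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

STATE_DIM, N_ACTIONS = 8, 3          # e.g. serpentine / spherical / rectangular deformation

actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, N_ACTIONS))
critic = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, 1))

def step(state, next_state, reward, gamma=0.99):
    """One Actor-Critic step: sample A_t, value S_t, estimate the advantage, pick A_{t+1}."""
    dist_t = Categorical(logits=actor(state))           # policy function pi_theta(a | s)
    a_t = dist_t.sample()                                # current action A_t
    v_t = critic(state).squeeze(-1)                      # current value V(S_t)
    adv = reward + gamma * critic(next_state).squeeze(-1).detach() - v_t   # advantage estimate
    a_next = Categorical(logits=actor(next_state)).sample()               # next action A_{t+1}
    return a_t, v_t, adv, a_next

s_t, s_next = torch.randn(STATE_DIM), torch.randn(STATE_DIM)
print(step(s_t, s_next, reward=torch.tensor(1.0)))
```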
According to an embodiment of the present disclosure, the policy gradient algorithm may include a gradient ascent method.
Based on a policy gradient algorithm, updating the initial parameter according to the first derivative to obtain a target parameter of the trained obstacle avoidance model, which may include the following operations:
and converting the first derivative by using a logarithmic function derivative formula to obtain a second derivative. Based on the gradient rise method, a plurality of accumulated rewards corresponding to a plurality of training samples are utilized to determine a reward expectation average value. A third derivative is determined based on the reward desired average and the second derivative. And determining the target parameter according to the third derivative and the initial parameter.
According to an embodiment of the present disclosure, the first derivative ∇R̄_θ is converted by the logarithmic derivative formula to obtain the second derivative ∇R̄_θ = Σ_τ R(τ) p_θ(τ) ∇log p_θ(τ). Assuming that each trajectory τ yields an accumulated reward R(τ), the sum of R(τ^n) weighted by the probability p_θ(τ) with which the trajectory τ occurs gives, based on the gradient ascent method, the reward expected average; from this reward expected average and the second derivative, the third derivative ∇R̄_θ is determined. The target parameter θ° is then determined from the third derivative and the initial parameter θ, as computed by equation (4):

\theta^{\circ} = \theta + \eta\, \nabla \bar{R}_\theta    (4)

where η represents the update weight, i.e., how strongly the reward obtained after an action is performed adjusts the parameter.
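In practice, the update of equation (4) is usually realized by automatic differentiation of a surrogate objective whose gradient matches R(τ) ∇log p_θ(τ), rather than by assembling the derivative by hand. A hedged PyTorch sketch, with a placeholder linear policy, placeholder trajectory data, and η supplied as the optimizer learning rate:

```python
import torch
from torch.distributions import Categorical

policy = torch.nn.Linear(8, 3)                          # initial parameters theta
opt = torch.optim.SGD(policy.parameters(), lr=1e-3)     # eta in equation (4)

states = torch.randn(16, 8)                             # one sampled trajectory (placeholder data)
actions = torch.randint(0, 3, (16,))
R = torch.tensor(250.0)                                 # accumulated reward R(tau) of the trajectory

log_probs = Categorical(logits=policy(states)).log_prob(actions)
surrogate = -(R * log_probs).mean()                     # minimizing this maximizes R(tau) * sum_t log p_theta(a_t|s_t)
opt.zero_grad()
surrogate.backward()                                    # autograd supplies grad log p_theta
opt.step()                                              # theta° = theta + eta * grad R_bar_theta
```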
According to an embodiment of the present disclosure, the reward and punishment function R is shown in equation (5):

R_\theta = \begin{cases} 250, & d < D \\ 150, & d > D \end{cases}    (5)

where θ represents the initial parameter; R_θ represents the accumulated reward corresponding to the initial parameter; k is a discount factor set according to the actual situation; D represents the punishment distance between the mimicry deformable robot and the obstacle; and d represents the minimum distance between the mimicry deformable robot and the obstacle.

The logarithmic derivative formula is shown in equation (6); applying it to the first derivative gives the second derivative ∇R̄_θ = Σ_τ R(τ) p_θ(τ) ∇log p_θ(τ), and the third derivative is shown in equations (7) and (8):

\nabla \log f(x) = \frac{\nabla f(x)}{f(x)}    (6)

\nabla \bar{R}_\theta = E_{\tau \sim p_\theta(\tau)}\big[R(\tau)\, \nabla \log p_\theta(\tau)\big]    (7)

\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} R(\tau^{n})\, \nabla \log p_\theta(a_t^{n} \mid s_t^{n})    (8)

where τ represents a trajectory; R represents the accumulated reward; θ represents the initial parameter; p_θ represents the probability; E_{τ∼p_θ(τ)} represents the reward expected average; N represents the number of training samples; n indexes the n-th training sample; a represents an action; s represents a state; t indexes the t-th step; and T represents the total number of actions or states.
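A small Python sketch of a distance-based reward and punishment function in the spirit of equation (5). The sign convention and the exact role of the discount factor k are not fully recoverable from the text, so k is omitted and the values 250 and 150 are reproduced only as assumptions.

```python
def reward_punishment(d: float, D: float) -> float:
    """Distance-based reward in the spirit of equation (5).

    d: minimum distance between the mimicry deformable robot and the obstacle.
    D: punishment distance threshold.
    The values 250 / 150 mirror the text; the discount factor k is left out
    because its exact placement in the formula is an open assumption.
    """
    if d < D:          # robot is within the punishment distance
        return 250.0
    return 150.0       # robot keeps a distance larger than the threshold

print(reward_punishment(d=0.3, D=0.5), reward_punishment(d=1.2, D=0.5))
```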
According to another embodiment of the present disclosure, updating the initial parameter according to the first derivative to obtain a target parameter of the trained obstacle avoidance model, may include the following operations:
and converting the first derivative by using a logarithmic function derivative formula to obtain a fourth derivative. A fifth derivative is determined based on the fourth derivative and the reward average. And performing weight optimization processing on the fifth derivative to obtain a sixth derivative. And carrying out parameter replacement processing on the sixth derivative to obtain a seventh derivative. And determining the target parameter according to the seventh derivative and the initial parameter.
According to an embodiment of the disclosure, the fourth derivative may be the same as the second derivative, i.e., it is likewise obtained by converting the first derivative with the logarithmic derivative formula.
According to the embodiment of the disclosure, the core idea of the strategy gradient algorithm is to increase the sampling probability of actions with larger rewards and reduce the sampling probability of actions with smaller rewards, so that the agent learns the optimal behavior strategy. However, when the reward and punishment function is designed so that the rewards of most actions are positive, the agent (the mimicry deformable robot of the present disclosure) may learn the following degenerate strategy: at the beginning of training, actions with smaller rewards are collected and their probabilities increase after the update; in subsequent training these actions are sampled many times, so their probabilities grow higher and higher and gradually exceed the probabilities of actions with larger rewards. The agent thus learns a suboptimal strategy and falls into a local optimum. The cause is that the reward and punishment function makes every action reward positive, which leads to the local-optimum problem in the strategy gradient algorithm. By adding the average reward E[R(τ)] obtained by the agent over one round of training to the calculation formula of the first derivative ∇R̄_θ as a baseline, the action reward can be made positive or negative during the calculation, thereby eliminating the suboptimal solution caused by the above problem. During training, the value of R(τ) is continuously recorded, its average is computed, and the baseline is continuously updated. The fifth derivative ∇R̄_θ is calculated as:

\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} \big(R(\tau^{n}) - b\big)\, \nabla \log p_\theta(a_t^{n} \mid s_t^{n}), \quad b = E[R(\tau)]
According to an embodiment of the present disclosure, in the formula for the fifth derivative all state-action pairs of a trajectory are weighted with the same reward. However, different actions in one round may produce different rewards, and the total trajectory reward does not represent every action on the trajectory equally well, so each action should be given its own weight. Since it is difficult to sample enough data in actual training, a reasonable weight is assigned to each action by counting only the rewards obtained from that action onward: Σ_{t'=t}^{T_n} r_{t'}^{n} denotes the accumulated reward from the t-th action of the n-th trajectory onward, where r_{t'}^{n} is the reward of the t'-th action of the n-th trajectory. The sixth derivative ∇R̄_θ can then be expressed by equation (9):

\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} \Big(\sum_{t'=t}^{T_n} r_{t'}^{n} - b\Big)\, \nabla \log p_\theta(a_t^{n} \mid s_t^{n})    (9)
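The per-action weight of equation (9), i.e., the reward-to-go from step t minus a baseline, can be computed as in the following NumPy sketch; the sample rewards and the single-trajectory baseline are placeholders.

```python
import numpy as np

def reward_to_go(rewards):
    """Sum of rewards from step t to the end of the trajectory, for every t."""
    return np.cumsum(rewards[::-1])[::-1]

# rewards r_t'^n of one trajectory (placeholder numbers)
r = np.array([1.0, -0.5, 2.0, 0.0, 1.5])
baseline = r.sum()                       # stand-in for b = E[R(tau)]; in training, an average over many trajectories
weights = reward_to_go(r) - baseline     # the term (sum_{t'>=t} r_t'^n - b) in equation (9)
print(weights)
```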
According to an embodiment of the present disclosure, Σ_{t'=t}^{T_n} r_{t'}^{n} − b is a general way of expressing the trajectory reward and does not reflect the discounting of future rewards at the current moment. Therefore, the term Σ_{t'=t}^{T_n} r_{t'}^{n} − b in the sixth derivative is replaced by the advantage function A^{θ'}(s_t, a_t), which accounts for that discounting, yielding the seventh derivative shown in equation (10). The target parameter is then determined from the seventh derivative and the initial parameter θ with reference to equation (4).

\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n} A^{\theta'}(s_t^{n}, a_t^{n})\, \nabla \log p_\theta(a_t^{n} \mid s_t^{n})    (10)
According to an embodiment of the present disclosure, updating the initial parameter according to the first derivative to obtain the target parameter may include the following operations:
and converting the first derivative by using a logarithmic function derivative formula to obtain an eighth derivative. And carrying out distribution conversion processing on the eighth derivative to obtain a ninth derivative. And obtaining a first gradient calculation value according to the dominance function and the ninth derivative. And obtaining a second gradient calculation value added with the importance sample according to the ninth derivative and the first gradient calculation value. And simplifying the second gradient calculation value to obtain a target gradient calculation value. And determining target parameters according to the target gradient calculation value and the initial parameters.
According to an embodiment of the disclosure, the eighth derivative may be the same as the second derivative, i.e., it is likewise obtained by converting the first derivative with the logarithmic derivative formula.
According to the embodiment of the present disclosure, the policy gradient algorithm is a typical on-policy algorithm, that is, the behavior policy and the evaluation policy are the same policy. This results in extremely low data utilization, because the gradient of the objective function can be calculated and the network parameters updated only after the agent (the mimicry deformable robot) has collected a complete round of data; once the network parameters are updated, that data is discarded and new data must be collected to update the network again, which makes network training very slow. To solve the slow training of the policy gradient algorithm, the PPO algorithm introduces the idea of importance sampling.

With importance sampling, the agent of the PPO algorithm can, when updating the current behavior policy, use trajectory sequence data sampled by historical behavior policies with different parameters. Therefore, although PPO is an on-policy algorithm, it can make use of historical data to update the network, which improves the training speed of the network model. It is assumed that the data used for updating the network parameters comes from a distribution p, but at this time data can only be collected from a distribution q; therefore, when the expectation of a function is computed, the formula must be transformed. The importance sampling formula, which is also the distribution conversion formula, is shown in equation (11).
The p-distribution and the q-distribution are obtained from environmental information collected by the mimicry deformable robot, and for example, a threshold range may be set, where the environmental information within the threshold range is the p-distribution and the environmental information outside the threshold range is the q-distribution.
E_{x \sim p}[f(x)] = E_{x \sim q}\!\left[f(x)\, \frac{p(x)}{q(x)}\right]    (11)
According to the embodiment of the present disclosure, distribution conversion processing is performed on the eighth derivative based on the distribution conversion formula, yielding the ninth derivative shown in equation (12). Substituting the advantage function A^{θ'}(s_t, a_t) for R(τ) gives the first gradient calculation value; combining the ninth derivative with the first gradient calculation value gives the second gradient calculation value, into which importance sampling has been introduced. Because the distributions of strategy π_θ and strategy π_{θ'} are assumed not to differ widely, the ratio p_θ(s_t)/p_{θ'}(s_t) is approximately 1, so the second gradient calculation value can be simplified to obtain the target gradient calculation value. The target parameter is then determined from the target gradient calculation value and the initial parameter θ with reference to equation (4).

Note that strategy π_θ and strategy π_{θ'} are sampled from the p distribution and the q distribution, respectively.
According to an embodiment of the present disclosure, the ninth derivative is shown in equation (12), the first gradient calculation value in equation (13), the second gradient calculation value in equation (14), and the target gradient calculation value in equation (15):

\nabla \bar{R}_\theta = E_{\tau \sim p_{\theta'}(\tau)}\!\left[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)}\, R(\tau)\, \nabla \log p_\theta(\tau)\right]    (12)

E_{(s_t, a_t) \sim \pi_\theta}\!\left[A^{\theta}(s_t, a_t)\, \nabla \log p_\theta(a_t^{n} \mid s_t^{n})\right]    (13)

E_{(s_t, a_t) \sim \pi_{\theta'}}\!\left[\frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)}\, \frac{p_\theta(s_t)}{p_{\theta'}(s_t)}\, A^{\theta'}(s_t, a_t)\, \nabla \log p_\theta(a_t^{n} \mid s_t^{n})\right]    (14)

E_{(s_t, a_t) \sim \pi_{\theta'}}\!\left[\frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)}\, A^{\theta'}(s_t, a_t)\, \nabla \log p_\theta(a_t^{n} \mid s_t^{n})\right]    (15)

where p_θ characterizes the probability corresponding to the strategy π_θ; p_{θ'} characterizes the probability corresponding to the transition strategy π_{θ'} obtained by converting the strategy π_θ; A^{θ'}(s_t, a_t) characterizes the advantage function corresponding to the transition strategy; and E characterizes the expectation of the reward.
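In code, the importance-sampling ratio of equations (12) to (15) is typically computed as exp(log π_θ(a_t|s_t) − log π_θ'(a_t|s_t)) from log-probabilities stored when the behavior strategy collected the data. A sketch under these assumptions; the network shapes and data are illustrative only.

```python
import torch
from torch.distributions import Categorical

policy = torch.nn.Linear(8, 3)                       # current strategy pi_theta
old_policy = torch.nn.Linear(8, 3)                   # behavior strategy pi_theta' (frozen snapshot)

states = torch.randn(32, 8)
actions = torch.randint(0, 3, (32,))
advantages = torch.randn(32)                         # A^{theta'}(s_t, a_t), placeholder values

with torch.no_grad():                                # saved at data-collection time
    old_log_probs = Categorical(logits=old_policy(states)).log_prob(actions)

new_log_probs = Categorical(logits=policy(states)).log_prob(actions)
ratio = torch.exp(new_log_probs - old_log_probs)     # p_theta(a|s) / p_theta'(a|s), cf. equation (15)
surrogate = (ratio * advantages).mean()              # differentiating this yields the target gradient value
(-surrogate).backward()                              # gradient ascent on the surrogate objective
```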
According to an embodiment of the present disclosure, the method for training the obstacle avoidance model may further include the following operations:
and determining an initial expected reward function according to the target gradient calculation value and the target parameter, wherein the initial expected reward function can comprise strategy distribution difference, an initial parameter and a behavior strategy parameter. And (4) cutting the strategy distribution difference to obtain a target expectation reward function.
The behavior policy parameters are processed based on a gradient ascent method such that a maximum expected reward value is determined according to a trajectory corresponding to the goal policy and a goal expected reward function. And processing the initial parameters by using a gradient descent method based on the maximum expected reward value to obtain transition initial parameters.
And under the condition that the mean square error between the transition reward and the transition value is smaller than a preset threshold value, determining the transition initial parameter as a target initial parameter, wherein the transition reward is determined according to a reward and punishment function, the transition initial parameter and a transition strategy, the transition value is determined according to a cost function, the transition initial parameter and the transition strategy, and the transition strategy is obtained by converting strategies. And determining a new reward and punishment function according to the target initial parameter and the reward and punishment function.
According to an embodiment of the present disclosure, the initial expected reward function J^{θ°'}(θ°) is determined from the target gradient calculation value and the target parameter θ°, as shown in equation (16):

J^{\theta^{\circ\prime}}(\theta^{\circ}) = E_{(s_t, a_t) \sim \pi_{\theta^{\circ\prime}}}\!\left[\frac{p_{\theta^{\circ}}(a_t \mid s_t)}{p_{\theta^{\circ\prime}}(a_t \mid s_t)}\, A^{\theta^{\circ\prime}}(s_t, a_t)\right]    (16)
FIG. 5 schematically shows a PPO-Clip algorithm according to an embodiment of the present disclosure.
As shown in FIG. 5, the PPO-Clip algorithm uses a clip function to clip the strategy distribution difference p_{θ°}(a_t|s_t)/p_{θ°'}(a_t|s_t), which yields the target expected reward function shown in equation (17):

J^{\theta^{\circ\prime}}_{\mathrm{clip}}(\theta^{\circ}) \approx \sum_{(s_t, a_t)} \min\!\left(\frac{p_{\theta^{\circ}}(a_t \mid s_t)}{p_{\theta^{\circ\prime}}(a_t \mid s_t)}\, A^{\theta^{\circ\prime}}(s_t, a_t),\ \mathrm{clip}\!\left(\frac{p_{\theta^{\circ}}(a_t \mid s_t)}{p_{\theta^{\circ\prime}}(a_t \mid s_t)},\, 1-\varepsilon,\, 1+\varepsilon\right) A^{\theta^{\circ\prime}}(s_t, a_t)\right)    (17)
According to an embodiment of the present disclosure, when A θ°′ (s t ,a t ) When the concentration of the carbon dioxide is more than 0,
Figure BDA0003856567170000155
has a maximum value of 1+ epsilon when A θ°′ (s t ,a t ) When the ratio is less than 0, the reaction mixture,
Figure BDA0003856567170000156
has a minimum value of 1-epsilon.
According to an embodiment of the present disclosure, the behavior strategy parameters are processed by the gradient ascent method, so that the maximum expected reward value is determined from the trajectory corresponding to the target strategy π_{θ'} and the target expected reward function J^{θ°'}_clip(θ°). Based on the maximum expected reward value, the initial parameter θ is processed by the gradient descent method to obtain the transition initial parameter θ⁻. When the mean square error between the transition reward and the transition value is smaller than the preset threshold w, the transition initial parameter θ⁻ is determined as the target initial parameter, and a new reward and punishment function is determined from the target initial parameter and the reward and punishment function.
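A hedged PyTorch sketch of a clipped surrogate objective in the spirit of equation (17), combined with a mean-square-error value loss that plays the role of the transition-reward/transition-value criterion described above. The network sizes, ε = 0.2, and the threshold w are illustrative assumptions, not values taken from the disclosure.

```python
import torch
import torch.nn.functional as F
from torch.distributions import Categorical

actor, critic = torch.nn.Linear(8, 3), torch.nn.Linear(8, 1)
opt = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=3e-4)

def ppo_clip_update(states, actions, old_log_probs, advantages, returns, eps=0.2, w=1e-3):
    """One PPO-Clip step; returns True once the critic's MSE drops below the threshold w."""
    new_log_probs = Categorical(logits=actor(states)).log_prob(actions)
    ratio = torch.exp(new_log_probs - old_log_probs)                   # strategy distribution difference
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    policy_obj = torch.min(ratio * advantages, clipped * advantages).mean()   # equation (17)
    value_loss = F.mse_loss(critic(states).squeeze(-1), returns)       # critic fit, analogous to the MSE criterion
    loss = -policy_obj + value_loss                                    # ascend the objective, descend the MSE
    opt.zero_grad()
    loss.backward()
    opt.step()
    return value_loss.item() < w

# placeholder batch to show the call signature
states, actions = torch.randn(32, 8), torch.randint(0, 3, (32,))
with torch.no_grad():
    old_lp = Categorical(logits=actor(states)).log_prob(actions)
print(ppo_clip_update(states, actions, old_lp, torch.randn(32), torch.randn(32)))
```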
Fig. 6 schematically shows an obstacle avoidance principle diagram of the mimic deformable robot according to the embodiment of the present disclosure. Fig. 7 schematically shows an obstacle avoidance scene diagram of a mimic deformable robot according to an embodiment of the present disclosure. Fig. 8 schematically shows an obstacle avoidance success rate diagram of the mimic deformable robot according to the embodiment of the disclosure.
According to the embodiment of the disclosure, based on the obstacle avoidance principle shown in fig. 6, the mimicry deformable robot was tested in simulation in the underwater environment shown in fig. 7, and the curve of its obstacle avoidance success rate as a function of the number of training runs is shown in fig. 8. As can be seen from fig. 8, little statistical data is available at the beginning of training, so individual successes and failures have a large influence on the success rate and the curve fluctuates considerably at the start of the experiment; as the number of training runs increases, the fluctuation gradually decreases. Finally, the obstacle avoidance success rate of the mimicry deformable robot converges to about 80%. These results show that the PPO obstacle avoidance method not only has excellent obstacle avoidance capability in complex obstacle avoidance scenes but also has excellent robustness.

Therefore, the near-end strategy optimization obstacle avoidance method based on the strategy gradient can flexibly and effectively handle unknown, complex, dynamic obstacle avoidance scenes; the model converges normally, maintains a good obstacle avoidance success rate in new obstacle avoidance scenes, and shows good robustness.
Fig. 9 schematically shows a flowchart of an obstacle avoidance method of the mimicry deformable robot according to an embodiment of the present disclosure.
As shown in fig. 9, the obstacle avoidance method of the mimic deformable robot may include operations S901 to S903.
In operation S901, target environment information, which may include target obstacle information, acquired by a plurality of ultrasonic sensors of an mimicry deformable robot is acquired for each target walking stage.
In operation S902, the target environment information is processed by using the target obstacle avoidance model, and a target track of a target walking stage is output, where the target track may include a plurality of discrete target actions walking in the target environment and a target state corresponding to each target action, where the target actions include at least one of: serpentine deformation action, spherical deformation action and square or rectangular deformation action.
In operation S903, the mimicry deformable robot performs a walking operation of the target walking stage according to the target trajectory, wherein the walking operation can avoid collision of the mimicry deformable robot with the target obstacle information.
According to the embodiment of the disclosure, for each target walking stage, after the target environment information, which may include target obstacle information, is acquired by the plurality of ultrasonic sensors of the mimicry deformable robot, the trained obstacle avoidance model is used to control the mimicry deformable robot to walk in the target environment information of that stage. At the beginning of each round the coordinates of a target point are initialized randomly and the mimicry deformable robot is reset to the center coordinates (0, 0). If the mimicry deformable robot successfully avoids all target obstacles and reaches the target point, the success count is increased by 1; if it collides with an obstacle while traveling, the failure count is increased by 1.
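A sketch of the per-stage inference loop of operations S901 to S903. The sensor-reading helper, the number of ultrasonic sensors, the action labels, the toy model, and the episode bookkeeping are all hypothetical placeholders.

```python
import numpy as np

ACTIONS = ["serpentine", "spherical", "rectangular"]

def read_ultrasonic_sensors(n_sensors=8):
    """Hypothetical stand-in for the robot's ultrasonic distance readings."""
    return np.random.uniform(0.1, 3.0, size=n_sensors)

def run_episode(model, goal, max_steps=200, collision_dist=0.1):
    position, successes, failures = np.zeros(2), 0, 0         # robot reset to centre (0, 0)
    for _ in range(max_steps):
        state = read_ultrasonic_sensors()                      # S901: target environment information
        action_idx, motion = model(state)                      # S902: target obstacle avoidance model
        deformation = ACTIONS[action_idx]                      # chosen discrete deformation action
        position = position + motion                           # S903: execute the walking operation
        if state.min() < collision_dist:                       # collided with a target obstacle
            return successes, failures + 1
        if np.linalg.norm(position - goal) < 0.2:              # reached the randomly chosen target point
            return successes + 1, failures
    return successes, failures

# toy model: small fixed step toward the goal, deformation picked by the largest clearance
model = lambda s: (int(np.argmax(s[:3])), np.array([0.05, 0.05]))
print(run_episode(model, goal=np.array([1.0, 1.0])))
```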
According to the embodiment of the disclosure, the corresponding strategy is determined for the training sample, the desired reward of the corresponding trajectory is determined based on the strategy, the initial parameter is updated by using the first derivative obtained through differentiation, and the obstacle avoidance model is obtained.
According to an embodiment of the present disclosure, before outputting the target trajectory, the following operations may be further included:
and generating a target accumulated reward according to the target track, and updating target parameters in the target obstacle avoidance model according to the target accumulated reward.
According to the embodiment of the disclosure, the target parameters are updated in real time through the environmental information of the mimicry deformable robot in actual use, so that the obstacle avoidance model is updated and optimized through repeated learning.
Fig. 10 schematically illustrates a block diagram of an apparatus for training an obstacle avoidance model according to an embodiment of the present disclosure.
As shown in fig. 10, an apparatus 1000 for training an obstacle avoidance model based on a near-end strategy of a strategy gradient may include a generating module 1001, a first obtaining module 1002, a first determining module 1003, a second obtaining module 1004, a third obtaining module 1005, a simulating module 1006, and a second determining module 1007.
A generating module 1001, configured to generate a plurality of strategies according to the environment information for each training sample in the training sample set of each walking stage, wherein the training samples may include environment information acquired by the mimicry deformable robot, which may include obstacle information, each strategy may include a trajectory traveled by the mimicry deformable robot in the external environment, the trajectory may include a plurality of discrete actions and a state corresponding to each action, the actions being generated by a strategy function, which may include initial parameters, wherein the actions include at least one of: serpentine deformation action, spherical deformation action, and square or rectangular deformation action.
The first obtaining module 1002 is configured to process the trajectory by using a reward and punishment function for each policy, so as to obtain an accumulated reward of the trajectory, where the reward and punishment function is determined according to a reward corresponding to the initial parameter, and the reward and punishment functions used in walking stages of different mimicry states differ, as do the policy functions.
A first determining module 1003, configured to determine the desired reward of the strategy according to a plurality of accumulated rewards corresponding to a plurality of strategies and a probability corresponding to each trajectory, where the probability represents a probability of selecting a trajectory of a current strategy from the trajectories corresponding to the plurality of strategies according to a current state of the mimicry deformable robot.
The second obtaining module 1004 is configured to perform derivation processing on the desired reward to obtain a first derivative.
A third obtaining module 1005, configured to update the initial parameter according to the first derivative based on a policy gradient algorithm, so as to obtain a target parameter of the trained obstacle avoidance model.
And the simulation module 1006 is configured to control the mimicry deformable robot to walk in the environment information of the walking stage by using the trained obstacle avoidance model, and iteratively use other training samples to train the obstacle avoidance model under the condition that the mimicry deformable robot collides with the obstacle information, so as to obtain a new target parameter of the trained obstacle avoidance model.
And a second determining module 1007, configured to determine the trained obstacle avoidance model as the target obstacle avoidance model in the walking stage when the mimicry deformable robot does not collide with the obstacle information.
According to the embodiment of the present disclosure, a corresponding strategy is determined for each training sample, the expected reward corresponding to the trajectory is determined on the basis of the strategy, and the initial parameter is updated according to the first derivative obtained by derivation, so that the target obstacle avoidance model is obtained.
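Read together, modules 1001 to 1007 amount to a policy-gradient training loop of roughly the following shape. This is a schematic only; the helper names (generate_strategies, cumulative_reward, expected_reward, derive, update_parameters, walk_collides) are hypothetical stand-ins for the operations of the corresponding modules.

def train_for_walking_stage(model, training_samples, make_env):
    # Schematic of one walking stage's training, following modules 1001-1007.
    for sample in training_samples:
        strategies = model.generate_strategies(sample.environment_info)               # module 1001
        rewards = [model.cumulative_reward(trajectory) for trajectory in strategies]  # module 1002
        expected_reward = model.expected_reward(strategies, rewards)                  # module 1003
        first_derivative = model.derive(expected_reward)                              # module 1004
        model.update_parameters(first_derivative)                                     # module 1005 (gradient ascent)
        env = make_env(sample.environment_info)
        if not env.walk_collides(model):                                              # modules 1006 and 1007
            return model  # no collision: accept as the target obstacle avoidance model
    return model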
According to an embodiment of the present disclosure, the generating module 1001 may include a first generating unit, a first obtaining unit, and a second generating unit.
A first generating unit for generating a plurality of current actions according to the environment information by using the policy function.
And a first obtaining unit, configured to process, for each of the plurality of current actions, the current action and the state by using a value function, and obtain a current value corresponding to the current action.
And the second generation unit is used for processing the current action and the current value by using the advantage function to generate the next action.
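The interplay of the strategy (actor), value (critic), and advantage functions described by these units can be sketched as follows. The network sizes, the categorical action distribution over the deformation actions, and the one-step advantage estimate are assumptions for illustration, not details taken from the patent.

import torch
import torch.nn as nn

class ActorCritic(nn.Module):
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        # actor: strategy function pi_theta; critic: value function V(s)
        self.actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))
        self.critic = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

    def act(self, state):
        logits = self.actor(state)
        dist = torch.distributions.Categorical(logits=logits)  # e.g. serpentine / spherical / rectangular actions
        action = dist.sample()                                  # current action
        value = self.critic(state)                              # current value of the state
        return action, dist.log_prob(action), value

def one_step_advantage(reward, value, next_value, gamma=0.99):
    # A(s, a) = r + gamma * V(s') - V(s): how much better the current action is than average
    return reward + gamma * next_value - value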
According to an embodiment of the present disclosure, the policy gradient algorithm may include a gradient ascent method.
According to an embodiment of the present disclosure, the third obtaining module 1005 may include a first converting unit, a first determining unit, a second determining unit, and a third determining unit.
And the first conversion unit is used for converting the first derivative by using a logarithmic function derivation formula to obtain a second derivative.
The first determining unit is used for determining the expected average value of the reward by utilizing a plurality of accumulated rewards corresponding to a plurality of training samples based on the gradient ascent method.
A second determining unit for determining a third derivative based on the reward desired average and the second derivative.
A third determining unit for determining the target parameter based on the third derivative and the initial parameter.
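These four units correspond to the classical score-function (REINFORCE-style) estimator: the log-derivative identity converts the gradient of the probability into the probability times the gradient of its logarithm, the expectation is approximated by an average over N sampled trajectories, and the initial parameter is moved by gradient ascent. A minimal sketch in PyTorch, assuming the trajectories are stored as (log_probs, cumulative_reward) pairs and the optimizer wraps the strategy-function parameters:

import torch

def policy_gradient_step(trajectories, optimizer):
    # trajectories: list of (log_probs, cumulative_reward) pairs for N sampled rollouts,
    # where log_probs is a list of tensors log p_theta(a_t | s_t) with gradients attached.
    loss = 0.0
    for log_probs, cumulative_reward in trajectories:
        # R(tau^n) * sum_t log p_theta(a_t^n | s_t^n); backward() then yields
        # R(tau^n) * sum_t grad log p_theta(a_t^n | s_t^n).
        loss = loss - cumulative_reward * torch.stack(log_probs).sum()
    loss = loss / len(trajectories)  # 1/N Monte-Carlo average
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                 # minimizing the negated objective == gradient ascent on the expected reward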
According to an embodiment of the present disclosure, the third obtaining module 1005 may include a second converting unit, a fourth determining unit, an optimizing unit, a replacing unit, and a fifth determining unit.
And the second conversion unit is used for performing conversion processing on the first derivative by using a logarithmic function derivation formula to obtain a fourth derivative.
A fourth determining unit for determining a fifth derivative based on the fourth derivative and the reward average.
And the optimization unit is used for performing weight optimization processing on the fifth derivative to obtain a sixth derivative.
And the replacing unit is used for performing parameter replacing processing on the sixth derivative to obtain a seventh derivative.
And a fifth determining unit, configured to determine the target parameter according to the seventh derivative and the initial parameter.
According to an embodiment of the present disclosure, the third obtaining module 1005 may include a third converting unit, a fourth converting unit, a second obtaining unit, a third obtaining unit, a simplifying unit, and a sixth determining unit.
And the third conversion unit is used for carrying out conversion processing on the first derivative by using a logarithmic function derivation formula to obtain an eighth derivative.
And the fourth conversion unit is used for performing distribution conversion processing on the eighth derivative to obtain a ninth derivative.
And the second obtaining unit is used for obtaining a first gradient calculation value according to the dominance function and the ninth derivative.
And the third obtaining unit is used for obtaining a second gradient calculated value added with the importance sample according to the ninth derivative and the first gradient calculated value.
And the simplifying unit is used for simplifying the second gradient calculated value to obtain a target gradient calculated value.
And the sixth determining unit is used for determining the target parameters according to the target gradient calculation value and the initial parameters.
According to an embodiment of the present disclosure, the third obtaining module 1005 may further include a seventh determining unit, a fourth obtaining unit, an eighth determining unit, a fifth obtaining unit, a ninth determining unit, and a tenth determining unit.
And a seventh determining unit, configured to determine an initial expected reward function according to the target gradient calculation value and the target parameter, where the initial expected reward function may include a strategy distribution difference, the initial parameter, and a behavior strategy parameter.
And the fourth obtaining unit is used for performing cutting processing on the strategy distribution difference to obtain the target expected reward function.
And the eighth determining unit is used for processing the behavior strategy parameters based on a gradient ascent method so as to determine the maximum expected reward value according to the track corresponding to the target strategy and the target expected reward function.
And a fifth obtaining unit, configured to process the initial parameter by using a gradient descent method based on the maximum expected reward value, so as to obtain a transition initial parameter.
And the ninth determining unit is used for determining the transition initial parameter as the target initial parameter under the condition that the mean square error between the transition reward and the transition value is smaller than a preset threshold value, wherein the transition reward is determined according to a reward and punishment function, the transition initial parameter and a transition strategy, the transition value is determined according to a cost function, the transition initial parameter and the transition strategy, and the transition strategy is obtained by converting strategies.
And the tenth determining unit is used for determining a new reward and punishment function according to the target initial parameter and the reward and punishment function.
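The cutting of the strategy distribution difference corresponds to the clipped surrogate objective used in proximal ("near-end") policy optimization, and the mean-square-error test of the ninth determining unit corresponds to a critic convergence check. A sketch under those assumptions, with an assumed clip range eps and an assumed acceptance threshold:

import torch

def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, eps=0.2):
    ratios = torch.exp(new_log_probs - old_log_probs.detach())        # strategy distribution difference
    unclipped = ratios * advantages
    clipped = torch.clamp(ratios, 1.0 - eps, 1.0 + eps) * advantages  # "cut" the difference to [1 - eps, 1 + eps]
    return -torch.min(unclipped, clipped).mean()                      # ascent on the target expected reward

def transition_accepted(transition_rewards, transition_values, threshold=1e-2):
    # accept the transition initial parameter once the mean square error between
    # the transition reward and the transition value falls below the preset threshold
    return torch.mean((transition_rewards - transition_values) ** 2).item() < threshold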
Fig. 11 schematically shows a block diagram of an obstacle avoidance apparatus of a mimic variable robot according to an embodiment of the present disclosure.
As shown in fig. 11, an obstacle avoidance apparatus 1100 of a mimic deformable robot may include an acquisition module 1101, an output module 1102, and an execution module 1103.
An obtaining module 1101, configured to obtain, for each target walking stage, target environment information, which may include target obstacle information, acquired by a plurality of ultrasonic sensors of the mimicry deformable robot.
The output module 1102 is configured to process target environment information by using the target obstacle avoidance model, and output a target track of a target walking stage, where the target track may include a plurality of discrete target actions walking in a target environment and a target state corresponding to each target action, and the target actions include at least one of the following: serpentine deformation action, spherical deformation action, and square or rectangular deformation action.
And the executing module 1103 is used for executing the walking operation of the mimicry deformable robot in the target walking stage according to the target track, wherein the walking operation can avoid the mimicry deformable robot colliding with the target obstacle information.
According to the embodiment of the disclosure, the obstacle avoidance device may further include a second generation module and an update module.
And the second generation module is used for generating the target accumulated reward according to the target track.
And the updating module is used for updating the target parameters in the obstacle avoidance model according to the target accumulated reward.
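For the device of Fig. 11, the inference-time flow can be sketched as follows: the ultrasonic ranges form the state, the trained target obstacle avoidance model proposes the next deformation or walking action, and the executed trajectory can then be handed to the second generation module and the update module. The sensor and actuator interfaces below are hypothetical.

def avoid_obstacles(model, sensors, actuator, max_steps=500):
    # One target walking stage: read ultrasonic ranges, pick an action, execute it.
    trajectory = []
    for _ in range(max_steps):
        state = sensors.read_ultrasonic_ranges()  # target environment information (cf. acquisition module 1101)
        action = model.select_action(state)       # serpentine / spherical / rectangular deformation (cf. output module 1102)
        actuator.execute(action)                  # walking operation of the stage (cf. execution module 1103)
        trajectory.append((state, action))
        if sensors.reached_target():
            break
    return trajectory  # may be passed to the second generation module and the update module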
Any of the modules, units, or at least part of the functionality of any of them according to embodiments of the present disclosure may be implemented in one module. Any one or more of the modules and units according to the embodiments of the present disclosure may be implemented by being split into a plurality of modules. Any one or more of the modules, units according to the embodiments of the present disclosure may be implemented at least partially as a hardware Circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system on a chip, a system on a substrate, a system on a package, an Application Specific Integrated Circuit (ASIC), or may be implemented by hardware or firmware in any other reasonable manner of integrating or packaging a Circuit, or implemented by any one of three implementations of software, hardware, and firmware, or any suitable combination of any of them. Alternatively, one or more of the modules, units according to embodiments of the present disclosure may be implemented at least partly as computer program modules, which, when executed, may perform the respective functions.
It should be noted that, in the embodiment of the present disclosure, a device part for training an obstacle avoidance model based on a near-end strategy of a strategy gradient corresponds to a method part for training an obstacle avoidance model based on a near-end strategy of a strategy gradient in the embodiment of the present disclosure, and for the description of the device part for training an obstacle avoidance model based on a near-end strategy of a strategy gradient, reference is specifically made to the method part for training an obstacle avoidance model based on a near-end strategy of a strategy gradient, which is not described herein again. Similarly, the obstacle avoidance device part of the mimic variable robot in the embodiment of the disclosure corresponds to the obstacle avoidance method part of the mimic variable robot in the embodiment of the disclosure, and the description of the obstacle avoidance device part of the mimic variable robot specifically refers to the obstacle avoidance method part of the mimic variable robot, and is not repeated herein.
Fig. 12 schematically shows a block diagram of an electronic device adapted to implement the above described method according to an embodiment of the present disclosure. The electronic device shown in fig. 12 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present disclosure.
As shown in fig. 12, an electronic device 1200 according to an embodiment of the present disclosure may include a processor 1201 that may perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 1202 or a program loaded from a storage section 1208 into a Random Access Memory (RAM) 1203. The processor 1201 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 1201 may also include on-board memory for caching purposes. The processor 1201 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the present disclosure.
In the RAM 1203, various programs and data necessary for the operation of the electronic apparatus 1200 are stored. The processor 1201, the ROM 1202, and the RAM 1203 are connected to each other by a bus 1204. The processor 1201 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 1202 and/or the RAM 1203. Note that the programs may also be stored in one or more memories other than the ROM 1202 and the RAM 1203. The processor 1201 may also perform various operations of method flows according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the electronic device 1200 may also include an input/output (I/O) interface 1205, which is also connected to the bus 1204. The electronic device 1200 may also include one or more of the following components connected to the I/O interface 1205: an input section 1206, which may include a keyboard, a mouse, and the like; an output section 1207, which may include a cathode ray tube (CRT) display, a liquid crystal display (LCD), a speaker, and the like; a storage section 1208, which may include a hard disk and the like; and a communication section 1209, which may include a network interface card such as a LAN card or a modem. The communication section 1209 performs communication processing via a network such as the Internet. A drive 1210 is also connected to the I/O interface 1205 as needed. A removable medium 1211, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1210 as needed, so that a computer program read therefrom is installed into the storage section 1208 as necessary.
According to embodiments of the present disclosure, method flows according to embodiments of the present disclosure may be implemented as computer software programs. For example, embodiments of the present disclosure may include a computer program product, which may include a computer program carried on a computer-readable storage medium, the computer program containing program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication portion 1209 and/or installed from the removable medium 1211. The computer program, when executed by the processor 1201, performs the above-described functions defined in the system of the embodiments of the present disclosure. The systems, devices, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
The present disclosure also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the disclosure.
According to an embodiment of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium. Examples may include, but are not limited to: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Embodiments of the present disclosure may also include a computer program product, which may include a computer program containing program code for performing the method provided by the embodiments of the present disclosure. When the computer program product runs on an electronic device, the program code causes the electronic device to implement the method for training an obstacle avoidance model based on a near-end strategy of a strategy gradient, or the obstacle avoidance method of the mimicry deformable robot, provided by the embodiments of the present disclosure.
The computer program, when executed by the processor 1201, performs the above-described functions defined in the system/apparatus of the embodiments of the present disclosure. The systems, apparatuses, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the present disclosure.
In one embodiment, the computer program may be hosted on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like. In another embodiment, the computer program may also be transmitted, distributed in the form of a signal on a network medium, downloaded and installed through the communication section 1209, and/or installed from the removable medium 1211. The computer program containing program code may be transmitted using any suitable network medium, which may include but is not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
In accordance with embodiments of the present disclosure, the program code for carrying out the computer programs provided by the embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, these computer programs may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. The programming language may include, but is not limited to, Java, C++, Python, the C language, or the like. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described separately above, this does not mean that the measures in the embodiments cannot be used in advantageous combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.

Claims (8)

1. A method for near-end strategy optimization training of an obstacle avoidance model based on strategy gradients, the obstacle avoidance model being applied to a mimic deformable robot, the method comprising:
generating, for each training sample of a set of training samples for each walking phase, a plurality of strategies from the environmental information, wherein the training samples comprise environmental information acquired with the mimicry deformable robot including obstacle information, each of the strategies comprising a trajectory of the mimicry deformable robot walking in the external environment, the trajectory comprising a plurality of discrete actions and a state corresponding to each of the actions, the actions being generated with a strategy function comprising initial parameters, wherein the actions comprise at least one of: snake-shaped deformation action, spherical deformation action and square or rectangular deformation action;
processing the track by utilizing a reward and punishment function aiming at each strategy to obtain the accumulated reward of the track, wherein the reward and punishment function is determined according to the reward corresponding to the initial parameter, and the reward and punishment functions of the walking stages in different mimicry states are different from the strategy functions;
determining a desired reward for said strategy based on a plurality of said cumulative rewards corresponding to a plurality of said strategies and a probability corresponding to each of said trajectories, wherein said probability characterizes a probability of said mimicry transformable robot selecting a trajectory of a current strategy from among trajectories corresponding to a plurality of strategies based on a current state;
carrying out derivation processing on the expected reward to obtain a first derivative;
updating the initial parameter according to the first derivative based on a strategy gradient algorithm to obtain a target parameter of a trained obstacle avoidance model;
controlling the mimicry deformable robot to walk in the environment information of the walking stage by using the trained obstacle avoidance model, and iteratively using other training samples to train the obstacle avoidance model under the condition that the mimicry deformable robot collides with the obstacle information to obtain a target parameter of a new trained obstacle avoidance model;
and under the condition that the mimicry deformable robot does not collide with the obstacle information, determining the trained obstacle avoidance model as a target obstacle avoidance model in the walking stage.
2. The method of claim 1, wherein the generating a plurality of policies based on the environmental information comprises:
generating a plurality of current actions from the environmental information using the policy function;
processing the current action and the state by using a value function aiming at each current action to obtain the current value corresponding to the current action;
processing the current action and the current value using a merit function to generate a next action.
3. The method of claim 1, wherein the strategy gradient algorithm comprises a gradient ascent method;
wherein, the updating the initial parameter according to the first derivative based on the strategy gradient algorithm to obtain the target parameter of the trained obstacle avoidance model includes:
converting the first derivative by using a logarithmic function derivation formula to obtain a second derivative;
determining an expected average value of rewards by utilizing a plurality of accumulated rewards corresponding to a plurality of training samples based on the gradient ascending method;
determining a third derivative from the reward desired average and the second derivative;
and determining the target parameter according to the third derivative and the initial parameter.
4. The method of claim 3, wherein the reward and punishment function R is as given in equation (1):

$$R_\theta = \begin{cases} 250, & D < d \\ 150, & D > d \end{cases} \qquad (1)$$

wherein $\theta$ represents the initial parameter; $R_\theta$ represents the cumulative reward corresponding to the initial parameter, with $R_\theta = 250$ when $D < d$ and $R_\theta = 150$ when $D > d$; $k$ is the discount factor; $d$ represents the punishment distance of the mimicry deformable robot from the obstacle; and $D$ represents the minimum distance of the mimicry deformable robot from the obstacle;

the first derivative $\nabla \bar{R}_\theta$ is as shown in equation (2), the logarithmic function derivation formula is as shown in equation (3), the second derivative is as shown in equation (4), and the third derivative is as shown in equation (5):

$$\nabla \bar{R}_\theta = \sum_{\tau} R(\tau)\,\nabla p_\theta(\tau) \qquad (2)$$

$$\nabla p_\theta(\tau) = p_\theta(\tau)\,\nabla \log p_\theta(\tau) \qquad (3)$$

$$\nabla \bar{R}_\theta = \sum_{\tau} R(\tau)\,p_\theta(\tau)\,\nabla \log p_\theta(\tau) = \mathbb{E}_{\tau \sim p_\theta(\tau)}\left[ R(\tau)\,\nabla \log p_\theta(\tau) \right] \qquad (4)$$

$$\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N} R(\tau^{n})\,\nabla \log p_\theta(\tau^{n}) = \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_{n}} R(\tau^{n})\,\nabla \log p_\theta\left(a_{t}^{n} \mid s_{t}^{n}\right) \qquad (5)$$

wherein $\tau$ represents the trajectory; $R$ represents the cumulative reward; $\theta$ represents the initial parameter; $p_\theta$ represents the probability; $\bar{R}$ represents the reward expectation mean; $N$ represents the number of training samples; $n$ represents the $n$-th training sample; $a$ represents an action; $s$ represents a state; $t$ represents the $t$-th action or state; and $T$ represents the total number of actions or states.
5. The method of claim 1, wherein the updating the initial parameter according to the first derivative to obtain a target parameter of a trained obstacle avoidance model comprises:
converting the first derivative by using a logarithmic function derivation formula to obtain a fourth derivative;
determining a fifth derivative according to the fourth derivative and the reward average;
performing weight optimization processing on the fifth derivative to obtain a sixth derivative;
performing parameter replacement processing on the sixth derivative to obtain a seventh derivative;
determining the target parameter according to the seventh derivative and the initial parameter;
wherein, the updating the initial parameter according to the first derivative to obtain a target parameter of the trained obstacle avoidance model includes:
converting the first derivative by using a logarithmic function derivation formula to obtain an eighth derivative;
carrying out distribution conversion processing on the eighth derivative to obtain a ninth derivative;
obtaining a first gradient calculation value according to the dominance function and the ninth derivative;
obtaining a second gradient calculation value added with importance sampling according to the ninth derivative and the first gradient calculation value;
simplifying the second gradient calculated value to obtain a target gradient calculated value;
determining the target parameter according to the target gradient calculation value and the initial parameter;
wherein the ninth derivative is as shown in equation (6), the first gradient calculation value is as shown in equation (7), the second gradient calculation value is as shown in equation (8), and the target gradient calculation value is as shown in equation (9):

$$\nabla \bar{R}_\theta = \mathbb{E}_{\tau \sim p_{\theta'}(\tau)}\left[ \frac{p_\theta(\tau)}{p_{\theta'}(\tau)}\, R(\tau)\,\nabla \log p_\theta(\tau) \right] \qquad (6)$$

$$\mathbb{E}_{(s_t, a_t) \sim \pi_{\theta}}\left[ A^{\theta}(s_t, a_t)\,\nabla \log p_\theta\left(a_{t}^{n} \mid s_{t}^{n}\right) \right] \qquad (7)$$

$$\mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}}\left[ \frac{p_\theta(s_t, a_t)}{p_{\theta'}(s_t, a_t)}\, A^{\theta'}(s_t, a_t)\,\nabla \log p_\theta\left(a_{t}^{n} \mid s_{t}^{n}\right) \right] \qquad (8)$$

$$\mathbb{E}_{(s_t, a_t) \sim \pi_{\theta'}}\left[ \frac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)}\, A^{\theta'}(s_t, a_t)\,\nabla \log p_\theta\left(a_{t}^{n} \mid s_{t}^{n}\right) \right] \qquad (9)$$

wherein $p_\theta$ characterizes the probability corresponding to the strategy $\pi_\theta$; $p_{\theta'}$ characterizes the probability corresponding to the transition strategy $\pi_{\theta'}$ obtained by converting the strategy $\pi_\theta$; $A^{\theta'}(s_t, a_t)$ characterizes the advantage function corresponding to the transition strategy; and $\mathbb{E}$ characterizes the expectation of the reward.
6. The method of claim 5, further comprising:
determining an initial expected reward function according to the target gradient calculation value and the target parameter, wherein the initial expected reward function comprises a strategy distribution difference, the initial parameter and a behavior strategy parameter;
cutting the strategy distribution difference to obtain a target expectation reward function;
processing the behavior strategy parameters based on a gradient ascent method so that a maximum expected reward value is determined according to a trajectory corresponding to a target strategy and the target expected reward function;
processing the initial parameters by using a gradient descent method based on the maximum expected reward value to obtain transition initial parameters;
determining a transition initial parameter as a target initial parameter under the condition that a mean square error between a transition reward and a transition value is smaller than a preset threshold value, wherein the transition reward is determined according to a reward and punishment function, the transition initial parameter and a transition strategy, the transition value is determined according to a cost function, the transition initial parameter and the transition strategy, and the transition strategy is obtained by converting the strategy;
and determining a new reward and punishment function according to the target initial parameter and the reward and punishment function.
7. An obstacle avoidance method of a mimic deformable robot comprises the following steps:
aiming at each target walking stage, acquiring target environment information including target obstacle information acquired by a plurality of ultrasonic sensors of the mimicry deformable robot, and establishing a state space by determining the distance between the mimicry deformable robot and the obstacle;
processing the target environment information by using a trained target obstacle avoidance model, and outputting a target track of the target walking stage, wherein the target track comprises a plurality of discrete target actions walking in the target environment and a target state corresponding to each target action, and the target actions comprise at least one of the following: snake-shaped deformation action, spherical deformation action and square or rectangular deformation action;
and the mimicry deformable robot executes the walking operation of the target walking stage according to the target track, wherein the walking operation can avoid the mimicry deformable robot colliding with the target obstacle information.
8. The obstacle avoidance method according to claim 7, further comprising, before outputting the target trajectory:
generating a target accumulated reward according to the target track;
and updating target parameters in the obstacle avoidance model in the target walking stage according to the target accumulated reward.