CN115542901B - Deformable robot obstacle avoidance method based on near-end strategy training - Google Patents
- Publication number
- CN115542901B (application CN202211154605.7A / CN202211154605A)
- Authority
- CN
- China
- Prior art keywords
- target
- derivative
- strategy
- obstacle avoidance
- mimicry
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G05D1/0255—Control of position or course in two dimensions specially adapted to land vehicles using acoustic signals, e.g. ultrasonic signals
- G05D1/0221—Control of position or course in two dimensions specially adapted to land vehicles with means for defining a desired trajectory involving a learning process
- G05D1/0276—Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Radar, Positioning & Navigation (AREA)
- Remote Sensing (AREA)
- Aviation & Aerospace Engineering (AREA)
- General Physics & Mathematics (AREA)
- Automation & Control Theory (AREA)
- Acoustics & Sound (AREA)
- Manipulator (AREA)
- Feedback Control In General (AREA)
Abstract
The invention provides a deformable robot obstacle avoidance method based on proximal strategy training, suitable for the obstacle avoidance model of a mimicry deformable robot. The method for training the obstacle avoidance model comprises the following steps: generating a plurality of strategies from the environment information for each training sample; processing the trajectory of each strategy with a reward-and-punishment function to obtain a cumulative reward; determining an expected reward from the plurality of cumulative rewards and the corresponding probabilities; differentiating the expected reward to obtain a first derivative; updating the initial parameters according to the first derivative, based on a strategy gradient algorithm, to obtain target parameters; controlling the mimicry deformable robot to walk in the environment information using the trained obstacle avoidance model and, if the mimicry deformable robot collides with the obstacle information, iteratively training the obstacle avoidance model with other training samples to obtain the target parameters of a newly trained obstacle avoidance model; and, if the mimicry deformable robot does not collide with the obstacle information, determining the trained obstacle avoidance model as the target obstacle avoidance model.
Description
Technical Field
The present disclosure relates to the field of robotics, and more particularly to a method for training an obstacle avoidance model by strategy-gradient-based proximal strategy optimization, an obstacle avoidance method for a mimicry deformable robot, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product.
Background
The mimicry deformable robot needs to avoid colliding with obstacles during operation, as well as with other robots or grasping targets, and must perform complex nonlinear deformations to switch between different mimicry motion modes when facing different environments and tasks.
In the process of implementing the concept of the present disclosure, the inventors found that the related art has at least the following problem: the traditional linear and nonlinear control theory based on a rigid-body model cannot achieve a satisfactory obstacle avoidance effect.
Disclosure of Invention
In view of this, embodiments of the present disclosure provide a method for training an obstacle avoidance model based on a proximal strategy of a strategy gradient, an obstacle avoidance method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product for mimicking a deformable robot.
An aspect of an embodiment of the present disclosure provides a method for training an obstacle avoidance model based on a proximal strategy of a strategy gradient, where the obstacle avoidance model is applied to a mimicry deformable robot, the method including:
generating a plurality of strategies according to environmental information for each training sample in a training sample set of each walking stage, wherein the training samples comprise environmental information comprising obstacle information acquired by using a mimicry deformable robot, each strategy comprises a track of walking of the mimicry deformable robot in an external environment, the track comprises a plurality of discrete actions and states corresponding to each action, the actions are generated by using a strategy function comprising initial parameters, and the actions comprise at least one of the following: a serpentine deformation action, a spherical deformation action, and a square or rectangular deformation action;
processing the tracks by using reward and punishment functions aiming at each strategy to obtain accumulated rewards of the tracks, wherein the reward and punishment functions are determined according to rewards corresponding to initial parameters, and the reward and punishment functions and the strategy functions of different mimicry walking stages are different;
Determining expected rewards of strategies according to a plurality of accumulated rewards corresponding to the strategies and probabilities corresponding to each track, wherein the probabilities represent probabilities of selecting tracks of the current strategy from the tracks corresponding to the strategies according to the current state by the mimicry deformable robot;
Conducting derivation processing on the expected rewards to obtain a first derivative;
updating the initial parameters according to the first derivative based on a strategy gradient algorithm to obtain target parameters of the trained obstacle avoidance model;
Controlling the simulated deformable robot to walk in the environment information of the walking stage by using the trained obstacle avoidance model, and under the condition that the simulated deformable robot collides with the obstacle information, iteratively using other training samples to train the obstacle avoidance model to obtain the target parameters of the new trained obstacle avoidance model;
And under the condition that the mimicry deformable robot does not collide with the obstacle information, determining the trained obstacle avoidance model as a target obstacle avoidance model in the walking stage.
Another aspect of an embodiment of the present disclosure provides an obstacle avoidance method of a mimicking a deformable robot, including:
acquiring target environment information including target obstacle information acquired by a plurality of ultrasonic sensors of the mimicry deformable robot for each target walking stage;
Processing the target environment information by using a trained target obstacle avoidance model, and outputting a target track of the target walking stage, wherein the target track comprises a plurality of discrete target actions walking in the target environment and a target state corresponding to each target action, and the target actions comprise at least one of the following: serpentine deformation action, spherical deformation action, and rectangular deformation action;
and the mimicry deformable robot executes the walking operation of the target walking stage according to the target track, wherein the walking operation can avoid the mimicry deformable robot from colliding with the target obstacle information.
Another aspect of an embodiment of the present disclosure provides an apparatus for training an obstacle avoidance model based on a proximal strategy of a strategy gradient, including:
A generating module, configured to generate, for each training sample in a training sample set of each walking stage, a plurality of strategies according to environmental information, where the training sample includes environmental information including obstacle information acquired by using a mimicking deformable robot, each strategy includes a trajectory of walking of the mimicking deformable robot in an external environment, the trajectory includes a plurality of discrete actions and states corresponding to each action, the actions are generated by using a strategy function including initial parameters, and the actions include at least one of: a serpentine deformation action, a spherical deformation action, and a square or rectangular deformation action;
the first obtaining module is used for processing the tracks by using reward and punishment functions aiming at each strategy to obtain accumulated rewards of the tracks, wherein the reward and punishment functions are determined according to rewards corresponding to initial parameters, and the reward and punishment functions and the strategy functions in different mimicry walking stages are different;
The first determining module is used for determining expected rewards of strategies according to a plurality of accumulated rewards corresponding to the strategies and probabilities corresponding to each track, wherein the probabilities represent probabilities of the track of the current strategy selected by the mimicry deformable robot from the tracks corresponding to the strategies according to the current state;
The second obtaining module is used for conducting derivation processing on the expected rewards to obtain a first derivative;
The third obtaining module is used for updating the initial parameters according to the first derivative based on a strategy gradient algorithm to obtain target parameters of the trained obstacle avoidance model;
the simulation module is used for controlling the simulated deformable robot to walk in the environmental information of the walking stage by using the trained obstacle avoidance model, and under the condition that the simulated deformable robot collides with the obstacle information, the simulated deformable robot is used for iteratively training the obstacle avoidance model by using other training samples to obtain the target parameters of the new trained obstacle avoidance model;
And the second determining module is used for determining the trained obstacle avoidance model as the target obstacle avoidance model in the walking stage under the condition that the mimicry deformable robot does not collide with the obstacle information.
Another aspect of an embodiment of the present disclosure provides an obstacle avoidance device mimicking a deformable robot, including:
The acquisition module is used for acquiring target environment information comprising target obstacle information acquired by a plurality of ultrasonic sensors of the mimicry deformable robot aiming at each target walking stage, and establishing a state space by determining the distance between the mimicry deformable robot and the obstacle;
the output module is used for processing the target environment information by utilizing the target obstacle avoidance model and outputting a target track of the target walking stage, wherein the target track comprises a plurality of discrete target actions walking in the target environment and a target state corresponding to each target action, and the target actions comprise at least one of the following: a serpentine deformation action, a spherical deformation action, and a square or rectangular deformation action;
and the execution module is used for executing the walking operation of the target walking stage according to the target track by the mimicry deformable robot, wherein the walking operation can avoid the mimicry deformable robot from colliding with the target obstacle information.
Another aspect of an embodiment of the present disclosure provides an electronic device, including: one or more processors; and a memory for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method as described above.
Another aspect of an embodiment of the present disclosure provides a computer-readable storage medium storing computer-executable instructions that, when executed, are configured to implement a method as described above.
Another aspect of the disclosed embodiments provides a computer program product comprising computer executable instructions which, when executed, are to implement a method as described above.
According to the embodiment of the disclosure, the strategies corresponding to the training samples are determined, the expected reward of the corresponding trajectory is determined based on the strategies, and the initial parameters are updated with the first derivative obtained by differentiating the expected reward, so as to obtain the target obstacle avoidance model. When walking in the environment with the target obstacle avoidance model, the mimicry deformable robot can handle complex obstacle avoidance scenes and achieve a good obstacle avoidance effect in them, so that the technical problem that the traditional linear and nonlinear control theory based on a rigid-body model cannot achieve a satisfactory obstacle avoidance effect is at least partially overcome.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent from the following description of embodiments thereof with reference to the accompanying drawings in which:
FIG. 1 schematically illustrates an exemplary system architecture to which a method of training an obstacle avoidance model may be applied, in accordance with an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of a method of training an obstacle avoidance model, in accordance with an embodiment of the present disclosure;
FIG. 3 schematically illustrates a schematic diagram of a proximal strategy generating a plurality of strategies according to an embodiment of the disclosure;
FIG. 4 schematically illustrates a track sequence diagram according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a schematic diagram of a PPO-Clip algorithm according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates a schematic view of obstacle avoidance principles of a mimicry deformable robot in accordance with an embodiment of the disclosure;
FIG. 7 schematically illustrates an obstacle avoidance scenario diagram of a mimicry deformable robot according to an embodiment of the present disclosure;
FIG. 8 schematically illustrates an obstacle avoidance success rate diagram of a mimicry deformable robot according to an embodiment of the present disclosure;
FIG. 9 schematically illustrates a flow chart of a method of obstacle avoidance of a mimicry deformable robot according to an embodiment of the present disclosure;
FIG. 10 schematically illustrates a block diagram of an apparatus for training an obstacle avoidance model, in accordance with an embodiment of the present disclosure;
FIG. 11 schematically illustrates a block diagram of an obstacle avoidance device of a mimicry deformable robot in accordance with an embodiment of the disclosure;
Fig. 12 schematically shows a block diagram of an electronic device adapted to implement the method described above, according to an embodiment of the disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is only exemplary and is not intended to limit the scope of the present disclosure. In the following detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the present disclosure. It may be evident, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted so as not to unnecessarily obscure the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. The terms "comprises," "comprising," and/or the like, as used herein, specify the presence of stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (which may include technical and scientific terms) used herein have the meanings commonly understood by one of ordinary skill in the art unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of the present specification and should not be construed in an idealized or overly formal manner.
Where a convention analogous to "at least one of A, B and C, etc." is used, in general such a convention should be interpreted in accordance with the meaning of one having ordinary skill in the art (e.g., "a system having at least one of A, B and C" would include but not be limited to systems having a alone, B alone, C alone, a and B together, a and C together, B and C together, and/or A, B, C together, etc.).
Embodiments of the present disclosure provide a method for training an obstacle avoidance model based on a proximal strategy of a strategy gradient, an obstacle avoidance method, an apparatus, an electronic device, a computer-readable storage medium, and a computer program product for mimicking a deformable robot. The method for training the obstacle avoidance model can comprise the following steps: for each training sample, generating a strategy comprising a plurality of actions according to the environment information by a strategy function; aiming at each strategy, processing the track by using a reward and punishment function to obtain accumulated rewards; determining a desired prize according to the plurality of cumulative rewards and the probabilities; conducting derivation processing on the expected rewards to obtain a first derivative; updating the initial parameters according to the first derivative based on a strategy gradient algorithm to obtain target parameters; controlling the simulated deformable robot to walk in the environment information by using the trained obstacle avoidance model, and under the condition that the simulated deformable robot collides with the obstacle information, iteratively using other training samples to train the obstacle avoidance model to obtain target parameters of a new trained obstacle avoidance model; and under the condition that the mimicry deformable robot does not collide with the obstacle information, determining the trained obstacle avoidance model as a target obstacle avoidance model.
Fig. 1 schematically illustrates an exemplary system architecture 100 to which a method of training an obstacle avoidance model may be applied, in accordance with an embodiment of the present disclosure. It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios.
As shown in fig. 1, a system architecture 100 according to this embodiment may include mimicry deformable robots 101, 102, 103, a network 104, and a server 105. The network 104 is used to provide a medium for communication links between the mimicry deformable robots 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links, and the like.
The user may interact with the server 105 via the network 104 using the mimicry deformable robots 101, 102, 103 to receive or send messages, etc. The mimicry deformable robots 101, 102, 103 may have various communication client applications installed thereon, such as training class applications, environmental information processing applications.
The mimicry deformable robots 101, 102, 103 may be various robots having robotic arms and supporting walking, deformation.
The server 105 may be a server providing various services, such as a background management server (by way of example only) that provides support for the walking trajectories of the mimicry deformable robots 101, 102, 103 in their operating environment. The background management server may analyze and process the received environmental information and other data, and feed back the processing result (e.g., the target obstacle avoidance model generated according to the environmental information) to the mimicry deformable robot.
It should be noted that the method for training the obstacle avoidance model based on the proximal strategy of the strategy gradient provided in the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the device for training the obstacle avoidance model based on the proximal strategy of the strategy gradient provided in the embodiments of the present disclosure may be generally disposed in the server 105. The method of training the obstacle avoidance model based on the proximal strategy of the strategy gradient provided by the embodiments of the present disclosure may also be performed by a server or cluster of servers other than the server 105 and capable of communicating with the mimicry deformable robots 101, 102, 103 and/or the server 105. Accordingly, the device for training the obstacle avoidance model based on the proximal strategy of the strategy gradient provided in the embodiments of the disclosure may also be provided in a server or a server cluster different from the server 105 and capable of communicating with the mimicry deformable robots 101, 102, 103 and/or the server 105. Alternatively, the method for training the obstacle avoidance model based on the proximal strategy of the strategy gradient provided by the embodiments of the present disclosure may be performed by the mimicking deformable robot 101, 102, or 103, or may be performed by another mimicking deformable robot different from the mimicking deformable robot 101, 102, or 103. Accordingly, the device for training the obstacle avoidance model based on the proximal strategy of the strategy gradient provided in the embodiments of the present disclosure may also be disposed in the mimicking deformable robot 101, 102, or 103, or in another mimicking deformable robot different from the mimicking deformable robot 101, 102, or 103.
It should be understood that the number of mimicry deformable robots, networks, and servers in fig. 1 is merely illustrative. There may be any number of mimicry deformable robots, networks, and servers, as desired for implementation.
Fig. 2 schematically illustrates a flowchart of a method of training an obstacle avoidance model, according to an embodiment of the disclosure.
As shown in fig. 2, a method of training an obstacle avoidance model based on a proximal strategy of a strategy gradient, the obstacle avoidance model being applied to a mimicry deformable robot, may include operations S201 to S207.
In operation S201, for each training sample in the training sample set of each walking phase, a plurality of strategies are generated according to the environmental information, wherein the training samples may include the environmental information acquired by the mimicry deformable robot, which may include obstacle information, each strategy may include a trajectory of the mimicry deformable robot walking in the external environment, the trajectory may include a plurality of discrete actions and states corresponding to each action, the actions are generated by using a strategy function that may include initial parameters, wherein the actions include at least one of: serpentine deformation action, spherical deformation action, square or rectangular deformation action.
In operation S202, for each strategy, the trajectory is processed by using a reward and punishment function to obtain a cumulative reward of the trajectory, wherein the reward and punishment function is determined according to the rewards corresponding to the initial parameters, and the reward and punishment function and the strategy function of different mimicking walking stages are different.
In operation S203, a desired reward of the policy is determined according to a plurality of cumulative rewards corresponding to the plurality of policies and a probability corresponding to each track, wherein the probability characterizes a probability of selecting a track of the current policy from tracks corresponding to the plurality of policies according to the current state by the mimicry deformable robot.
In operation S204, a derivative process is performed on the desired prize to obtain a first derivative.
In operation S205, the initial parameters are updated according to the first derivative based on the strategy gradient algorithm, to obtain the target parameters of the trained obstacle avoidance model.
In operation S206, the simulated deformable robot is controlled to walk in the environmental information of the walking stage by using the trained obstacle avoidance model, and in the case that the simulated deformable robot collides with the obstacle information, the obstacle avoidance model is trained by using other training samples iteratively, so as to obtain the target parameters of the new trained obstacle avoidance model.
In operation S207, in the case where the mimicry deformable robot does not collide with the obstacle information, the trained obstacle avoidance model is determined as a target obstacle avoidance model of the walking phase.
According to the embodiment of the disclosure, the mimicry deformable robot refers to a robot whose structure can deform freely; under the control of an intelligent feedback control algorithm, the mimicry deformable robot can deform much like a living organism.
According to embodiments of the present disclosure, the deformable forms (actions) of the mimicry deformable robot, i.e. its action space, may include, but are not limited to, a serpentine deformation action, a spherical deformation action, and a square or rectangular deformation action; the mimicry deformable robot performs different discrete actions according to the acquired environmental information including the obstacle information, so as to avoid collision. According to an embodiment of the present disclosure, a strategy may refer to a walking path in the environment: for example, when walking in an environment with one obstacle, the path that passes the obstacle on the left and reaches the destination is one strategy, and the path that passes it on the right is another strategy. A walking stage may refer to the next walking path of the mimicry deformable robot at a different position, as illustrated by the sketch below.
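For illustration, the discrete action space and the distance-based state described above could be represented as in the following Python sketch; the action encoding and the number of ultrasonic range readings are assumptions made for this example, not values fixed by the disclosure.

```python
from enum import IntEnum
import numpy as np

class MimicryAction(IntEnum):
    """Discrete deformation actions of the mimicry deformable robot (assumed encoding)."""
    SERPENTINE = 0    # serpentine deformation action
    SPHERICAL = 1     # spherical deformation action
    SQUARE_RECT = 2   # square or rectangular deformation action

def build_state(ultrasonic_distances):
    """Build the state vector from obstacle distances measured by the ultrasonic sensors."""
    return np.asarray(ultrasonic_distances, dtype=np.float32)

# Example: one state observed by 8 ultrasonic sensors (metres, hypothetical values)
state = build_state([1.2, 0.8, 2.5, 3.0, 0.6, 1.9, 2.2, 4.0])
```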
According to the embodiment of the disclosure, when training the obstacle avoidance model, a plurality of strategies for the mimicry deformable robot to bypass the obstacle can be generated according to the environmental information in the training sample. For each strategy, the trajectory τ of the strategy is processed by the reward-and-punishment function to obtain the cumulative reward of the trajectory. From the plurality of cumulative rewards of the plurality of strategies and the probability $p_\theta(\tau)$ corresponding to each trajectory τ, the expected reward $\bar{R}_\theta$ of the strategy is determined, and the expected reward is differentiated to obtain the first derivative $\nabla\bar{R}_\theta$ shown in formula (3).
Here the trajectory τ may be represented as τ = {s_1, a_1, s_2, a_2, ..., s_T, a_T}, where a_t denotes the action of the t-th step and s_t denotes the state of the t-th step. The probability $p_\theta(\tau)$ of the trajectory τ can be expressed by formula (1), and the expected reward $\bar{R}_\theta$ by formula (2):

$$p_\theta(\tau)=p(s_1)\prod_{t=1}^{T}p_\theta(a_t\mid s_t)\,p(s_{t+1}\mid s_t,a_t) \tag{1}$$

$$\bar{R}_\theta=\sum_\tau R(\tau)\,p_\theta(\tau)=E_{\tau\sim p_\theta(\tau)}[R(\tau)] \tag{2}$$

$$\nabla\bar{R}_\theta=\sum_\tau R(\tau)\,\nabla p_\theta(\tau) \tag{3}$$

where T denotes the total number of actions, $p_\theta(a_t\mid s_t)$ denotes the probability of selecting the current action in the current state, and $p(s_{t+1}\mid s_t,a_t)$ denotes the environment state-transition probability.
According to an embodiment of the present disclosure, based on the strategy gradient algorithm, the initial parameter θ is updated according to the first derivative $\nabla\bar{R}_\theta$ to obtain the target parameter θ° of the trained obstacle avoidance model. The trained obstacle avoidance model is used to control the mimicry deformable robot to perform a simulated walk in the environment information of the training sample. If the mimicry deformable robot collides with the obstacle information, the obstacle avoidance model is trained iteratively with other training samples to obtain the target parameters of a newly trained obstacle avoidance model, until the mimicry deformable robot, controlled by the trained obstacle avoidance model, walks in the environment information without collision; the trained obstacle avoidance model is then determined as the target obstacle avoidance model of the walking stage.
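The training flow of operations S201 to S207, including the simulate-and-iterate step just described, can be summarised in the following Python sketch. All helper callables (`generate_policies`, `cumulative_reward`, `gradient_of_expected_reward`, `simulate_walk`) and the training-sample layout are hypothetical stand-ins for the steps described above, not functions defined by the disclosure.

```python
def train_walking_phase(training_samples, theta, generate_policies,
                        cumulative_reward, gradient_of_expected_reward,
                        simulate_walk, eta=1e-3):
    """Operations S201-S207: train until the simulated walk is collision-free."""
    for sample in training_samples:                                 # S201: each training sample
        policies = generate_policies(sample["env_info"], theta)     # trajectories tau per strategy
        returns = [cumulative_reward(tau) for tau in policies]      # S202: cumulative rewards R(tau)
        grad = gradient_of_expected_reward(policies, returns, theta)  # S203-S204: expected reward gradient
        theta = theta + eta * grad                                  # S205: gradient-ascent update
        if simulate_walk(sample["env_info"], theta):                # S206: True = no collision
            return theta                                            # S207: target obstacle avoidance model
    return theta                                                    # otherwise keep the latest parameters
```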
According to the embodiment of the disclosure, the strategies corresponding to the training samples are determined, the expected reward of the corresponding trajectory is determined based on the strategies, and the initial parameters are updated with the first derivative obtained by differentiating the expected reward, so as to obtain the target obstacle avoidance model. When walking in the environment with the target obstacle avoidance model, the mimicry deformable robot can handle complex obstacle avoidance scenes and achieve a good obstacle avoidance effect in them, so that the technical problem that the traditional linear and nonlinear control theory based on a rigid-body model cannot achieve a satisfactory obstacle avoidance effect is at least partially overcome.
FIG. 3 schematically illustrates a schematic diagram of a proximal strategy generating a plurality of strategies according to an embodiment of the disclosure. FIG. 4 schematically illustrates a trajectory sequence diagram according to an embodiment of the present disclosure.
As shown in fig. 3 and 4, generating a plurality of policies according to the environment information may include the following operations:
A plurality of current actions are generated from the environment information by the strategy function. For each of the plurality of current actions, the current action and its state are processed by a value function to obtain the current value corresponding to the current action. The current action and the current value are then processed by the advantage function to generate the next action.
According to an embodiment of the present disclosure, a plurality of current actions a_t (i.e., A_t in FIG. 3) are generated from the environment information by the strategy function (i.e., the Actor network in FIG. 3), the states s_t (i.e., S_t in FIG. 3) corresponding to the actions a_t are extracted from an experience replay pool, the current action and state are processed by the value function (i.e., the Critic network in FIG. 3) to obtain the current value V(s_t) corresponding to the current action, and the current action a_t and the current value V(s_t) are processed by the advantage function A^{θ'}(s_t, a_t) to generate the next action a_{t+1}, so that the trajectory sequence shown in FIG. 4 can be obtained. The experience replay pool stores at least the state corresponding to each historical action; G in FIG. 4 denotes the cumulative reward, and each row of actions and states is one trajectory, i.e., one strategy.
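A minimal PyTorch sketch of the Actor-Critic rollout step of FIG. 3 is given below, assuming small fully connected networks and a three-action discrete space; the layer sizes, the state dimension, and the network names are illustrative assumptions, not the networks of the disclosure.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

STATE_DIM, N_ACTIONS = 8, 3   # assumed: 8 ultrasonic readings, 3 deformation actions

actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, N_ACTIONS))
critic = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.Tanh(), nn.Linear(64, 1))

def step(state_t):
    """Actor proposes action a_t; Critic scores the state with V(s_t)."""
    dist = Categorical(logits=actor(state_t))   # strategy p_theta(a_t | s_t)
    a_t = dist.sample()
    log_prob = dist.log_prob(a_t)               # needed later for the gradient estimate
    v_t = critic(state_t).squeeze(-1)           # V(s_t), used to form A(s_t, a_t)
    return a_t, log_prob, v_t

# Example rollout step on a hypothetical state built from 8 ultrasonic distances
s = torch.tensor([1.2, 0.8, 2.5, 3.0, 0.6, 1.9, 2.2, 4.0])
a_t, logp, v_t = step(s)
```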
According to embodiments of the present disclosure, the strategy gradient algorithm may include a gradient ascent method.
Based on a strategy gradient algorithm, updating initial parameters according to a first derivative to obtain target parameters of a trained obstacle avoidance model, wherein the method can comprise the following operations:
And converting the first derivative by using a logarithmic function derivative formula to obtain a second derivative. Based on the gradient ascent method, a desired average of rewards is determined using a plurality of cumulative rewards corresponding to the plurality of training samples. The third derivative is determined based on the bonus desired average and the second derivative. The target parameter is determined based on the third derivative and the initial parameter.
According to an embodiment of the present disclosure, the first derivative $\nabla\bar{R}_\theta$ is converted by the logarithmic-function derivative formula to obtain the second derivative. Assuming that each trajectory τ yields a cumulative reward R(τ), a weighted summation over the probability $p_\theta(\tau)$ of the trajectory τ is performed, based on the gradient ascent method, to determine the reward expected average; the third derivative is determined from the reward expected average and the second derivative, and the target parameter θ° is determined from the third derivative and the initial parameter θ, as shown in formula (4):

$$\theta^{\circ}=\theta+\eta\,\nabla\bar{R}_\theta \tag{4}$$

where η denotes the update weight (step size) applied to the reward gradient obtained after an action is performed.
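As a hedged illustration of formulas (3) and (4), the following PyTorch sketch performs one gradient-ascent update of the strategy-network parameters. The SGD optimiser, the learning-rate value, and the `(log_probs, R_tau)` data layout are assumptions for this example; `actor` is any differentiable strategy network, such as the one sketched earlier.

```python
import torch

def policy_gradient_step(actor, trajectories, lr=1e-3):
    """One update theta <- theta + eta * grad R_bar (formulas (3)-(4)).
    trajectories: list of (log_probs, R_tau) pairs sampled with the current strategy."""
    optimizer = torch.optim.SGD(actor.parameters(), lr=lr)   # eta = lr (assumed value)
    optimizer.zero_grad()
    # Monte-Carlo estimate: (1/N) sum_n R(tau^n) * sum_t log p_theta(a_t^n | s_t^n)
    objective = torch.stack(
        [R_tau * torch.stack(log_probs).sum() for log_probs, R_tau in trajectories]
    ).mean()
    (-objective).backward()   # ascend the expected reward by descending its negative
    optimizer.step()
```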
According to an embodiment of the present disclosure, the reward-and-punishment function R is as shown in formula (5):

$$R_\theta=\begin{cases}250, & d<D\\ -150, & d>D\end{cases} \tag{5}$$

where θ denotes an initial parameter, R_θ denotes the cumulative reward corresponding to the initial parameter, R_θ = 250 when d is less than D, R_θ = −150 when d is greater than D, and k is a discount factor set according to the actual situation; d denotes the punishment distance of the mimicry deformable robot from the obstacle, and D denotes the minimum distance of the mimicry deformable robot from the obstacle.
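A minimal sketch of the reward-and-punishment logic of formula (5) follows, with the branch conditions taken exactly as stated above; the handling of the boundary case d = D and the use of the discount factor k when accumulating per-step rewards are assumptions, since they are not spelled out in the text.

```python
def reward_punishment(d, D):
    """Per-step reward R_theta as described for formula (5).
    d: punishment distance from the obstacle; D: minimum distance threshold (as defined above).
    Branch ordering follows the textual description; adjust if the patent figure differs."""
    if d < D:
        return 250.0
    if d > D:
        return -150.0
    return 0.0   # boundary case not specified in the text (assumption)

def cumulative_reward(step_rewards, k=0.99):
    """Discounted accumulation of per-step rewards with discount factor k (assumed usage of k)."""
    G, discount = 0.0, 1.0
    for r in step_rewards:
        G += discount * r
        discount *= k
    return G
```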
The logarithmic-function derivative formula is shown in formula (6); converting the first derivative with it gives the second derivative $\nabla\bar{R}_\theta=E_{\tau\sim p_\theta(\tau)}[R(\tau)\nabla\log p_\theta(\tau)]$, and approximating the expectation by sampling N trajectories gives the third derivative shown in formula (7):

$$\nabla\log p_\theta(\tau)=\frac{\nabla p_\theta(\tau)}{p_\theta(\tau)} \tag{6}$$

$$\nabla\bar{R}_\theta=\sum_\tau R(\tau)\,p_\theta(\tau)\,\nabla\log p_\theta(\tau)=E_{\tau\sim p_\theta(\tau)}[R(\tau)\nabla\log p_\theta(\tau)]\approx\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}R(\tau^n)\,\nabla\log p_\theta(a_t^n\mid s_t^n) \tag{7}$$

where τ denotes a trajectory; R denotes the cumulative reward; θ denotes an initial parameter; p_θ denotes a probability; $E_{\tau\sim p_\theta(\tau)}$ denotes the reward expected average; N denotes the number of sampled trajectories; n indexes the n-th trajectory; a denotes an action; s denotes a state; t denotes the t-th step; and T_n denotes the total number of actions or states in the n-th trajectory.
According to another embodiment of the present disclosure, updating the initial parameters according to the first derivative to obtain target parameters of the trained obstacle avoidance model may include the following operations:
And converting the first derivative by using a logarithmic function derivative formula to obtain a fourth derivative. The fifth derivative is determined based on the fourth derivative and the prize average. And carrying out weight optimization processing on the fifth derivative to obtain a sixth derivative. And carrying out parameter replacement processing on the sixth derivative to obtain a seventh derivative. The target parameter is determined based on the seventh derivative and the initial parameter.
According to an embodiment of the present disclosure, the fourth derivative may be different from the second derivative $\nabla\bar{R}_\theta=E_{\tau\sim p_\theta(\tau)}[R(\tau)\nabla\log p_\theta(\tau)]$; both are obtained by converting the first derivative using the logarithmic-function derivative formula.
According to the embodiment of the disclosure, the core idea of the strategy gradient algorithm is to increase the sampling probability of actions that earn larger rewards and decrease the sampling probability of actions that earn smaller rewards, so that the agent learns the optimal behaviour strategy. However, when the reward-and-punishment function is designed so that the rewards of most actions are positive, the agent (the mimicry deformable robot of the present disclosure) may learn the following suboptimal behaviour: at the beginning of training only a few actions are collected, and the probability of these actions increases after the update; in subsequent training these lower-reward actions are sampled repeatedly, their probability of occurrence grows larger and larger and gradually exceeds that of the higher-reward actions, and the agent learns a suboptimal strategy and falls into a local optimum. The cause of this phenomenon is that the designed reward-and-punishment function makes all action rewards positive, so the strategy gradient algorithm suffers from a local-optimum problem. By adding the average reward E[R(τ)] obtained after one round of training as a baseline in the calculation formula of the first derivative, the action rewards become both positive and negative during the calculation, eliminating the suboptimal solution caused by this problem. The value of R(τ) is continuously recorded and averaged during training, so the baseline is continuously updated. The fifth derivative is calculated as shown in formula (8):

$$\nabla\bar{R}_\theta\approx\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\bigl(R(\tau^n)-b\bigr)\,\nabla\log p_\theta(a_t^n\mid s_t^n),\qquad b=E[R(\tau)] \tag{8}$$
According to an embodiment of the present disclosure, the fifth derivative of formula (8) weights all state-action pairs of a trajectory with the same reward. However, within one round some actions are good and some are bad, and the total reward of a trajectory does not mean that every action on that trajectory is equally good or bad, so each action needs to be assigned a different weight. Since it is difficult to sample enough data in actual training, in order to assign a reasonable weight to each action, only the rewards obtained from the current action onwards are counted. Let $r_{t'}^{n}$ denote the reward of the n-th trajectory at the t'-th action; the sixth derivative can then be expressed by formula (9):

$$\nabla\bar{R}_\theta\approx\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\Bigl(\sum_{t'=t}^{T_n}r_{t'}^{n}-b\Bigr)\,\nabla\log p_\theta(a_t^n\mid s_t^n) \tag{9}$$
In accordance with an embodiment of the present disclosure, $\sum_{t'=t}^{T_n}r_{t'}^{n}$ is a generic representation of the trajectory reward and does not reflect the trade-off of future rewards against the current moment, so in the sixth derivative the term $\sum_{t'=t}^{T_n}r_{t'}^{n}$ is replaced by the discounted sum $\sum_{t'=t}^{T_n}\gamma^{t'-t}r_{t'}^{n}$, where γ is the discount factor. The seventh derivative shown in formula (10) is thus obtained, and, referring to formula (4), the target parameter is determined from the seventh derivative and the initial parameter θ:

$$\nabla\bar{R}_\theta\approx\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}A^{\theta'}(s_t^n,a_t^n)\,\nabla\log p_\theta(a_t^n\mid s_t^n) \tag{10}$$

where $\sum_{t'=t}^{T_n}\gamma^{t'-t}r_{t'}^{n}-b$ serves as the advantage function $A^{\theta'}(s_t,a_t)$.
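The baseline of formula (8), the reward-to-go of formula (9), and the discounted advantage used in formula (10) can be combined into per-step weights as in the sketch below; the symbol γ for the discount factor and the use of a running average of R(τ) as the baseline are assumptions consistent with the description above.

```python
import numpy as np

def advantage_weights(step_rewards, baseline, gamma=0.99):
    """Per-step weights: sum_{t'>=t} gamma^(t'-t) r_{t'} - baseline (formulas (8)-(10))."""
    T = len(step_rewards)
    reward_to_go = np.zeros(T, dtype=np.float64)
    running = 0.0
    for t in reversed(range(T)):              # accumulate discounted future rewards backwards
        running = step_rewards[t] + gamma * running
        reward_to_go[t] = running
    return reward_to_go - baseline            # subtracting E[R(tau)] keeps the weights signed

# baseline: running average of R(tau) recorded during training and continuously updated
```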
According to an embodiment of the present disclosure, updating the initial parameter according to the first derivative to obtain the target parameter may include the following operations:
And converting the first derivative by using a logarithmic function derivative formula to obtain an eighth derivative. And carrying out distribution conversion processing on the eighth derivative to obtain a ninth derivative. And obtaining a first gradient calculated value according to the advantage function and the ninth derivative. And obtaining a second gradient calculated value added with the importance sample according to the ninth derivative and the first gradient calculated value. And simplifying the second gradient calculated value to obtain a target gradient calculated value. And determining the target parameter according to the target gradient calculated value and the initial parameter.
According to an embodiment of the disclosure, the eighth derivative may be the same as the second derivative, i.e., $\nabla\bar{R}_\theta=E_{\tau\sim p_\theta(\tau)}[R(\tau)\nabla\log p_\theta(\tau)]$, obtained by converting the first derivative with the logarithmic-function derivative formula.
According to the embodiments of the present disclosure, the strategy gradient algorithm is typically an on-policy algorithm, that is, the same strategy is used as both the behaviour strategy and the evaluation strategy. This leads to extremely low data utilization, because the gradient of the objective function can only be calculated, and the network parameters updated, after the agent (the mimicry deformable robot) has collected the data of one complete round; once the network parameters are updated, that data must be discarded and new data collected to update the network again, so the training speed of the network is very slow. In order to solve the problem of the slow training speed of the strategy gradient algorithm, the PPO algorithm introduces the idea of importance sampling.
The idea of importance sampling enables the agent of the PPO algorithm to reuse, when updating the current behaviour strategy, the trajectory sequence data sampled by historical behaviour strategies with different parameters; thus, although PPO is an on-policy algorithm, it can utilize historical data, and the training speed of the network model is improved because the network can be updated with historical data. Suppose the data used to update the network parameters should come from the distribution p, but data can only be collected from the distribution q; then, to obtain the expectation of a function, the formula must be corrected for the conversion. The importance sampling formula is shown in formula (11), which is also the distribution conversion formula:

$$E_{x\sim p}[f(x)]=E_{x\sim q}\Bigl[f(x)\,\frac{p(x)}{q(x)}\Bigr] \tag{11}$$
The p-distribution and the q-distribution are obtained from environmental information acquired by the mimicry deformable robot, and for example, a threshold range may be set, environmental information within the threshold range is p-distribution, and environmental information outside the threshold range is q-distribution.
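The effect of formula (11) can be checked numerically: an expectation under a distribution p is recovered from samples drawn only from q by weighting each sample with p(x)/q(x). The two Gaussian distributions and the test function below are purely illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def gauss_pdf(x, mu, sigma):
    """Density of a normal distribution, used for both p and q."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

f = lambda x: x ** 2                     # test function; E_{x~p}[x^2] = 1 for p = N(0, 1)

mu_p, sig_p = 0.0, 1.0                   # target distribution p (illustrative)
mu_q, sig_q = 0.5, 1.5                   # sampling distribution q (illustrative)

x = rng.normal(mu_q, sig_q, size=100_000)                    # samples drawn only from q
w = gauss_pdf(x, mu_p, sig_p) / gauss_pdf(x, mu_q, sig_q)    # importance weights p(x)/q(x)

estimate = np.mean(f(x) * w)             # approximates E_{x~p}[f(x)] per formula (11)
```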
According to an embodiment of the present disclosure, the eighth derivative is subjected to distribution conversion based on the distribution conversion formula, resulting in the ninth derivative shown in formula (12). Using the advantage function $A^{\theta'}(s_t,a_t)$ in place of R(τ) gives the first gradient calculation value shown in formula (13). Combining the ninth derivative and the first gradient calculation value gives the second gradient calculation value, with importance sampling added, shown in formula (14). Since it is assumed that there is no large difference between the distributions of strategy π_θ and strategy π_{θ'}, the ratio $p_\theta(s_t)/p_{\theta'}(s_t)$ is approximately 1, so the second gradient calculation value can be simplified to obtain the target gradient calculation value shown in formula (15). Referring to formula (4), the target parameter is determined from the target gradient calculation value and the initial parameter θ.
It should be noted that, the policies pi θ and pi θ′ are obtained by sampling from the p distribution and the q distribution, respectively.
According to an embodiment of the present disclosure, the ninth derivative is as shown in formula (12), the first gradient calculation value as shown in formula (13), the second gradient calculation value as shown in formula (14), and the target gradient calculation value as shown in formula (15):

$$\nabla\bar{R}_\theta=E_{\tau\sim p_{\theta'}(\tau)}\Bigl[\frac{p_\theta(\tau)}{p_{\theta'}(\tau)}R(\tau)\,\nabla\log p_\theta(\tau)\Bigr] \tag{12}$$

$$E_{(s_t,a_t)\sim\pi_{\theta'}}\Bigl[\frac{p_\theta(s_t,a_t)}{p_{\theta'}(s_t,a_t)}A^{\theta'}(s_t,a_t)\,\nabla\log p_\theta(a_t^n\mid s_t^n)\Bigr] \tag{13}$$

$$E_{(s_t,a_t)\sim\pi_{\theta'}}\Bigl[\frac{p_\theta(a_t\mid s_t)}{p_{\theta'}(a_t\mid s_t)}\frac{p_\theta(s_t)}{p_{\theta'}(s_t)}A^{\theta'}(s_t,a_t)\,\nabla\log p_\theta(a_t^n\mid s_t^n)\Bigr] \tag{14}$$

$$E_{(s_t,a_t)\sim\pi_{\theta'}}\Bigl[\frac{p_\theta(a_t\mid s_t)}{p_{\theta'}(a_t\mid s_t)}A^{\theta'}(s_t,a_t)\,\nabla\log p_\theta(a_t^n\mid s_t^n)\Bigr] \tag{15}$$

where p_θ denotes the probability corresponding to the strategy π_θ, p_{θ'} denotes the probability corresponding to the transition strategy π_{θ'} obtained by converting the strategy π_θ, $A^{\theta'}(s_t,a_t)$ denotes the advantage function corresponding to the transition strategy, and E denotes the expectation.
According to an embodiment of the present disclosure, the method of training the obstacle avoidance model may further comprise the operations of:
An initial expected reward function is determined according to the target gradient calculation value and the target parameter, wherein the initial expected reward function may include the strategy distribution difference, the initial parameters, and the behaviour strategy parameters. The strategy distribution difference is clipped to obtain the target expected reward function.
The behavior policy parameters are processed based on a gradient ascent method such that a maximum desired prize value is determined from the trajectory corresponding to the target policy and the target desired prize function. And processing the initial parameters by using a gradient descent method based on the maximum expected reward value to obtain transition initial parameters.
And under the condition that the mean square error between the transition rewards and the transition values is smaller than a preset threshold value, determining the transition initial parameters as target initial parameters, wherein the transition rewards are determined according to a reward and punishment function, the transition initial parameters and a transition strategy, the transition values are determined according to the cost function, the transition initial parameters and the transition strategy, and the transition strategy is obtained by converting the strategy. And determining a new reward and punishment function according to the target initial parameters and the reward and punishment function.
According to embodiments of the present disclosure, the initial expected reward function $J^{\theta^{\circ\prime}}(\theta^{\circ})$ is determined from the target gradient calculation value and the target parameter θ°, as shown in formula (16):

$$J^{\theta^{\circ\prime}}(\theta^{\circ})=E_{(s_t,a_t)\sim\pi_{\theta^{\circ\prime}}}\Bigl[\frac{p_{\theta^{\circ}}(a_t\mid s_t)}{p_{\theta^{\circ\prime}}(a_t\mid s_t)}A^{\theta^{\circ\prime}}(s_t,a_t)\Bigr] \tag{16}$$
Fig. 5 schematically shows a schematic diagram of a PPO-Clip algorithm according to an embodiment of the present disclosure.
As shown in FIG. 5, the PPO-Clip algorithm uses a clip function to clip the strategy distribution difference $\frac{p_{\theta^{\circ}}(a_t\mid s_t)}{p_{\theta^{\circ\prime}}(a_t\mid s_t)}$ directly, so that the target expected reward function can be obtained as shown in formula (17):

$$J^{\theta^{\circ\prime}}_{clip}(\theta^{\circ})\approx\sum_{(s_t,a_t)}\min\Bigl(\frac{p_{\theta^{\circ}}(a_t\mid s_t)}{p_{\theta^{\circ\prime}}(a_t\mid s_t)}A^{\theta^{\circ\prime}}(s_t,a_t),\ \mathrm{clip}\Bigl(\frac{p_{\theta^{\circ}}(a_t\mid s_t)}{p_{\theta^{\circ\prime}}(a_t\mid s_t)},\,1-\varepsilon,\,1+\varepsilon\Bigr)A^{\theta^{\circ\prime}}(s_t,a_t)\Bigr) \tag{17}$$

According to an embodiment of the present disclosure, when $A^{\theta^{\circ\prime}}(s_t,a_t)$ is greater than 0 the clipped ratio takes a maximum value of 1 + ε, and when $A^{\theta^{\circ\prime}}(s_t,a_t)$ is less than 0 its minimum value is 1 − ε.
According to an embodiment of the present disclosure, the behaviour strategy parameters are processed based on the gradient ascent method so that the maximum expected reward value is determined from the trajectory corresponding to the target strategy π_{θ'} and the target expected reward function. Based on the maximum expected reward value, the initial parameter θ is processed by the gradient descent method to obtain the transition initial parameter. When the mean square error between the transition reward and the transition value is smaller than the preset threshold w, the transition initial parameter is determined as the target initial parameter, and a new reward-and-punishment function is determined from the target initial parameter and the reward-and-punishment function.
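A compact PyTorch sketch of one update under the clipped surrogate objective of formula (17) is given below; the clipping constant ε = 0.2, the discrete-action Categorical strategy, and the tensor layout of the inputs are illustrative assumptions.

```python
import torch
from torch.distributions import Categorical

def ppo_clip_update(actor, optimizer, states, actions, old_log_probs, advantages, eps=0.2):
    """One update maximising the clipped surrogate objective of formula (17)."""
    dist = Categorical(logits=actor(states))                      # current strategy p_theta(a|s)
    ratio = torch.exp(dist.log_prob(actions) - old_log_probs)     # p_theta / p_theta' per sample
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)            # clip the distribution difference
    objective = torch.min(ratio * advantages, clipped * advantages).mean()
    optimizer.zero_grad()
    (-objective).backward()                                        # ascend by descending the negative
    optimizer.step()
```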
Fig. 6 schematically illustrates a schematic view of obstacle avoidance principle of a mimicry deformable robot according to an embodiment of the present disclosure. Fig. 7 schematically illustrates an obstacle avoidance scenario diagram of a mimicry deformable robot according to an embodiment of the present disclosure. Fig. 8 schematically illustrates a schematic view of obstacle avoidance success rate of a mimicry deformable robot according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, based on the obstacle avoidance principle shown in FIG. 6, the mimicry deformable robot performs a simulation test in the underwater environment shown in FIG. 7, and the curve of its obstacle avoidance success rate as a function of the number of training rounds is shown in FIG. 8. As can be seen from FIG. 8, little statistical data is available at the beginning of training, so individual successes and failures have a large influence on the success rate and the curve is tortuous at the start of the experiment; the fluctuation of the curve gradually decreases as the amount of training increases. Finally, the obstacle avoidance success rate of the mimicry deformable robot converges to about 80%. This result shows not only that the PPO obstacle avoidance method has excellent obstacle avoidance capability in complex obstacle avoidance scenes, but also that it has excellent robustness.
Therefore, the obstacle avoidance method based on strategy-gradient proximal strategy optimization can flexibly and effectively handle unknown, complex, dynamic obstacle avoidance scenes; the model finally converges normally and maintains a good obstacle avoidance success rate and robustness in new obstacle avoidance scenes.
Fig. 9 schematically illustrates a flow chart of a method of obstacle avoidance of a mimicry deformable robot in accordance with an embodiment of the disclosure.
As shown in fig. 9, the obstacle avoidance method of the mimicry deformable robot may include operations S901 to S903.
In operation S901, for each target walking phase, target environment information, which may include target obstacle information, acquired by a plurality of ultrasonic sensors of a mimicry deformable robot is acquired.
In operation S902, the target environment information is processed using the target obstacle avoidance model, and a target track of the target walking stage is output, where the target track may include a plurality of discrete target actions walking in the target environment and a target state corresponding to each target action, and the target actions include at least one of: serpentine deformation action, spherical deformation action, and square or rectangular deformation action.
In operation S903, the mimicry deformable robot performs a walking operation of the target walking stage according to the target trajectory, wherein the walking operation can avoid collision of the mimicry deformable robot with the target obstacle information.
According to the embodiment of the disclosure, after the target environment information, which may include target obstacle information, is acquired by the plurality of ultrasonic sensors of the mimicry deformable robot for each target walking stage, the mimicry deformable robot is controlled to walk in the target environment information of the target walking stage using the trained obstacle avoidance model; the target point coordinates are randomly initialized at the beginning of each round, and the mimicry deformable robot is reset at the central coordinate (0, 0). If the mimicry deformable robot successfully avoids all target obstacles and reaches the target point, the success count is increased by 1; if it collides with an obstacle while travelling, the failure count is increased by 1.
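For concreteness, the deployment-time loop of operations S901 to S903 might be organised as in the following sketch; `read_ultrasonic_sensors`, `select_action`, `execute_deformation`, `reached_target`, and `collided` are hypothetical interfaces of the robot platform and of the target obstacle avoidance model, not APIs defined by the disclosure.

```python
import numpy as np

def obstacle_avoidance_walk(robot, target_model, max_steps=500):
    """One target walking stage: sense, choose the next target action, execute it."""
    for _ in range(max_steps):
        distances = robot.read_ultrasonic_sensors()         # S901: target environment information
        state = np.asarray(distances, dtype=np.float32)     # distance-based state space
        action = target_model.select_action(state)          # S902: next target action on the trajectory
        robot.execute_deformation(action)                   # S903: serpentine / spherical / rectangular
        if robot.collided():
            return False                                    # failure count + 1
        if robot.reached_target():
            return True                                     # success count + 1
    return False
```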
According to the embodiment of the disclosure, the strategies corresponding to the training samples are determined, the expected reward of the corresponding trajectory is determined based on the strategies, and the initial parameters are updated with the first derivative obtained by differentiation to obtain the obstacle avoidance model. When walking in the environment with the obstacle avoidance model, the mimicry deformable robot can handle complex obstacle avoidance scenes and achieve a good obstacle avoidance effect in them, so that the technical problem that the traditional linear and nonlinear control theory based on a rigid-body model cannot achieve a satisfactory obstacle avoidance effect is at least partially solved.
According to an embodiment of the present disclosure, before outputting the target trajectory, the following operations may be further included:
generating a target accumulated reward according to the target track, and updating the target parameters in the target obstacle avoidance model according to the target accumulated reward.
According to the embodiment of the disclosure, the target parameters are updated in real time with the environmental information collected while the mimicry deformable robot is in actual use, so that the obstacle avoidance model is continuously updated and optimized through repeated learning.
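A minimal sketch of the target accumulated reward used in this online update is given below, assuming a per-step reward and punishment function and a hypothetical discount factor; the resulting scalar can then be fed into the same gradient-based parameter update used during training.

```python
def target_cumulative_reward(target_track, reward_fn, gamma=0.99):
    """Discounted accumulation of per-step rewards along the target track.

    `target_track` is assumed to be a sequence of (target_action, target_state)
    pairs, and `reward_fn` is an assumed stand-in for the reward and punishment
    function of the current target walking stage.
    """
    return sum(gamma ** t * reward_fn(state, action)
               for t, (action, state) in enumerate(target_track))
```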
Fig. 10 schematically illustrates a block diagram of an apparatus for training an obstacle avoidance model according to an embodiment of the disclosure.
As shown in fig. 10, an apparatus 1000 for training an obstacle avoidance model based on a strategy-gradient proximal strategy may include a generating module 1001, a first obtaining module 1002, a first determining module 1003, a second deriving module 1004, a third obtaining module 1005, a simulation module 1006, and a second determining module 1007.
A generating module 1001, configured to generate, for each training sample in the training sample set of each walking stage, a plurality of policies according to the environmental information, where the training sample includes environmental information, acquired by the mimicry deformable robot, that may include obstacle information; each policy includes a trajectory of the mimicry deformable robot walking in the external environment; the trajectory includes a plurality of discrete actions and a state corresponding to each action; the actions are generated by using a policy function that includes initial parameters; and the actions include at least one of: a serpentine deformation action, a spherical deformation action, and a square or rectangular deformation action.
The first obtaining module 1002 is configured to process, for each policy, the track by using a reward and punishment function to obtain a cumulative reward of the track, where the reward and punishment function is determined according to the reward corresponding to the initial parameters, and the reward and punishment functions and the policy functions of different walking stages of the mimicry deformable robot are different.
A first determining module 1003, configured to determine an expected reward of the policies according to a plurality of cumulative rewards corresponding to the plurality of policies and a probability corresponding to each track, where the probability represents the probability that the mimicry deformable robot selects the track of the current policy, according to the current state, from the tracks corresponding to the plurality of policies.
A second deriving module 1004, configured to perform derivation processing on the expected reward to obtain a first derivative.
A third obtaining module 1005 is configured to update the initial parameters according to the first derivative based on a policy gradient algorithm to obtain target parameters of the trained obstacle avoidance model.
The simulation module 1006 is configured to control the mimicry deformable robot to walk in the environmental information of the walking stage by using the trained obstacle avoidance model, and to iteratively perform training of the obstacle avoidance model by using other training samples in the case where the mimicry deformable robot collides with the obstacle information, so as to obtain target parameters of a new trained obstacle avoidance model.
A second determining module 1007, configured to determine the trained obstacle avoidance model as the target obstacle avoidance model of the walking stage in the case where the mimicry deformable robot does not collide with the obstacle information.
According to the embodiment of the disclosure, the strategies corresponding to the training sample are determined, and the expected rewards of the corresponding tracks are determined based on the strategies, so that the initial parameters are updated with the first derivative obtained by derivation to obtain the target obstacle avoidance model. With the target obstacle avoidance model, the mimicry deformable robot can handle a complex obstacle avoidance scene when walking in the environment and achieve a better obstacle avoidance effect, so that the technical problem that the traditional linear and nonlinear control theory based on the rigid body model cannot achieve a satisfactory obstacle avoidance effect is at least partially overcome.
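To make the division of labour among modules 1001 to 1007 easier to follow, the skeleton below strings them together for one walking stage. All callables passed in (collect, cumulative_reward, expected_reward_grad, apply_gradient, walks_without_collision) are hypothetical names introduced only for this sketch, not the patented API.

```python
def train_one_stage(samples, policy, env, collect, cumulative_reward,
                    expected_reward_grad, apply_gradient, walks_without_collision):
    """Schematic flow of one walking stage; all helpers are caller-supplied."""
    for sample in samples:
        trajectories = collect(policy, env, sample)                  # generating module 1001
        returns = [cumulative_reward(tau) for tau in trajectories]   # first obtaining module 1002
        grad = expected_reward_grad(policy, trajectories, returns)   # modules 1003-1004
        apply_gradient(policy, grad)                                 # third obtaining module 1005
        if walks_without_collision(policy, env, sample):             # modules 1006-1007
            return policy   # accepted as the target obstacle avoidance model of this stage
    return policy           # last iterate after exhausting the training samples
```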
According to an embodiment of the present disclosure, the generating module 1001 may include a first generating unit, a first obtaining unit, and a second generating unit.
The first generation unit is used for generating a plurality of current actions according to the environment information by utilizing the strategy function.
A first obtaining unit, configured to process, for each current action of the plurality of current actions, the current action and its state by using a cost function to obtain a current value corresponding to the current action.
And the second generating unit is used for processing the current action and the current value by utilizing the advantage function and generating the next action.
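A compact actor-critic sketch of these three units is shown below: the policy (strategy function) head proposes a deformation action, the value (cost function) head scores the state, and a simple one-step advantage guides the choice of the next action. The network sizes, the discrete three-action encoding, and the TD-style advantage are assumptions made for illustration.

```python
import torch
import torch.nn as nn

ACTIONS = ["serpentine", "spherical", "square"]   # discrete deformation actions

class ActorCritic(nn.Module):
    def __init__(self, state_dim, n_actions=len(ACTIONS), hidden=64):
        super().__init__()
        # Strategy (policy) function with trainable initial parameters theta.
        self.policy = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                    nn.Linear(hidden, n_actions))
        # Cost (value) function used to score the current state.
        self.value = nn.Sequential(nn.Linear(state_dim, hidden), nn.Tanh(),
                                   nn.Linear(hidden, 1))

    def act(self, state):
        dist = torch.distributions.Categorical(logits=self.policy(state))
        action = dist.sample()                  # current action
        value = self.value(state).squeeze(-1)   # current value
        return action, dist.log_prob(action), value

def one_step_advantage(reward, value, next_value, gamma=0.99):
    # Simple TD-style advantage used to prefer the next action.
    return reward + gamma * next_value - value
```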
According to embodiments of the present disclosure, the strategy gradient algorithm may include a gradient ascent method.
According to an embodiment of the present disclosure, the third obtaining module 1005 may include a first converting unit, a first determining unit, a second determining unit, and a third determining unit.
The first conversion unit is used for converting the first derivative by using a logarithmic function derivative formula to obtain a second derivative.
And the first determining unit is used for determining a reward expected average value by utilizing a plurality of accumulated rewards corresponding to a plurality of training samples based on the gradient rising method.
And a second determining unit for determining a third derivative according to the bonus expected average and the second derivative.
And a third determining unit for determining a target parameter based on the third derivative and the initial parameter.
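Read together, these units amount to a sampled gradient-ascent step on the expected reward. The sketch below uses the reward expected average as a baseline, which is an interpretive assumption, and assumes a policy object exposing log_prob(state, action).

```python
import torch

def policy_gradient_step(policy, optimizer, trajectories, returns):
    """One gradient-ascent step on the expected reward estimated from N trajectories."""
    baseline = sum(returns) / len(returns)        # reward expected average value
    loss = torch.zeros(())
    for tau, R in zip(trajectories, returns):
        log_prob = torch.stack([policy.log_prob(s, a) for s, a in tau]).sum()
        loss = loss - (R - baseline) * log_prob   # negative of one sampled gradient term
    loss = loss / len(trajectories)
    optimizer.zero_grad()
    loss.backward()    # autograd supplies the grad-log-probability factors
    optimizer.step()   # ascent on the expected reward via descent on -loss
```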
According to an embodiment of the present disclosure, the third obtaining module 1005 may include a second converting unit, a fourth determining unit, an optimizing unit, a replacing unit, and a fifth determining unit.
And the second conversion unit is used for converting the first derivative by using a logarithmic function derivative formula to obtain a fourth derivative.
And a fourth determining unit for determining a fifth derivative according to the fourth derivative and the average value of rewards.
And the optimizing unit is used for carrying out weight optimizing processing on the fifth derivative to obtain a sixth derivative.
And the replacing unit is used for carrying out parameter replacement processing on the sixth derivative to obtain a seventh derivative.
And a fifth determining unit for determining a target parameter based on the seventh derivative and the initial parameter.
According to an embodiment of the present disclosure, the third obtaining module 1005 may include a third converting unit, a fourth converting unit, a second obtaining unit, a third obtaining unit, a simplifying unit, and a sixth determining unit.
And the third conversion unit is used for converting the first derivative by using a logarithmic function derivative formula to obtain an eighth derivative.
And the fourth conversion unit is used for carrying out distribution conversion processing on the eighth derivative to obtain a ninth derivative.
And the second obtaining unit is used for obtaining a first gradient calculated value according to the advantage function and the ninth derivative.
And a third obtaining unit, configured to obtain a second gradient calculation value added with the importance sample according to the ninth derivative and the first gradient calculation value.
And the simplifying unit is used for carrying out simplifying processing on the second gradient calculated value to obtain a target gradient calculated value.
And a sixth determining unit for determining the target parameter according to the target gradient calculation value and the initial parameter.
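Read as a chain, the units above correspond to the standard importance-sampling rewrite of the policy gradient; the derivation below is offered only as a reading aid, and the exact notation of the patent's own formulas may differ:

$$\begin{aligned}
\nabla \bar{R}_\theta
  &= E_{\tau \sim p_\theta(\tau)}\big[R(\tau)\,\nabla \log p_\theta(\tau)\big] \\
  &\approx E_{(s_t,a_t) \sim \pi_\theta}\big[A^{\theta}(s_t,a_t)\,\nabla \log p_\theta(a_t \mid s_t)\big] \\
  &= E_{(s_t,a_t) \sim \pi_{\theta'}}\Big[\tfrac{p_\theta(s_t,a_t)}{p_{\theta'}(s_t,a_t)}\,A^{\theta'}(s_t,a_t)\,\nabla \log p_\theta(a_t \mid s_t)\Big] \\
  &\approx E_{(s_t,a_t) \sim \pi_{\theta'}}\Big[\tfrac{p_\theta(a_t \mid s_t)}{p_{\theta'}(a_t \mid s_t)}\,A^{\theta'}(s_t,a_t)\,\nabla \log p_\theta(a_t \mid s_t)\Big],
\end{aligned}$$

where the first line is the distribution-converted form, the second introduces the advantage function (the first gradient calculation value), the third adds importance sampling over the transition strategy, and the last line is the simplified target gradient calculation value.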
According to an embodiment of the present disclosure, the third obtaining module 1005 may further include a seventh determining unit, a fourth obtaining unit, an eighth determining unit, a fifth obtaining unit, a ninth determining unit, and a tenth determining unit.
A seventh determining unit, configured to determine an initial expected reward function according to the target gradient calculation value and the target parameter, where the initial expected reward function may include a policy distribution difference, the initial parameters, and behavior policy parameters.
A fourth obtaining unit, configured to clip the policy distribution difference to obtain a target expected reward function.
An eighth determining unit, configured to process the behavior policy parameters based on the gradient ascent method, so that a maximum expected reward value is determined according to the trajectory corresponding to the target policy and the target expected reward function.
And a fifth obtaining unit, configured to process the initial parameters by using a gradient descent method based on the maximum expected reward value, and obtain the transition initial parameters.
A ninth determining unit, configured to determine the transition initial parameter as a target initial parameter in the case where the mean square error between a transition reward and a transition value is smaller than a preset threshold, where the transition reward is determined according to the reward and punishment function, the transition initial parameter, and a transition policy; the transition value is determined according to the cost function, the transition initial parameter, and the transition policy; and the transition policy is obtained by converting the policy.
And the tenth determining unit is used for determining a new reward and punishment function according to the target initial parameters and the reward and punishment function.
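Taken together, these units describe a PPO-clip style update: clip the policy distribution ratio, ascend the clipped expected reward for the policy parameters, descend a mean-square-error loss for the value, and refresh the behavior (transition) policy once the value error is small enough. The sketch below is one plausible realization; the clipping range, the threshold, and the log_prob/sync_behavior_policy helpers are assumptions.

```python
import torch
import torch.nn.functional as F

def ppo_clip_update(policy, value_fn, pi_opt, v_opt, batch,
                    eps=0.2, mse_threshold=1e-2):
    states, actions, old_log_probs, returns, advantages = batch

    # Clipped target expected reward (surrogate objective).
    log_probs = policy.log_prob(states, actions)          # assumed helper
    ratio = torch.exp(log_probs - old_log_probs)          # policy distribution difference
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages).mean()

    pi_opt.zero_grad()
    (-surrogate).backward()                               # gradient ascent on the surrogate
    pi_opt.step()

    # Gradient descent on the value: transition value vs. transition reward.
    values = value_fn(states).squeeze(-1)
    v_loss = F.mse_loss(values, returns)
    v_opt.zero_grad()
    v_loss.backward()
    v_opt.step()

    # Accept the transition parameters and refresh the behavior policy once the
    # mean square error falls below the preset threshold.
    if v_loss.item() < mse_threshold:
        policy.sync_behavior_policy()                     # assumed helper
    return surrogate.item(), v_loss.item()
```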
Fig. 11 schematically illustrates a block diagram of an obstacle avoidance apparatus of a mimicry deformable robot according to an embodiment of the disclosure.
As shown in fig. 11, the obstacle avoidance apparatus 1100 of the mimicry deformable robot may include an acquisition module 1101, an output module 1102, and an execution module 1103.
The acquiring module 1101 is configured to acquire, for each target walking stage, target environment information acquired by a plurality of ultrasonic sensors of the mimicry deformable robot, where the target environment information may include target obstacle information.
The output module 1102 is configured to process the target environment information by using the target obstacle avoidance model and output a target track of the target walking stage, where the target track may include a plurality of discrete target actions for walking in the target environment and a target state corresponding to each target action, and the target actions include at least one of: a serpentine deformation action, a spherical deformation action, and a square or rectangular deformation action.
The execution module 1103 is configured to control the mimicry deformable robot to execute the walking operation of the target walking stage according to the target track, where the walking operation enables the mimicry deformable robot to avoid colliding with the obstacle indicated by the target obstacle information.
According to an embodiment of the disclosure, the obstacle avoidance apparatus may further include a second generation module and an update module.
The second generation module is used for generating a target accumulated reward according to the target track.
And the updating module is used for updating the target parameters in the obstacle avoidance model according to the target accumulated rewards.
Any number of the modules and units according to embodiments of the present disclosure, or at least part of their functionality, may be combined and implemented in one module, and any one of them may be split into multiple modules. Any one or more of the modules and units according to embodiments of the present disclosure may be implemented at least in part as a hardware circuit, such as a Field Programmable Gate Array (FPGA), a Programmable Logic Array (PLA), a system-on-chip, a system-on-substrate, a system-in-package, or an Application Specific Integrated Circuit (ASIC), in any other reasonable manner of hardware or firmware that integrates or packages a circuit, or in any one of, or any suitable combination of, software, hardware, and firmware. Alternatively, one or more of the modules and units according to embodiments of the present disclosure may be at least partially implemented as computer program modules which, when executed, perform the corresponding functions.
It should be noted that, in the embodiments of the present disclosure, the apparatus for training the obstacle avoidance model with the strategy-gradient-based proximal strategy corresponds to the method for training the obstacle avoidance model with the strategy-gradient-based proximal strategy, so the description of the apparatus refers to the description of the method and is not repeated here. Similarly, the obstacle avoidance apparatus of the mimicry deformable robot corresponds to the obstacle avoidance method of the mimicry deformable robot, so the description of the apparatus refers to the description of the method and is not repeated here.
Fig. 12 schematically shows a block diagram of an electronic device adapted to implement the method described above, according to an embodiment of the disclosure. The electronic device shown in fig. 12 is merely an example and should not be construed to limit the functionality and scope of use of the disclosed embodiments.
As shown in fig. 12, an electronic device 1200 according to an embodiment of the present disclosure may include a processor 1201, which may perform various appropriate actions and processes according to a program stored in a Read-Only Memory (ROM) 1202 or a program loaded from a storage section 1208 into a Random Access Memory (RAM) 1203. The processor 1201 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or an associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), or the like. The processor 1201 may also include on-board memory for caching purposes. The processor 1201 may include a single processing unit or multiple processing units for performing the different actions of the method flows according to embodiments of the disclosure.
In the RAM 1203, various programs and data required for the operation of the electronic apparatus 1200 are stored. The processor 1201, the ROM 1202, and the RAM 1203 are connected to each other through a bus 1204. The processor 1201 performs various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 1202 and/or RAM 1203. Note that the program may be stored in one or more memories other than the ROM 1202 and the RAM 1203. The processor 1201 may also perform various operations of the method flow according to embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the disclosure, the electronic device 1200 may also include an input/output (I/O) interface 1205, the input/output (I/O) interface 1205 also being connected to the bus 1204. The electronic device 1200 may also include one or more of the following components connected to the I/O interface 1205: an input section 1206 which may include a keyboard, a mouse, or the like; an output section 1207 which may include a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), a speaker, or the like; a storage section 1208 that may include a hard disk or the like; and a communication section 1209 that may include a network interface card such as a LAN card, a modem, or the like. The communication section 1209 performs communication processing via a network such as the Internet. The drive 1210 is also connected to the I/O interface 1205 as needed. A removable medium 1211, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is installed on the drive 1210 as needed, so that a computer program read out therefrom is installed into the storage section 1208 as needed.
According to embodiments of the present disclosure, the method flow according to embodiments of the present disclosure may be implemented as a computer software program. For example, embodiments of the present disclosure may include a computer program product, which may include a computer program embodied on a computer readable storage medium, the computer program containing program code for performing the methods shown in the flowcharts. In such an embodiment, the computer program can be downloaded and installed from a network via the communication portion 1209, and/or installed from the removable media 1211. The above-described functions defined in the system of the embodiments of the present disclosure are performed when the computer program is executed by the processor 1201. The systems, devices, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
The present disclosure also provides a computer-readable storage medium that may be embodied in the apparatus/device/system described in the above embodiments; or may exist alone without being assembled into the apparatus/device/system. The computer-readable storage medium carries one or more programs which, when executed, implement methods in accordance with embodiments of the present disclosure.
According to embodiments of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium. Examples may include, but are not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM) or flash memory, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this disclosure, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
Embodiments of the present disclosure may also include a computer program product including a computer program with program code for performing the methods provided by the embodiments of the present disclosure; when the computer program product runs on an electronic device, the program code causes the electronic device to implement the method for training an obstacle avoidance model with the strategy-gradient-based proximal strategy or the obstacle avoidance method of the mimicry deformable robot provided by the embodiments of the present disclosure.
The above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed when the computer program is executed by the processor 1201. The systems, apparatus, modules, units, etc. described above may be implemented by computer program modules according to embodiments of the disclosure.
In one embodiment, the computer program may be carried on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed over a network medium in the form of a signal, downloaded and installed through the communication section 1209, and/or installed from the removable medium 1211. The computer program may comprise program code that may be transmitted using any appropriate network medium, including, but not limited to: wireless, wired, or any suitable combination of the foregoing.
According to embodiments of the present disclosure, the program code of the computer programs provided by embodiments of the present disclosure may be written in any combination of one or more programming languages; in particular, such computer programs may be implemented in high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. The programming languages include, but are not limited to, Java, C++, Python, C, and the like. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on the remote computing device or server. Where a remote computing device is involved, it may be connected to the user's computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (for example, via the Internet using an Internet service provider).
The embodiments of the present disclosure are described above. These examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments are described above separately, this does not mean that the measures in the embodiments cannot be used advantageously in combination. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be made by those skilled in the art without departing from the scope of the disclosure, and such alternatives and modifications are intended to fall within the scope of the disclosure.
Claims (6)
1. A method for training an obstacle avoidance model based on strategy-gradient proximal strategy optimization, the obstacle avoidance model being applied to a mimicry deformable robot, the method comprising:
Generating a plurality of strategies according to environmental information for each training sample in a training sample set of each walking stage, wherein the training samples comprise environmental information comprising obstacle information acquired by using the mimicry deformable robot, each strategy comprises a track of walking of the mimicry deformable robot in an external environment, the track comprises a plurality of discrete actions and states corresponding to each action, the actions are generated by using a strategy function comprising initial parameters, and the actions comprise at least one of the following: a serpentine deformation action, a spherical deformation action, and a square or rectangular deformation action;
processing, for each strategy, the tracks by using a reward and punishment function to obtain accumulated rewards of the tracks, wherein the reward and punishment functions are determined according to rewards corresponding to the initial parameters, and the reward and punishment functions and the strategy functions of different walking stages of the mimicry deformable robot are different;
Determining expected rewards of the strategies according to a plurality of accumulated rewards corresponding to the strategies and probabilities corresponding to each track, wherein the probabilities represent the probabilities of the mimicry deformable robot selecting the track of the current strategy from the tracks corresponding to the strategies according to the current state;
Conducting derivation processing on the expected rewards to obtain a first derivative;
updating the initial parameters according to the first derivative based on a strategy gradient algorithm to obtain target parameters of a trained obstacle avoidance model;
Controlling the mimicry deformable robot to walk in the environmental information of the walking stage by using the trained obstacle avoidance model, and under the condition that the mimicry deformable robot collides with the obstacle information, iteratively training the obstacle avoidance model by using other training samples to obtain target parameters of a new trained obstacle avoidance model;
Determining the trained obstacle avoidance model as a target obstacle avoidance model of the walking stage under the condition that the mimicry deformable robot does not collide with the obstacle information;
Wherein the strategy gradient algorithm comprises a gradient ascent method;
wherein the updating, based on the strategy gradient algorithm, the initial parameters according to the first derivative to obtain the target parameters of the trained obstacle avoidance model comprises the following steps:
converting the first derivative by using a logarithmic function derivative formula to obtain a second derivative;
Determining a reward expected average value by utilizing a plurality of accumulated rewards corresponding to a plurality of training samples based on the gradient ascent method;
determining a third derivative from the bonus desired average and the second derivative;
Determining the target parameter according to the third derivative and the initial parameter;
Wherein, the reward and punishment function R is shown in formula (1):

$$R_\theta=\begin{cases}250, & d<D\\-150, & d>D\end{cases}\qquad(1)$$

wherein θ represents an initial parameter, R_θ represents the cumulative reward corresponding to the initial parameter, and k is a discount factor; d represents the punishment distance of the mimicry deformable robot from the obstacle, and D represents the minimum distance of the mimicry deformable robot from the obstacle;
The first derivative is shown in formula (2), the logarithmic function derivative formula is shown in formula (3), the second derivative is shown in formula (4), and the third derivative is shown in formula (5):

$$\nabla \bar{R}_\theta=\sum_{\tau}R(\tau)\,\nabla p_\theta(\tau)\qquad(2)$$

$$\nabla p_\theta(\tau)=p_\theta(\tau)\,\nabla\log p_\theta(\tau)\qquad(3)$$

$$\nabla \bar{R}_\theta=\sum_{\tau}R(\tau)\,p_\theta(\tau)\,\nabla\log p_\theta(\tau)=E_{\tau\sim p_\theta(\tau)}\big[R(\tau)\,\nabla\log p_\theta(\tau)\big]\qquad(4)$$

$$\nabla \bar{R}_\theta\approx\frac{1}{N}\sum_{n=1}^{N}R(\tau^{n})\,\nabla\log p_\theta(\tau^{n})=\frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_{n}}R(\tau^{n})\,\nabla\log p_\theta(a_{t}^{n}\mid s_{t}^{n})\qquad(5)$$

wherein τ represents a trajectory; R represents a cumulative reward; θ represents an initial parameter; p_θ represents a probability; $\bar{R}_\theta$ represents the reward expected average value; N represents the number of training samples; n represents the n-th training sample; a represents an action; s represents a state; t represents the t-th action or state; and T represents the total number of actions or states.
2. The method of claim 1, wherein the generating a plurality of policies from the environmental information comprises:
generating a plurality of current actions according to the environment information by utilizing the strategy function;
Processing the current actions and states by using a cost function for each of the plurality of current actions to obtain a current value corresponding to the current actions;
and processing the current action and the current value by utilizing an advantage function to generate a next action.
3. The method of claim 1, wherein the updating the initial parameters according to the first derivative results in target parameters of a trained obstacle avoidance model, comprising:
Converting the first derivative by using a logarithmic function derivative formula to obtain a fourth derivative;
determining a fifth derivative according to the fourth derivative and the average value of rewards;
Performing weight optimization on the fifth derivative to obtain a sixth derivative;
performing parameter replacement processing on the sixth derivative to obtain a seventh derivative;
determining the target parameter according to the seventh derivative and the initial parameter;
the updating the initial parameters according to the first derivative to obtain target parameters of the trained obstacle avoidance model includes:
Converting the first derivative by using a logarithmic function derivative formula to obtain an eighth derivative;
Performing distribution conversion processing on the eighth derivative to obtain a ninth derivative;
Obtaining a first gradient calculated value according to the advantage function and the ninth derivative;
obtaining a second gradient calculated value added with importance samples according to the ninth derivative and the first gradient calculated value;
Simplifying the second gradient calculated value to obtain a target gradient calculated value;
determining the target parameter according to the target gradient calculated value and the initial parameter;
Wherein the ninth derivative is shown in formula (6), the first gradient calculation value is shown in formula (7), the second gradient calculation value is shown in formula (8), and the target gradient calculation value is shown in formula (9):

$$\nabla \bar{R}_\theta=E_{\tau\sim p_\theta(\tau)}\big[R(\tau)\,\nabla\log p_\theta(\tau)\big]\qquad(6)$$

$$\nabla \bar{R}_\theta=E_{(s_t,a_t)\sim\pi_\theta}\big[A^{\theta}(s_t,a_t)\,\nabla\log p_\theta(a_t\mid s_t)\big]\qquad(7)$$

$$\nabla \bar{R}_\theta=E_{(s_t,a_t)\sim\pi_{\theta'}}\Big[\frac{p_\theta(s_t,a_t)}{p_{\theta'}(s_t,a_t)}\,A^{\theta'}(s_t,a_t)\,\nabla\log p_\theta(a_t\mid s_t)\Big]\qquad(8)$$

$$\nabla \bar{R}_\theta\approx E_{(s_t,a_t)\sim\pi_{\theta'}}\Big[\frac{p_\theta(a_t\mid s_t)}{p_{\theta'}(a_t\mid s_t)}\,A^{\theta'}(s_t,a_t)\,\nabla\log p_\theta(a_t\mid s_t)\Big]\qquad(9)$$

wherein p_θ represents the probability corresponding to the strategy π_θ, p_θ′ represents the probability corresponding to a transition strategy π_θ′ obtained by converting the strategy π_θ, A^θ′(s_t, a_t) represents the advantage function corresponding to the transition strategy, and E represents the expectation.
4. A method according to claim 3, further comprising:
determining an initial expected reward function according to the target gradient calculation value and the target parameter, wherein the initial expected reward function comprises a strategy distribution difference, the initial parameters and behavior strategy parameters;
clipping the strategy distribution difference to obtain a target expected reward function;
processing the behavior strategy parameters based on a gradient ascent method so as to determine a maximum expected reward value according to a track corresponding to a target strategy and the target expected reward function;
Processing the initial parameters by using a gradient descent method based on the maximum expected reward value to obtain transition initial parameters;
Determining the transition initial parameter as a target initial parameter under the condition that the mean square error between the transition rewards and the transition value is smaller than a preset threshold, wherein the transition rewards are determined according to the reward and punishment function, the transition initial parameter and a transition strategy, the transition value is determined according to a cost function, the transition initial parameter and the transition strategy, and the transition strategy is obtained by converting the strategy;
and determining a new reward and punishment function according to the target initial parameters and the reward and punishment function.
5. An obstacle avoidance method of a mimicry deformable robot, comprising:
For each target walking stage, acquiring target environment information comprising target obstacle information acquired by a plurality of ultrasonic sensors of the mimicry deformable robot, and establishing a state space by determining the distance between the mimicry deformable robot and the obstacle;
Processing the target environment information by using a trained target obstacle avoidance model, and outputting a target track of the target walking stage, wherein the target track comprises a plurality of discrete target actions walking in the target environment and a target state corresponding to each target action, and the target actions comprise at least one of the following: a serpentine deformation action, a spherical deformation action, and a square or rectangular deformation action, the target obstacle avoidance model being trained using the method of any one of claims 1 to 4;
And the mimicry deformable robot executes the walking operation of the target walking stage according to the target track, wherein the walking operation can avoid collision between the mimicry deformable robot and the target obstacle information.
6. The obstacle avoidance method of claim 5, further comprising, prior to outputting the target trajectory:
generating a target accumulated reward according to the target track;
and updating the target parameters in the obstacle avoidance model of the target walking stage according to the target accumulated reward.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211154605.7A CN115542901B (en) | 2022-09-21 | 2022-09-21 | Deformable robot obstacle avoidance method based on near-end strategy training |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115542901A CN115542901A (en) | 2022-12-30 |
CN115542901B true CN115542901B (en) | 2024-06-07 |
Family
ID=84729516
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211154605.7A Active CN115542901B (en) | 2022-09-21 | 2022-09-21 | Deformable robot obstacle avoidance method based on near-end strategy training |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115542901B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116578102B (en) * | 2023-07-13 | 2023-09-19 | 清华大学 | Obstacle avoidance method and device for autonomous underwater vehicle, computer equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019076044A1 (en) * | 2017-10-20 | 2019-04-25 | 纳恩博(北京)科技有限公司 | Mobile robot local motion planning method and apparatus and computer storage medium |
CN112882469A (en) * | 2021-01-14 | 2021-06-01 | 浙江大学 | Deep reinforcement learning obstacle avoidance navigation method integrating global training |
CN113485323A (en) * | 2021-06-11 | 2021-10-08 | 同济大学 | Flexible formation method for cascaded multiple mobile robots |
CN113821041A (en) * | 2021-10-09 | 2021-12-21 | 中山大学 | Multi-robot collaborative navigation and obstacle avoidance method |
CN114185339A (en) * | 2021-11-15 | 2022-03-15 | 哈尔滨工程大学 | Mobile robot path planning method in dynamic environment |
Non-Patent Citations (1)
Title |
---|
Dynamic path planning of mobile robots based on a neuro-fuzzy control system; Bao Fang; Pan Yonghui; Xu Wenbo; Computer Engineering and Applications; 2009-04-01 (No. 10); full text *
Also Published As
Publication number | Publication date |
---|---|
CN115542901A (en) | 2022-12-30 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant |