CN107450555A - Real-time gait planning method for a hexapod robot based on deep reinforcement learning - Google Patents
Real-time gait planning method for a hexapod robot based on deep reinforcement learning Download PDF Info
- Publication number
- CN107450555A (application CN201710763223.7A)
- Authority
- CN
- China
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0231—Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means
- G05D1/0246—Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means
- G05D1/0251—Control of position or course in two dimensions specially adapted to land vehicles using optical position detecting means using a video camera in combination with image processing means extracting 3D information from a plurality of images taken from different locations, e.g. stereo vision
-
- G—PHYSICS
- G05—CONTROLLING; REGULATING
- G05D—SYSTEMS FOR CONTROLLING OR REGULATING NON-ELECTRIC VARIABLES
- G05D1/00—Control of position, course, altitude or attitude of land, water, air or space vehicles, e.g. using automatic pilots
- G05D1/02—Control of position or course in two dimensions
- G05D1/021—Control of position or course in two dimensions specially adapted to land vehicles
- G05D1/0276—Control of position or course in two dimensions specially adapted to land vehicles using signals provided by a source external to the vehicle
Abstract
The invention provides a real-time gait planning method for a hexapod robot based on deep reinforcement learning, comprising the steps of: the hexapod robot obtains environmental road-condition information and formulates an overall motion trajectory; a camera captures photos of the environment, from which the position information of the target trajectory is calculated by binocular ranging, and the calculated trajectory information is used to navigate the robot's center-of-mass motion trajectory; within the swing space of the robot's foot tips, photos of the road-condition environment are taken, and data dimensionality reduction and feature extraction are performed on them by a pre-trained deep reinforcement learning network based on the deep deterministic policy gradient (DDPG); the control strategy of the hexapod robot is derived from the feature-extraction result, and the robot places its feet according to this strategy, achieving real-time walking. The gait planning method can plan in real time for complex unstructured road environments and is significant for improving the environmental adaptability of hexapod robots.
Description
Technical field
The present invention relates to a method of real-time gait planning for a hexapod robot, and in particular to a real-time gait planning method for a hexapod robot based on deep reinforcement learning.
Background technology
Robot technology is a highly integrated combination of materials science, mechanism theory, bionics, mechatronics, control technology, sensor technology, artificial intelligence and other disciplines, and is an important embodiment of a nation's industrial development level and scientific strength. A multi-legged bio-inspired robot that autonomously completes gait planning is a highly intelligent mobile robot, able to learn its external environment autonomously and complete its gait planning. Road environments are complex and varied, and the traditional pre-programmed gait planning methods of hexapod robots have significant limitations. To improve environmental adaptability, a hexapod robot must complete basic task functions such as overall mobile navigation, center-of-mass trajectory planning and foothold selection. By fusing satellite navigation with multi-sensor information, a multi-legged robot can perform machine learning (such as deep learning and reinforcement learning), in particular learning from experience of interaction with the external environment to improve its performance on a target, realizing functions of perception, decision-making and action. Research on hexapod robots has long attracted the attention of experts and scholars worldwide, but how to improve the locomotion ability of a hexapod robot in unstructured environments remains an open problem.
Summary of the invention
The technical problem to be solved by the present invention is that existing hexapod robot gait planning technology cannot adapt to complex terrain environments, long-distance autonomous walking, or situations where the final position is not fixed.
To solve the above technical problem, the invention provides a real-time gait planning method for a hexapod robot based on deep reinforcement learning, comprising the following steps:
Step 1: the hexapod robot obtains environmental road-condition information from a satellite map, and formulates an overall motion trajectory according to the environmental road-condition information;
Step 2: the hexapod robot obtains photos of the surrounding environment with a camera mounted on the fuselage, calculates the target position information of the motion trajectory from these photos by binocular ranging, and plans the robot's center-of-mass motion trajectory from that target position information;
Step 3: the hexapod robot moves along the center-of-mass motion trajectory; within the swing space of each leg's foot tip, the on-board camera photographs the road-condition environment, and a pre-trained DDPG-based deep reinforcement learning network performs data dimensionality reduction and feature extraction on the photos;
Step 4: the hexapod robot derives its control strategy from the dimensionality-reduction and feature-extraction results, and controls each joint drive mechanism according to this strategy to complete the joint degree-of-freedom motions, thereby achieving real-time gait-planned walking.
As a further limitation of the present invention, the concrete steps of calculating the real-time position information of the motion trajectory from the photos by binocular ranging in step 2 are:
Step 2.1: obtain the focal length f of the cameras, the center distance T_x of the left and right cameras, and the physical distances x_l and x_r from the projections of the target point on the motion trajectory in the left and right image planes to the left edge of the respective image plane. The left and right image planes are rectangular, lie on the same imaging plane, and the optical-center projections O_l and O_r of the two cameras lie at the centers of their respective image planes. The disparity d is then:
d = x_l − x_r (1)
Step 2.2: by the principle of similar triangles, establish the reprojection matrix Q:
Q = [ 1  0  0      −c_x
      0  1  0      −c_y
      0  0  0       f
      0  0  1/T_x  (c_x − c_x′)/T_x ] (2)
Q · [x y d 1]^T = [X Y Z W]^T (3)
In formulas (2) and (3), (X, Y, Z) are the coordinates of the target point in the three-dimensional coordinate system with the left camera's optical center as origin, W is the rotation-translation scale coefficient, (x, y) are the coordinates of the target point in the left image plane, c_x and c_y are respectively the offsets between the origins of the left and right image-plane coordinate systems and the three-dimensional coordinate system, and c_x′ is the corrected value of c_x;
Step 2.3: the spatial distance from the target point to the imaging plane is calculated as:
Z = f T_x / d (4)
Taking the optical-center position of the left camera as the robot's position, the coordinate position information (X, Y, Z) of the target point is used as the target position information of the motion trajectory.
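Steps 2.1 to 2.3 can be sketched numerically as follows. The focal length, baseline and principal point below are made-up example values (the patent specifies none), and c_x′ is taken as equal to c_x:

```python
# Sketch of the binocular-ranging reprojection of steps 2.1-2.3.
# All camera parameters here are illustrative assumptions.

def stereo_point(f, Tx, cx, cy, xl, xr, y):
    """Recover (X, Y, Z) of a target point from its left/right image
    x-coordinates using the Q-matrix reprojection (with cx' = cx)."""
    d = xl - xr          # disparity, formula (1)
    W = d / Tx           # homogeneous scale from the last row of Q
    X = (xl - cx) / W
    Y = (y - cy) / W
    Z = f / W            # equivalently Z = f * Tx / d, formula (4)
    return X, Y, Z

# Example: f = 700 px, baseline Tx = 0.12 m, principal point (320, 240),
# target seen at x = 340 px (left) and 326 px (right), y = 250 px.
X, Y, Z = stereo_point(700.0, 0.12, 320.0, 240.0, 340.0, 326.0, 250.0)
# Z ≈ 6.0 m for these values (700 * 0.12 / 14)
```

Larger disparities give smaller Z, matching the usual stereo intuition that nearby points shift more between the two views.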
As a further limitation of the present invention, the concrete steps of performing data dimensionality reduction and feature extraction on the road-condition photos with the pre-trained DDPG-based deep reinforcement learning network in step 3 are:
Step 3.1: the process in which the foot tip autonomously selects a foothold satisfies the Markov-property condition of reinforcement learning; the set of observations and actions up to time t is:
s_t = (x_1, a_1, …, a_{t−1}, x_t) = x_t (5)
In formula (5), x_t and a_t are respectively the observation at time t and the action taken;
Step 3.2: the expected return of the autonomous foothold-selection process is described by the action-value function:
Q^π(s_t, a_t) = E[R_t | s_t, a_t] (6)
In formula (6), R_t = Σ_{i=t}^{T} γ^{i−t} r(s_i, a_i) is the total discounted future return obtained from time t, γ ∈ [0, 1] is the discount factor, r(s_t, a_t) is the reward function at time t, T is the time at which the autonomous foothold selection ends, and π is the autonomous foothold-selection strategy;
Because the target strategy π of autonomous foothold selection is deterministically defined, it is denoted by a function μ: S → A, where S is the state space and A is the N-dimensional action space; applying the Bellman equation to formula (6) gives:
Q^μ(s_t, a_t) = E_{s_{t+1}∼E}[ r(s_t, a_t) + γ Q^μ(s_{t+1}, μ(s_{t+1})) ] (7)
where s_{t+1} ∼ E indicates that the observation at time t+1 is obtained from the environment E, and μ(s_{t+1}) is the action mapped from the observation at time t+1 by the function μ;
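The discounted return R_t that appears in the action-value description above can be computed by a single backward pass over the per-step rewards. A minimal sketch, with illustrative rewards:

```python
# R_t = sum_{i=t}^{T} gamma^(i-t) * r_i, accumulated backwards from
# the final step T, as used in the foothold-selection value function.

def discounted_return(rewards, gamma):
    R = 0.0
    for r in reversed(rewards):  # R <- r_i + gamma * R at each step
        R = r + gamma * R
    return R

# Example with made-up rewards and gamma = 0.9:
R0 = discounted_return([1.0, 0.0, 2.0], 0.9)  # 1 + 0.9*0 + 0.81*2 = 2.62
```

The backward accumulation is exactly the recursion that the Bellman equation exploits one step at a time.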
Step 3.3: using the principle of maximum-likelihood estimation, update the policy-evaluation network Q(s, a | θ^Q) with weight parameters θ^Q by minimizing the loss function:
L(θ^Q) = E_{μ′}[ (Q(s_t, a_t | θ^Q) − y_t)^2 ] (8)
In formula (8), y_t = r(s_t, a_t) + γ Q(s_{t+1}, μ(s_{t+1}) | θ^Q) is the target of the evaluation network, and μ′ is the target strategy;
Step 3.4: for the actual policy function μ(s | θ^μ) with parameters θ^μ, the gradient obtained by the chain rule is:
∇_{θ^μ} J ≈ E_{μ′}[ ∇_a Q(s, a | θ^Q)|_{s=s_t, a=μ(s_t)} · ∇_{θ^μ} μ(s | θ^μ)|_{s=s_t} ] (9)
The gradient computed by formula (9) is the policy gradient, which is then used to update the policy function μ(s | θ^μ);
Step 3.5: the network is trained with sample data; the off-policy algorithm draws samples from a common sample buffer to minimize the correlation between samples, while the neural network is trained against a target Q-value network, i.e. the target networks are updated using the experience-replay mechanism and the target Q-value network method, with the slow update strategy:
θ^{Q′} ← τ θ^Q + (1 − τ) θ^{Q′} (10)
θ^{μ′} ← τ θ^μ + (1 − τ) θ^{μ′} (11)
In formulas (10) and (11), τ is the update rate, with τ ≪ 1. This constructs the DDPG-based deep reinforcement learning network, which is a convergent neural network;
Step 3.6: perform data dimensionality reduction and feature extraction on the road-condition photos with the constructed deep reinforcement learning network.
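The three DDPG update rules of steps 3.3 and 3.5 reduce to short numerical operations once the network evaluations are given. The sketch below stubs the network outputs with toy scalars and vectors; it illustrates the arithmetic of formulas (8), (10) and (11), not the deep networks themselves:

```python
import numpy as np

# Numeric sketch of the DDPG critic target, critic loss and soft
# target-network update. Network evaluations are stubbed with toy
# values; in the patent these come from deep networks.

def td_target(r, q_next, gamma=0.99):
    # y_t = r(s_t, a_t) + gamma * Q(s_{t+1}, mu(s_{t+1}))
    return r + gamma * q_next

def critic_loss(q_pred, y):
    # L(theta_Q) = E[(Q(s_t, a_t) - y_t)^2], over a minibatch
    return float(np.mean((q_pred - y) ** 2))

def soft_update(target_params, params, tau=0.001):
    # theta' <- tau * theta + (1 - tau) * theta'   (tau << 1)
    return (1 - tau) * target_params + tau * params

y = td_target(r=1.0, q_next=2.0, gamma=0.9)            # 2.8
loss = critic_loss(np.array([2.5, 3.0]), y)            # ≈ 0.065
new_t = soft_update(np.zeros(3), np.ones(3), tau=0.1)  # 0.1 everywhere
```

Because τ ≪ 1, the target parameters trail the learned parameters slowly, which is what keeps the bootstrapped target y_t stable during training.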
As a further limitation of the present invention, the deep reinforcement learning network in step 3.6 consists of two image input layers, four convolutional layers, four fully connected layers and one output layer. The image input layers receive the images used for autonomous foothold selection; the convolutional layers extract image features, i.e. a deep representation of the two images; the fully connected layers and the output layer together form a deep network which, after training, outputs angle control commands for each joint from the input terrain features, i.e. it controls each joint drive mechanism of the hexapod robot's legs to complete the joint degree-of-freedom motions, thereby achieving real-time walking of the hexapod robot.
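The layer counts above are stated by the patent, but kernel sizes, strides and the input resolution are not; the following sketch does the shape bookkeeping for one assumed configuration of the four convolutional layers, which is the arithmetic needed to size the first fully connected layer:

```python
# Spatial-size bookkeeping through four convolutional layers.
# The 84x84 input and the (kernel, stride) pairs are illustrative
# assumptions; the patent does not specify them.

def conv_out(size, kernel, stride):
    # Output side length of a "valid" (no-padding) convolution.
    return (size - kernel) // stride + 1

size = 84
for kernel, stride in [(8, 4), (4, 2), (3, 1), (3, 1)]:  # four convs
    size = conv_out(size, kernel, stride)
# size is now the side length of the feature map entering the first
# fully connected layer: 84 -> 20 -> 9 -> 7 -> 5
```

With two such input streams, the flattened feature vectors are concatenated before the fully connected stack, so the first dense layer's input width doubles accordingly.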
The beneficial effects of the present invention are: (1) satellite navigation is applied to the planning of the hexapod robot's center-of-mass motion trajectory, enabling the robot to complete long-distance autonomous walking gait planning; (2) the position information of the motion trajectory is calculated by binocular ranging, and the calculated trajectory information is used for navigation of the center-of-mass motion trajectory, achieving short-range gait planning; (3) the dual image input layers effectively support planning of the motion trajectory and determination of changing target points, and random sampling together with the experience-replay mechanism determines the input images during pre-training, so the input images are both mutually independent and interrelated, satisfying the neural network's requirement that input data be mutually independent; (4) the target Q-value network technique continuously adjusts the weight matrices of the neural network, realizing data dimensionality reduction and promoting convergence; (5) the pre-trained DDPG-based deep reinforcement learning network performs data dimensionality reduction and terrain-feature extraction on the camera's road-condition photos and directly outputs the gait motion control strategy of the hexapod robot, effectively avoiding the "curse of dimensionality" and realizing real-time gait planning.
Brief description of the drawings
Fig. 1 is a schematic diagram of the system structure of the present invention;
Fig. 2 is a flow chart of the method of the present invention.
Embodiment
As shown in Fig. 1, the system that runs the real-time gait planning method of the present invention comprises a satellite navigation system, a machine vision and image processing system, a central control system and a basic motion system. The satellite navigation system consists mainly of satellite map GIS software on the hexapod robot; after a destination is input, it quickly completes path planning and transmits the path-planning information to the central control system. The image processing system consists mainly of a camera mounted at the front of the hexapod robot and matlab software on an industrial computer. The central control system consists mainly of a deep reinforcement learning network based on the deep deterministic policy gradient (DDPG), pre-trained on a dynamics simulation platform on the industrial computer, together with a communication module; the experience-replay and target Q-value network methods ensure that the DDPG-based deep reinforcement learning network converges during pre-training. The basic motion system consists of the hexapod robot's mechanical structure, drives and sensors; it executes the motion commands formulated by the central control system, drives the motors or cylinders of each leg joint, and completes the joint degree-of-freedom motions, thereby realizing real-time walking of the hexapod robot and feeding the motion information back to the central control system.
Within a certain distance of the hexapod robot's fuselage, photos of the environment are obtained by the camera mounted on the fuselage; the position information of the motion trajectory is then calculated from the photos by binocular ranging, and the calculated trajectory information is used for navigation of the robot's center-of-mass motion trajectory.
During pre-training of the neural network, the matlab software on the industrial computer first converts the RGB images of the environmental road-condition information into grayscale images; the experience-replay mechanism keeps the correlation between successive photos as small as possible to satisfy the neural network's requirement that input data be mutually independent, and random sampling is then used to obtain the images input to the network. Data dimensionality reduction is realized through deep learning, the weight matrices of the neural network are continuously adjusted by the target Q-value network technique, and a convergent neural network is finally obtained.
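The pre-training data handling described above combines two simple operations: grayscale conversion of the RGB frames and random minibatch sampling from a replay buffer to break temporal correlation. A minimal sketch (the luminance weights are the common ITU-R convention, an assumption here, since the patent only says matlab converts RGB to grayscale):

```python
import random

# Sketch of the pre-training data pipeline: grayscale conversion and
# decorrelation by random sampling from an experience buffer.

def to_gray(r, g, b):
    # Standard luminance-weighted grayscale (assumed coefficients).
    return 0.299 * r + 0.587 * g + 0.114 * b

def sample_minibatch(buffer, k, seed=0):
    # Draw k distinct transitions uniformly at random, so consecutive
    # frames rarely land in the same minibatch together.
    rng = random.Random(seed)  # seeded only for repeatability
    return rng.sample(buffer, k)

g = to_gray(255, 255, 255)                      # pure white -> 255.0
batch = sample_minibatch(list(range(100)), k=4)  # 4 distinct indices
```

Sampling without replacement across the whole buffer is what makes the inputs "mutually independent yet interrelated": each minibatch mixes frames from different times while all frames still come from the same environment.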
The hexapod robot moves according to the planned center-of-mass trajectory. Within the swing space of each leg's foot tip, the robot's camera again photographs the road-condition environment; the trained deep reinforcement learning network based on the deep deterministic policy gradient performs data dimensionality reduction and feature extraction and outputs the robot's real-time gait planning motion strategy; finally, the control strategy is sent through the communication system to the basic motion system to control the robot's motion, realizing real-time control of autonomous foothold selection along the motion trajectory.
In operation, the system proceeds as follows:
Step 1: start the satellite map GIS software on the hexapod robot, input the robot's destination to complete path planning, and transmit the path-planning information to the central control system;
Step 2: pre-train the deep reinforcement learning network based on the deep deterministic policy gradient (DDPG) on the dynamics simulation platform of the industrial computer, using the experience-replay mechanism and the target Q-value network method to ensure that the network converges quickly during pre-training;
Step 3: obtain images of the environmental road conditions within a certain distance of the fuselage with the camera mounted at the front of the robot, transmit the image information to the industrial computer through the communication module, calculate the position information of the motion trajectory from the photos by binocular ranging, and use the calculated trajectory information for navigation of the robot's center-of-mass motion trajectory;
Step 4: the hexapod robot moves according to the planned center-of-mass trajectory; within the swing space of each leg's foot tip, the robot's camera photographs the road-condition environment, and the pre-trained DDPG-based deep reinforcement learning network performs data dimensionality reduction and feature extraction on the acquired photos;
Step 5: derive the control strategy of the hexapod robot from the feature-extraction result, transmit the control information to the robot's basic motion system through the communication module, drive the motors or cylinders of each leg joint according to the control strategy, and complete the joint degree-of-freedom motions, thereby realizing real-time walking of the hexapod robot and feeding the motion information back to the central control system.
As shown in Fig. 2, the invention provides a real-time gait planning method for a hexapod robot based on deep reinforcement learning, comprising the following steps:
Step 1: the hexapod robot obtains environmental road-condition information from a satellite map, and formulates an overall motion trajectory according to the environmental road-condition information;
Step 2: the hexapod robot obtains photos of the surrounding environment with a camera mounted on the fuselage, calculates the target position information of the motion trajectory from these photos by binocular ranging, and plans the robot's center-of-mass motion trajectory from that target position information;
Step 3: the hexapod robot moves along the center-of-mass motion trajectory; within the swing space of each leg's foot tip, the on-board camera photographs the road-condition environment, and a pre-trained DDPG-based deep reinforcement learning network performs data dimensionality reduction and feature extraction on the photos;
Step 4: the hexapod robot derives its control strategy from the dimensionality-reduction and feature-extraction results, and controls each joint drive mechanism according to this strategy to complete the joint degree-of-freedom motions, thereby achieving real-time gait-planned walking.
As a further limitation of the present invention, the concrete steps of calculating the real-time position information of the motion trajectory from the photos by binocular ranging in step 2 are:
Step 2.1: obtain the focal length f of the cameras, the center distance T_x of the left and right cameras, and the physical distances x_l and x_r from the projections of the target point on the motion trajectory in the left and right image planes to the left edge of the respective image plane. The left and right image planes are rectangular, lie on the same imaging plane, and the optical-center projections O_l and O_r of the two cameras lie at the centers of their respective image planes. The disparity d is then:
d = x_l − x_r (1)
Step 2.2: by the principle of similar triangles, establish the reprojection matrix Q:
Q = [ 1  0  0      −c_x
      0  1  0      −c_y
      0  0  0       f
      0  0  1/T_x  (c_x − c_x′)/T_x ] (2)
Q · [x y d 1]^T = [X Y Z W]^T (3)
In formulas (2) and (3), (X, Y, Z) are the coordinates of the target point in the three-dimensional coordinate system with the left camera's optical center as origin, W is the rotation-translation scale coefficient, (x, y) are the coordinates of the target point in the left image plane, c_x and c_y are respectively the offsets between the origins of the left and right image-plane coordinate systems and the three-dimensional coordinate system, and c_x′ is the corrected value of c_x; the two values are generally close, and for convenience the present invention treats them as approximately equal;
Step 2.3: the spatial distance from the target point to the imaging plane is calculated as:
Z = f T_x / d (4)
Taking the optical-center position of the left camera as the robot's position, the coordinate position information (X, Y, Z) of the target point is used as the target position information of the motion trajectory.
As a further limitation of the present invention, the concrete steps of performing data dimensionality reduction and feature extraction on the road-condition photos with the pre-trained DDPG-based deep reinforcement learning network in step 3 are:
Step 3.1: the process in which the foot tip autonomously selects a foothold satisfies the Markov-property condition of reinforcement learning; the set of observations and actions up to time t is:
s_t = (x_1, a_1, …, a_{t−1}, x_t) = x_t (5)
In formula (5), x_t and a_t are respectively the observation at time t and the action taken;
Step 3.2: the expected return of the autonomous foothold-selection process is described by the action-value function:
Q^π(s_t, a_t) = E[R_t | s_t, a_t] (6)
In formula (6), R_t = Σ_{i=t}^{T} γ^{i−t} r(s_i, a_i) is the total discounted future return obtained from time t, γ ∈ [0, 1] is the discount factor, r(s_t, a_t) is the reward function at time t, T is the time at which the autonomous foothold selection ends, and π is the autonomous foothold-selection strategy;
Because the target strategy π of autonomous foothold selection is deterministically defined, it is denoted by a function μ: S → A, where S is the state space and A is the N-dimensional action space; applying the Bellman equation to formula (6) gives:
Q^μ(s_t, a_t) = E_{s_{t+1}∼E}[ r(s_t, a_t) + γ Q^μ(s_{t+1}, μ(s_{t+1})) ] (7)
where s_{t+1} ∼ E indicates that the observation at time t+1 is obtained from the environment E, and μ(s_{t+1}) is the action mapped from the observation at time t+1 by the function μ;
Step 3.3: using the principle of maximum-likelihood estimation, update the policy-evaluation network Q(s, a | θ^Q) with weight parameters θ^Q by minimizing the loss function:
L(θ^Q) = E_{μ′}[ (Q(s_t, a_t | θ^Q) − y_t)^2 ] (8)
In formula (8), y_t = r(s_t, a_t) + γ Q(s_{t+1}, μ(s_{t+1}) | θ^Q) is the target of the evaluation network, and μ′ is the target strategy;
Step 3.4: for the actual policy function μ(s | θ^μ) with parameters θ^μ, the gradient obtained by the chain rule is:
∇_{θ^μ} J ≈ E_{μ′}[ ∇_a Q(s, a | θ^Q)|_{s=s_t, a=μ(s_t)} · ∇_{θ^μ} μ(s | θ^μ)|_{s=s_t} ] (9)
The gradient computed by formula (9) is the policy gradient, which is then used to update the policy function μ(s | θ^μ);
Step 3.5: the network is trained with sample data; the off-policy algorithm draws samples from a common sample buffer to minimize the correlation between samples, while the neural network is trained against a target Q-value network, i.e. the target networks are updated using the experience-replay mechanism and the target Q-value network method, with the slow update strategy:
θ^{Q′} ← τ θ^Q + (1 − τ) θ^{Q′} (10)
θ^{μ′} ← τ θ^μ + (1 − τ) θ^{μ′} (11)
In formulas (10) and (11), τ is the update rate, with τ ≪ 1. This constructs the DDPG-based deep reinforcement learning network, which is a convergent neural network;
Step 3.6: perform data dimensionality reduction and feature extraction on the road-condition photos with the constructed deep reinforcement learning network.
As a further limitation of the present invention, the deep reinforcement learning network in step 3.6 consists of two image input layers, four convolutional layers, four fully connected layers and one output layer. There are two image input layers because both the motion trajectory and the current target point must be determined; four convolutional and four fully connected layers are used so that image-feature extraction is effective while the network still converges quickly during training. The image input layers receive the images used for autonomous foothold selection; the convolutional layers extract image features, i.e. a deep representation of the two images, such as points, lines and arcs; the fully connected layers and the output layer together form a deep network which, after training, outputs angle control commands for each joint from the input terrain features, i.e. it controls each joint drive mechanism of the legs to complete the joint degree-of-freedom motions, thereby realizing real-time walking of the hexapod robot.
The present invention applies satellite navigation to the planning of the hexapod robot's center-of-mass motion trajectory, enabling the robot to complete long-distance autonomous walking gait planning. The position information of the motion trajectory is calculated by binocular ranging, and the calculated trajectory information is used for navigation of the center-of-mass motion trajectory, realizing short-range gait planning. The dual image input layers effectively support planning of the motion trajectory and determination of changing target points; random sampling and the experience-replay mechanism determine the input images during pre-training, so the input images are both mutually independent and interrelated, satisfying the neural network's requirement for mutually independent input data. The target Q-value network technique continuously adjusts the weight matrices of the neural network, realizing data dimensionality reduction and promoting convergence. The pre-trained DDPG-based deep reinforcement learning network performs data dimensionality reduction and terrain-feature extraction on the camera's road-condition photos and directly outputs the robot's gait motion control strategy, effectively avoiding the "curse of dimensionality" and realizing real-time gait planning of the hexapod robot.
Claims (4)
1. A real-time gait planning method for a hexapod robot based on deep reinforcement learning, characterized by comprising the following steps:
Step 1: the hexapod robot obtains environmental road-condition information from a satellite map and formulates an overall motion trajectory according to that information;
Step 2: the hexapod robot obtains photographs of the surrounding environment using cameras mounted on its body, calculates the target position information of the motion trajectory from those photographs by binocular ranging, and plans the robot's center-of-mass motion trajectory from the overall motion trajectory and the target position information;
Step 3: the hexapod robot moves along the planned center-of-mass motion trajectory; within the swing range of each leg's foot end, the on-body cameras photograph the road-condition environment, and the pre-trained DDPG-based deep reinforcement learning network performs data dimensionality reduction and feature extraction on the road-condition photographs;
Step 4: from the dimensionality-reduction and feature-extraction results the hexapod robot derives its control strategy, and according to that strategy each joint drive mechanism of the hexapod robot is controlled to complete its joint-degree-of-freedom motion, thereby realizing real-time gait-planned walking.
2. The real-time gait planning method for a hexapod robot based on deep reinforcement learning according to claim 1, characterized in that calculating the position information of the motion trajectory from the photographs by binocular ranging in Step 2 comprises the following specific steps:
Step 2.1: obtain the focal length f of the cameras, the center-to-center distance Tx of the left and right cameras, and the physical distances xl and xr from the projections of the target point (a point on the motion trajectory in the road conditions) on the image planes of the left and right cameras to the left edge of the respective image plane; the image planes of the left and right cameras are rectangular, lie in the same imaging plane, and the optical-center projections of the two cameras lie at the centers of the corresponding image planes, i.e. Ol and Or are the projection points on the imaging plane; the parallax d is then:
d = xl - xr    (1)
Step 2.2: using the principle of similar triangles, the matrix Q is established as:
$$
Q=\begin{bmatrix}
1 & 0 & 0 & -c_x \\
0 & 1 & 0 & -c_y \\
0 & 0 & 0 & f \\
0 & 0 & -\dfrac{1}{T_x} & \dfrac{c_x-c_x'}{T_x}
\end{bmatrix}
\qquad (2)
$$
$$
Q\begin{bmatrix}x\\ y\\ d\\ 1\end{bmatrix}
=\begin{bmatrix}x-c_x\\ y-c_y\\ f\\ \dfrac{-d+c_x-c_x'}{T_x}\end{bmatrix}
=\begin{bmatrix}X\\ Y\\ Z\\ W\end{bmatrix}
\qquad (3)
$$
In formulas (2) and (3), (X, Y, Z) are the coordinates of the target point in the three-dimensional coordinate system whose origin is the optical center of the left camera, and W is the scale factor of the rotation-translation conversion; (x, y) are the coordinates of the target point in the left image plane; cx and cy are the offsets between the origins of the left/right image-plane coordinate systems and the three-dimensional coordinate system; cx' is the corrected value of cx;
Step 2.3: calculate the spatial distance from the target point to the imaging plane:
$$
Z=\dfrac{-T_x f}{d-(c_x-c_x')}
\qquad (4)
$$
Taking the optical-center position of the left camera as the robot's own position, the coordinate position information (X, Y, Z) of the target point is used as the target position information of the motion trajectory.
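As an illustrative sketch (not part of the claims), the reprojection of formulas (2)-(4) can be carried out numerically. The camera parameters and image coordinates below are hypothetical values chosen only for the example; sign conventions follow formula (4) as written.

```python
# Sketch of the binocular reprojection in formulas (2)-(4); the camera
# parameters (f, Tx, cx, cy, cx') below are hypothetical example values.

def reproject(x, y, d, f, Tx, cx, cy, cxp):
    """Map an image point (x, y) with disparity d to 3-D coordinates.

    Applies Q [x y d 1]^T = [X Y Z W]^T from formula (3) and divides by
    the scale factor W, so Z matches formula (4):
    Z = -Tx*f / (d - (cx - cxp)).
    """
    W = (-d + cx - cxp) / Tx          # fourth row of Q applied to [x y d 1]
    return ((x - cx) / W, (y - cy) / W, f / W)

# Hypothetical stereo rig: f = 700 px, baseline term Tx = -60 mm,
# principal point (320, 240), no correction (cx' = cx).
X, Y, Z = reproject(x=390, y=275, d=35, f=700, Tx=-60,
                    cx=320, cy=240, cxp=320)
print(X, Y, Z)  # target point expressed in the left-camera frame
```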
3. The real-time gait planning method for a hexapod robot based on deep reinforcement learning according to claim 1 or 2, characterized in that the data dimensionality reduction and feature extraction performed on the road-condition photographs in Step 3 by the pre-trained DDPG-based deep reinforcement learning network comprises the following specific steps:
Step 3.1: the process by which the foot end of a target leg autonomously selects its foothold satisfies the Markov property required by reinforcement learning; the set of observations and actions before time t is:
st = (x1, a1, ..., at-1, xt) = xt    (5)
In formula (5), xt and at are respectively the observation at time t and the action taken;
Step 3.2: the strategy-value function is used to describe the expected return of the autonomous foothold-selection process of the foot end:
Qπ(st, at) = E[Rt | st, at]    (6)
In formula (6), $R_t=\sum_{i=t}^{T}\gamma^{\,i-t}r(s_i,a_i)$ is the discounted sum of future rewards obtained from time t onward, γ ∈ [0,1] is the discount factor, r(st, at) is the reward function at time t, T is the time at which the autonomous foothold selection of the foot end terminates, and π is the autonomous foothold-selection strategy of the foot end;
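The discounted return Rt of formula (6) can be sketched directly; the reward sequence and discount factor below are illustrative values, not taken from the patent.

```python
# Illustrative computation of the discounted return from formula (6):
# R_t = sum_{i=t}^{T} gamma^(i-t) * r(s_i, a_i).

def discounted_return(rewards, gamma):
    """Sum the rewards r(s_t, a_t), ..., r(s_T, a_T) with discount gamma."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Hypothetical per-step rewards from time t to termination time T:
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```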
Because the target strategy π for autonomous foothold selection of the foot end is deterministic, it is written as a function μ: S → A, where S is the state space and A is the N-dimensional action space; applying the Bellman equation to formula (6) gives:
$$
Q^{\mu}(s_t,a_t)=E_{s_{t+1}\sim E}\bigl[r(s_t,a_t)+\gamma Q^{\mu}\bigl(s_{t+1},\mu(s_{t+1})\bigr)\bigr]
\qquad (7)
$$
where st+1 ~ E indicates that the observation at time t+1 is obtained from the environment E, and μ(st+1) is the action to which the observation at time t+1 is mapped by the function μ;
Step 3.3: using the principle of maximum-likelihood estimation, the policy-evaluation network Q(s, a | θ^Q) with network weight parameters θ^Q is updated by minimizing the loss function:
L(θ^Q) = E_μ'[(Q(st, at | θ^Q) - yt)^2]    (8)
In formula (8), yt = r(st, at) + γQ(st+1, μ(st+1) | θ^Q) is the target of the policy-evaluation network, and μ' is the target strategy;
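A minimal sketch of the critic loss of formula (8) over a sampled minibatch, assuming the Q-values and rewards below are hypothetical numbers and the target yt = r + γ·Q(st+1, μ(st+1)) is computed per sample:

```python
# Sketch of the loss of formula (8): y_t = r(s_t, a_t) + gamma * Q_next
# and L = mean((Q(s_t, a_t) - y_t)^2). All numbers are illustrative.

def critic_loss(q_values, rewards, next_q_values, gamma):
    """Mean squared error between Q(s_t, a_t) and the target y_t."""
    targets = [r + gamma * nq for r, nq in zip(rewards, next_q_values)]
    return sum((q - y) ** 2 for q, y in zip(q_values, targets)) / len(q_values)

loss = critic_loss(q_values=[1.0, 2.0], rewards=[0.5, 0.5],
                   next_q_values=[1.0, 1.0], gamma=0.9)
print(loss)  # targets are 1.4 each -> ((1-1.4)^2 + (2-1.4)^2) / 2 ≈ 0.26
```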
Step 3.4: for the actual strategy function μ(s | θ^μ) with parameters θ^μ, the gradient obtained by the chain rule is:
$$
\begin{aligned}
\nabla_{\theta^{\mu}}\mu
&\approx E_{\mu'}\Bigl[\nabla_{\theta^{\mu}} Q(s,a\mid\theta^{Q})\big|_{s=s_t,\,a=\mu(s_t\mid\theta^{\mu})}\Bigr] \\
&= E_{\mu'}\Bigl[\nabla_{a} Q(s,a\mid\theta^{Q})\big|_{s=s_t,\,a=\mu(s_t)}\;\nabla_{\theta^{\mu}}\mu(s\mid\theta^{\mu})\big|_{s=s_t}\Bigr]
\end{aligned}
\qquad (9)
$$
The gradient computed by formula (9) is the policy gradient, which is then used to update the strategy function μ(s | θ^μ);
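The chain rule of formula (9) can be checked on a one-dimensional toy case; the choices of Q, μ and the numbers below are hypothetical, used only to verify the product form ∇_a Q · ∇_θ μ.

```python
# One-dimensional check of the chain rule in formula (9), with toy choices
# Q(s, a) = -(a - s)^2 and mu(s|theta) = theta * s, so
# d/dtheta Q(s, mu(s|theta)) = grad_a Q * grad_theta mu.

def policy_gradient(s, theta):
    a = theta * s                 # action chosen by the strategy function
    dq_da = -2.0 * (a - s)        # grad_a Q(s, a) evaluated at a = mu(s|theta)
    dmu_dtheta = s                # grad_theta mu(s|theta)
    return dq_da * dmu_dtheta     # product form on the right of formula (9)

# Analytically, Q(theta) = -(theta*s - s)^2 has derivative
# -2*(theta*s - s)*s, which at s = 2, theta = 0.5 equals 4.
print(policy_gradient(s=2.0, theta=0.5))  # 4.0
```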
Step 3.5: the network is trained with sample data; the off-policy algorithm used in training draws samples from the same sample buffer in order to minimize the correlation between samples, and at the same time a target Q-value network is used to train the neural network, i.e. the target networks are updated by means of the experience-replay mechanism and the target Q-value network method, with the slow update strategy:
θ^Q' ← τθ^Q + (1-τ)θ^Q'    (10)
θ^μ' ← τθ^μ + (1-τ)θ^μ'    (11)
In formulas (10) and (11), τ is the update rate, with τ << 1; a DDPG-based deep reinforcement learning network is thus constructed, and the neural network is convergent;
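The slow ("soft") target updates of formulas (10)-(11) amount to moving each target weight a small step toward the corresponding online weight; the weight vectors and τ below are illustrative.

```python
# Sketch of the slow target-network update of formulas (10)-(11):
# theta' <- tau * theta + (1 - tau) * theta', with tau << 1.

def soft_update(target_weights, online_weights, tau):
    """Move each target weight a fraction tau toward the online weight."""
    return [tau * w + (1.0 - tau) * wp
            for w, wp in zip(online_weights, target_weights)]

target = [0.0, 1.0]   # illustrative target-network weights theta'
online = [1.0, 3.0]   # illustrative online-network weights theta
target = soft_update(target, online, tau=0.01)
print(target)         # each target weight moves 1% toward the online weight
```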
Step 3.6: perform data dimensionality reduction and feature extraction on the road-condition photographs using the constructed deep reinforcement learning network.
4. The real-time gait planning method for a hexapod robot based on deep reinforcement learning according to claim 3, characterized in that the deep reinforcement learning network in Step 3.6 consists of two image input layers, four convolutional layers, four fully connected layers and one output layer; the image input layers are used to input the images used by the foot ends for autonomous foothold selection; the convolutional layers extract image features, i.e. the deep-level representation of the two images; the fully connected layers and the output layer are combined to form a deep network which, after training is complete, outputs the angle control command of each joint from the input terrain feature information, controlling each joint drive mechanism of the hexapod robot's legs to complete its joint-degree-of-freedom motion, thereby realizing real-time walking of the hexapod robot.
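Claim 4 fixes only the layer counts (two image inputs, four convolutional layers, four fully connected layers, one output layer); the input size, kernel sizes and strides below are hypothetical, used only to illustrate how the feature-map size shrinks through four convolutional layers:

```python
# Feature-map size bookkeeping through four convolutional layers.
# The claim specifies only layer counts; input size, kernels and strides
# here are hypothetical illustration values.

def conv_out(size, kernel, stride):
    """Output edge length of a valid (unpadded) convolution."""
    return (size - kernel) // stride + 1

size = 84                                    # hypothetical input image edge
for kernel, stride in [(8, 4), (4, 2), (3, 1), (3, 1)]:
    size = conv_out(size, kernel, stride)
    print(size)                              # 20, 9, 7, 5
```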
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710763223.7A CN107450555A (en) | 2017-08-30 | 2017-08-30 | A kind of Hexapod Robot real-time gait planing method based on deeply study |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710763223.7A CN107450555A (en) | 2017-08-30 | 2017-08-30 | A kind of Hexapod Robot real-time gait planing method based on deeply study |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107450555A true CN107450555A (en) | 2017-12-08 |
Family
ID=60493631
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710763223.7A Pending CN107450555A (en) | 2017-08-30 | 2017-08-30 | A kind of Hexapod Robot real-time gait planing method based on deeply study |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107450555A (en) |
Cited By (27)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108161934A (en) * | 2017-12-25 | 2018-06-15 | 清华大学 | A kind of method for learning to realize robot multi peg-in-hole using deeply |
CN108321795A (en) * | 2018-01-19 | 2018-07-24 | 上海交通大学 | Start-stop of generator set configuration method based on depth deterministic policy algorithm and system |
CN108536011A (en) * | 2018-03-19 | 2018-09-14 | 中山大学 | A kind of Hexapod Robot complicated landform adaptive motion control method based on deeply study |
CN108549928A (en) * | 2018-03-19 | 2018-09-18 | 清华大学 | Visual tracking method and device based on continuous moving under deeply learning guide |
CN109116854A (en) * | 2018-09-16 | 2019-01-01 | 南京大学 | A kind of robot cooperated control method of multiple groups based on intensified learning and control system |
CN109242099A (en) * | 2018-08-07 | 2019-01-18 | 中国科学院深圳先进技术研究院 | Training method, device, training equipment and the storage medium of intensified learning network |
CN109483530A (en) * | 2018-10-18 | 2019-03-19 | 北京控制工程研究所 | A kind of legged type robot motion control method and system based on deeply study |
CN109521774A (en) * | 2018-12-27 | 2019-03-26 | 南京芊玥机器人科技有限公司 | A kind of spray robot track optimizing method based on intensified learning |
CN109855616A (en) * | 2019-01-16 | 2019-06-07 | 电子科技大学 | A kind of multiple sensor robot air navigation aid based on virtual environment and intensified learning |
CN109871011A (en) * | 2019-01-15 | 2019-06-11 | 哈尔滨工业大学(深圳) | A kind of robot navigation method based on pretreatment layer and deeply study |
CN110307848A (en) * | 2019-07-04 | 2019-10-08 | 南京大学 | A kind of Mobile Robotics Navigation method |
CN110442129A (en) * | 2019-07-26 | 2019-11-12 | 中南大学 | A kind of control method and system that multiple agent is formed into columns |
CN110618678A (en) * | 2018-06-19 | 2019-12-27 | 辉达公司 | Behavioral guided path planning in autonomous machine applications |
CN110764415A (en) * | 2019-10-31 | 2020-02-07 | 清华大学深圳国际研究生院 | Gait planning method for leg movement of quadruped robot |
CN110861084A (en) * | 2019-11-18 | 2020-03-06 | 东南大学 | Four-legged robot falling self-resetting control method based on deep reinforcement learning |
CN110908384A (en) * | 2019-12-05 | 2020-03-24 | 中山大学 | Formation navigation method for distributed multi-robot collaborative unknown random maze |
CN111487864A (en) * | 2020-05-14 | 2020-08-04 | 山东师范大学 | Robot path navigation method and system based on deep reinforcement learning |
CN111667513A (en) * | 2020-06-01 | 2020-09-15 | 西北工业大学 | Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning |
CN111796514A (en) * | 2019-04-09 | 2020-10-20 | 罗伯特·博世有限公司 | Controlling and monitoring a physical system based on a trained bayesian neural network |
CN112161630A (en) * | 2020-10-12 | 2021-01-01 | 北京化工大学 | AGV (automatic guided vehicle) online collision-free path planning method suitable for large-scale storage system |
CN112684794A (en) * | 2020-12-07 | 2021-04-20 | 杭州未名信科科技有限公司 | Foot type robot motion control method, device and medium based on meta reinforcement learning |
CN112859851A (en) * | 2021-01-08 | 2021-05-28 | 广州视源电子科技股份有限公司 | Multi-legged robot control system and multi-legged robot |
CN113110459A (en) * | 2021-04-20 | 2021-07-13 | 上海交通大学 | Motion planning method for multi-legged robot |
CN113406957A (en) * | 2021-05-19 | 2021-09-17 | 成都理工大学 | Mobile robot autonomous navigation method based on immune deep reinforcement learning |
WO2022048472A1 (en) * | 2020-09-07 | 2022-03-10 | 腾讯科技(深圳)有限公司 | Legged robot movement control method, apparatus and device, and medium |
CN115542913A (en) * | 2022-10-05 | 2022-12-30 | 哈尔滨理工大学 | Hexapod robot fault-tolerant free gait planning method based on geometric and physical feature map |
CN116151359A (en) * | 2022-11-29 | 2023-05-23 | 哈尔滨理工大学 | Deep neural network-based layered training method for six-foot robot driver decision model |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004028757A1 (en) * | 2002-09-26 | 2004-04-08 | National Institute Of Advanced Industrial Science And Technology | Walking gait producing device for walking robot |
CN106094516A (en) * | 2016-06-08 | 2016-11-09 | 南京大学 | A kind of robot self-adapting grasping method based on deeply study |
CN106094813A (en) * | 2016-05-26 | 2016-11-09 | 华南理工大学 | It is correlated with based on model humanoid robot gait's control method of intensified learning |
CN106444780A (en) * | 2016-11-10 | 2017-02-22 | 速感科技(北京)有限公司 | Robot autonomous navigation method and system based on vision positioning algorithm |
- 2017-08-30: CN201710763223.7A patent application filed (published as CN107450555A), status Pending
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2004028757A1 (en) * | 2002-09-26 | 2004-04-08 | National Institute Of Advanced Industrial Science And Technology | Walking gait producing device for walking robot |
CN106094813A (en) * | 2016-05-26 | 2016-11-09 | 华南理工大学 | It is correlated with based on model humanoid robot gait's control method of intensified learning |
CN106094516A (en) * | 2016-06-08 | 2016-11-09 | 南京大学 | A kind of robot self-adapting grasping method based on deeply study |
CN106444780A (en) * | 2016-11-10 | 2017-02-22 | 速感科技(北京)有限公司 | Robot autonomous navigation method and system based on vision positioning algorithm |
Non-Patent Citations (3)
Title |
---|
CHANGJIU ZHOU 等: ""Reinforcement Learning with Fuzzy Evaluative Feedback for a Biped Robot"", 《PROCEEDINGS OF THE 2000 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS & AUTOMATION》 * |
TANG KAIQIANG ET AL.: ""GAIT PLANNING OF A HEXAPOD ROBOT BASED ON REINFORCEMENT LEARNING UNDER CONSTRAINTS"", 《PROCEEDINGS OF THE 18TH CHINA ANNUAL CONFERENCE ON SYSTEM SIMULATION TECHNOLOGY AND ITS APPLICATIONS》 * |
GUO ZUHUA ET AL.: ""MOTION PLANNING ALGORITHM FOR A HEXAPOD ROBOT BASED ON GLOBAL TRAJECTORY"", 《JOURNAL OF SYSTEM SIMULATION》 * |
Cited By (42)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108161934B (en) * | 2017-12-25 | 2020-06-09 | 清华大学 | Method for realizing robot multi-axis hole assembly by utilizing deep reinforcement learning |
CN108161934A (en) * | 2017-12-25 | 2018-06-15 | 清华大学 | A kind of method for learning to realize robot multi peg-in-hole using deeply |
CN108321795A (en) * | 2018-01-19 | 2018-07-24 | 上海交通大学 | Start-stop of generator set configuration method based on depth deterministic policy algorithm and system |
CN108321795B (en) * | 2018-01-19 | 2021-01-22 | 上海交通大学 | Generator set start-stop configuration method and system based on deep certainty strategy algorithm |
CN108536011A (en) * | 2018-03-19 | 2018-09-14 | 中山大学 | A kind of Hexapod Robot complicated landform adaptive motion control method based on deeply study |
CN108549928A (en) * | 2018-03-19 | 2018-09-18 | 清华大学 | Visual tracking method and device based on continuous moving under deeply learning guide |
CN108549928B (en) * | 2018-03-19 | 2020-09-25 | 清华大学 | Continuous movement-based visual tracking method and device under deep reinforcement learning guidance |
US11966838B2 (en) | 2018-06-19 | 2024-04-23 | Nvidia Corporation | Behavior-guided path planning in autonomous machine applications |
CN110618678A (en) * | 2018-06-19 | 2019-12-27 | 辉达公司 | Behavioral guided path planning in autonomous machine applications |
CN109242099A (en) * | 2018-08-07 | 2019-01-18 | 中国科学院深圳先进技术研究院 | Training method, device, training equipment and the storage medium of intensified learning network |
CN109242099B (en) * | 2018-08-07 | 2020-11-10 | 中国科学院深圳先进技术研究院 | Training method and device of reinforcement learning network, training equipment and storage medium |
CN109116854A (en) * | 2018-09-16 | 2019-01-01 | 南京大学 | A kind of robot cooperated control method of multiple groups based on intensified learning and control system |
CN109483530A (en) * | 2018-10-18 | 2019-03-19 | 北京控制工程研究所 | A kind of legged type robot motion control method and system based on deeply study |
CN109521774B (en) * | 2018-12-27 | 2023-04-07 | 南京芊玥机器人科技有限公司 | Spraying robot track optimization method based on reinforcement learning |
CN109521774A (en) * | 2018-12-27 | 2019-03-26 | 南京芊玥机器人科技有限公司 | A kind of spray robot track optimizing method based on intensified learning |
CN109871011A (en) * | 2019-01-15 | 2019-06-11 | 哈尔滨工业大学(深圳) | A kind of robot navigation method based on pretreatment layer and deeply study |
CN109855616A (en) * | 2019-01-16 | 2019-06-07 | 电子科技大学 | A kind of multiple sensor robot air navigation aid based on virtual environment and intensified learning |
CN111796514A (en) * | 2019-04-09 | 2020-10-20 | 罗伯特·博世有限公司 | Controlling and monitoring a physical system based on a trained bayesian neural network |
CN110307848A (en) * | 2019-07-04 | 2019-10-08 | 南京大学 | A kind of Mobile Robotics Navigation method |
CN110442129B (en) * | 2019-07-26 | 2021-10-22 | 中南大学 | Control method and system for multi-agent formation |
CN110442129A (en) * | 2019-07-26 | 2019-11-12 | 中南大学 | A kind of control method and system that multiple agent is formed into columns |
CN110764415A (en) * | 2019-10-31 | 2020-02-07 | 清华大学深圳国际研究生院 | Gait planning method for leg movement of quadruped robot |
CN110764415B (en) * | 2019-10-31 | 2022-04-15 | 清华大学深圳国际研究生院 | Gait planning method for leg movement of quadruped robot |
CN110861084A (en) * | 2019-11-18 | 2020-03-06 | 东南大学 | Four-legged robot falling self-resetting control method based on deep reinforcement learning |
CN110861084B (en) * | 2019-11-18 | 2022-04-05 | 东南大学 | Four-legged robot falling self-resetting control method based on deep reinforcement learning |
CN110908384A (en) * | 2019-12-05 | 2020-03-24 | 中山大学 | Formation navigation method for distributed multi-robot collaborative unknown random maze |
CN110908384B (en) * | 2019-12-05 | 2022-09-23 | 中山大学 | Formation navigation method for distributed multi-robot collaborative unknown random maze |
CN111487864A (en) * | 2020-05-14 | 2020-08-04 | 山东师范大学 | Robot path navigation method and system based on deep reinforcement learning |
CN111667513B (en) * | 2020-06-01 | 2022-02-18 | 西北工业大学 | Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning |
CN111667513A (en) * | 2020-06-01 | 2020-09-15 | 西北工业大学 | Unmanned aerial vehicle maneuvering target tracking method based on DDPG transfer learning |
WO2022048472A1 (en) * | 2020-09-07 | 2022-03-10 | 腾讯科技(深圳)有限公司 | Legged robot movement control method, apparatus and device, and medium |
CN112161630A (en) * | 2020-10-12 | 2021-01-01 | 北京化工大学 | AGV (automatic guided vehicle) online collision-free path planning method suitable for large-scale storage system |
CN112684794A (en) * | 2020-12-07 | 2021-04-20 | 杭州未名信科科技有限公司 | Foot type robot motion control method, device and medium based on meta reinforcement learning |
CN112859851A (en) * | 2021-01-08 | 2021-05-28 | 广州视源电子科技股份有限公司 | Multi-legged robot control system and multi-legged robot |
CN112859851B (en) * | 2021-01-08 | 2023-02-21 | 广州视源电子科技股份有限公司 | Multi-legged robot control system and multi-legged robot |
CN113110459A (en) * | 2021-04-20 | 2021-07-13 | 上海交通大学 | Motion planning method for multi-legged robot |
CN113406957A (en) * | 2021-05-19 | 2021-09-17 | 成都理工大学 | Mobile robot autonomous navigation method based on immune deep reinforcement learning |
CN113406957B (en) * | 2021-05-19 | 2022-07-08 | 成都理工大学 | Mobile robot autonomous navigation method based on immune deep reinforcement learning |
CN115542913A (en) * | 2022-10-05 | 2022-12-30 | 哈尔滨理工大学 | Hexapod robot fault-tolerant free gait planning method based on geometric and physical feature map |
CN115542913B (en) * | 2022-10-05 | 2023-09-12 | 哈尔滨理工大学 | Six-foot robot fault-tolerant free gait planning method based on geometric and physical feature map |
CN116151359B (en) * | 2022-11-29 | 2023-09-29 | 哈尔滨理工大学 | Deep neural network-based layered training method for six-foot robot driver decision model |
CN116151359A (en) * | 2022-11-29 | 2023-05-23 | 哈尔滨理工大学 | Deep neural network-based layered training method for six-foot robot driver decision model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107450555A (en) | A kind of Hexapod Robot real-time gait planing method based on deeply study | |
CN106444780B (en) | A kind of autonomous navigation method and system of the robot of view-based access control model location algorithm | |
JP7082416B2 (en) | Real-time 3D that expresses the real world Two-way real-time 3D interactive operation of real-time 3D virtual objects in a virtual world | |
Chen et al. | Stabilization approaches for reinforcement learning-based end-to-end autonomous driving | |
CN108227735B (en) | Method, computer readable medium and system for self-stabilization based on visual flight | |
CN110666793B (en) | Method for realizing robot square part assembly based on deep reinforcement learning | |
CN107562052A (en) | A kind of Hexapod Robot gait planning method based on deeply study | |
CN106094516A (en) | A kind of robot self-adapting grasping method based on deeply study | |
EP3547267A1 (en) | Robot control system, machine control system, robot control method, machine control method, and recording medium | |
Zhou et al. | A deep Q-network (DQN) based path planning method for mobile robots | |
CN106648116A (en) | Virtual reality integrated system based on action capture | |
CN108780325A (en) | System and method for adjusting unmanned vehicle track | |
CN105027030A (en) | Wireless wrist computing and control device and method for 3d imaging, mapping, networking and interfacing | |
CN206497423U (en) | A kind of virtual reality integrated system with inertia action trap setting | |
CN106078752A (en) | Method is imitated in a kind of anthropomorphic robot human body behavior based on Kinect | |
CN107085422A (en) | A kind of tele-control system of the multi-functional Hexapod Robot based on Xtion equipment | |
CN101610877A (en) | The method and apparatus that is used for sense of touch control | |
Jain et al. | From pixels to legs: Hierarchical learning of quadruped locomotion | |
CN113076615B (en) | High-robustness mechanical arm operation method and system based on antagonistic deep reinforcement learning | |
Felbrich et al. | Autonomous robotic additive manufacturing through distributed model‐free deep reinforcement learning in computational design environments | |
CN103991077A (en) | Robot hand controller shared control method based on force fusion | |
WO2018198909A1 (en) | Information processing device, information processing method, and program | |
Mahmoudi et al. | MRL team description paper for humanoid KidSize league of RoboCup 2019 | |
Murhij et al. | Hand gestures recognition model for Augmented reality robotic applications | |
Yoo et al. | Recent progress and development of the humanoid robot HanSaRam |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20171208 |
|