CN112494282B - Exoskeleton main assistance parameter optimization method based on deep reinforcement learning

Info

Publication number: CN112494282B
Application number: CN202011383180.8A
Authority: CN (China)
Other versions: CN112494282A (Chinese)
Inventors: 孙磊, 陈鑫, 董恩增, 佟吉刚, 李云飞, 曾德添, 龚欣翔, 李成辉
Assignee (current and original): Tianjin University of Technology
Application filed by Tianjin University of Technology; publication of CN112494282A; application granted; publication of CN112494282B
Legal status: Active

Classifications

    • A61H 3/00: Appliances for aiding patients or disabled persons to walk about
    • G06N 3/045: Neural network architectures; combinations of networks
    • G06N 3/08: Learning methods
    • G06Q 10/04: Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G16H 20/30: ICT specially adapted for therapies or health-improving plans relating to physical therapies or activities, e.g. physiotherapy, acupressure or exercising
    • A61H 2201/165: Wearable interfaces
    • A61H 2201/1659: Free spatial automatic movement of interface within a working area, e.g. robot
    • A61H 2201/5058: Sensors or detectors
    • A61H 2201/5069: Angle sensors


Abstract

The invention discloses a deep-reinforcement-learning-based method for optimizing the main assistance parameter of an exoskeleton. A compound sinusoidal exoskeleton assistance-curve equation determines the main assistance parameter, and the deep deterministic policy gradient method from deep reinforcement learning addresses the continuity-control problem of the flexible exoskeleton. A policy network and an evaluation network are built, and the hip-joint flexion angle of the exoskeleton wearer is collected and processed in real time to generate a parameter-training data set, with which the main assistance parameter is trained and optimized, realizing adaptive optimization of the exoskeleton's main assistance parameter.

Description

Exoskeleton main assistance parameter optimization method based on deep reinforcement learning
(I) Technical field:
The invention relates to the technical field of robots, and in particular to an exoskeleton main assistance parameter optimization method based on deep reinforcement learning.
(II) Background art:
Traditional lower-limb rehabilitation training is guided by a professional physician and completed with the assistance of nurses or family members; this approach is time-consuming, low in efficacy, and labor-intensive. To reduce the manpower burden and provide efficient rehabilitation services, gait-rehabilitation flexible exoskeletons have been widely applied.
The gait-rehabilitation flexible exoskeleton combines intelligent-robot technology with rehabilitation medicine; it can stand in for a professional physician and help a patient complete lower-limb rehabilitation training. Its appearance offers a new option for the rehabilitation of patients with lower-limb dysfunction and compensates for shortcomings in their clinical treatment.
During treatment with a gait-rehabilitation flexible exoskeleton, the patient's lower limb is fixed to the exoskeleton with flexible straps. The exoskeleton drives the patient's lower limbs through the prescribed rehabilitation training actions and stimulates the neural control system of the lower-limb joints and muscles, thereby restoring the patient's lower-limb motor function. The exoskeleton's target users require it to have good comfort and adaptability, so that it provides a better rehabilitation experience and suits lower-limb dysfunction patients of different populations. Optimizing the exoskeleton's assistance parameters, here through deep reinforcement learning, is therefore one of the core technologies behind the comfort and reliability of gait-rehabilitation flexible exoskeletons.
In the rehabilitation of patients with lower-limb dysfunction, a series of continuous training actions must be performed. Because patients' lower-limb conditions differ, the assistance must be accurate: if the assistance is too small, the action ends before the patient's leg reaches the specified posture, and the training effect is poor; if the assistance is too large, the patient's leg is overstretched, easily causing secondary and unnecessary injury.
In the conventional PID (proportional-integral-derivative) control method of assistance, the control quantity is computed from the system error as a linear combination of the proportional, integral, and derivative terms. Although the PID algorithm is widely used thanks to its simple principle and easy parameter tuning, it can produce unexpected results in some cases: if the desired value differs too much from the actual value, the motor is driven at an excessive speed to reach the desired value, often causing overshoot and oscillation, which is quite dangerous for a gait-rehabilitation flexible exoskeleton. In addition, this method cannot optimize the assistance parameters, so efficiency is low and the parameters retain large errors.
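For reference, a minimal discrete-time sketch of the conventional PID scheme described above; the gains, time step, and loop structure are illustrative assumptions, not taken from the patent:

```python
# Minimal discrete-time PID controller: the control amount is a linear
# combination of the proportional, integral, and derivative terms of the
# system error. Gains kp/ki/kd and time step dt are illustrative.

class PID:
    def __init__(self, kp: float, ki: float, kd: float, dt: float):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def step(self, desired: float, actual: float) -> float:
        error = desired - actual
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# A large gap between desired and actual values produces a large control
# output at once, which is what causes the overshoot and oscillation
# noted above.
pid = PID(kp=2.0, ki=0.1, kd=0.05, dt=0.01)
u = pid.step(desired=30.0, actual=5.0)  # large error -> large output
```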
To address the defects of the prior art, an exoskeleton main assistance parameter optimization method based on deep reinforcement learning is needed to solve the continuous-assistance problem of the gait-rehabilitation flexible exoskeleton.
(III) Summary of the invention:
The invention aims to provide an exoskeleton main assistance parameter optimization method based on deep reinforcement learning. The method overcomes the defects of the prior art; it is a parameter optimization method that is simple in principle and easy to implement, and it copes with the continuous-assistance problem of the gait-rehabilitation flexible exoskeleton, thereby effectively solving the exoskeleton's personalized-matching problem.
The technical scheme of the invention is as follows: an exoskeleton main assistance parameter optimization method based on deep reinforcement learning, characterized by comprising the following steps:
(1) Determine the optimization parameters:
Determine the optimization parameters from the exoskeleton assistance-curve equation, which takes the compound sinusoidal form shown in equation (1):
[Equation (1): compound sinusoidal assistance curve F_assist(A, t*, T_b, α); the original equation image is not recoverable from this extraction]
where F_assist is the real-time assistance force, A is the swing-phase assistance amplitude, t* is the time elapsed from the assistance start time to the current time, T_b is the swing phase period of the current gait cycle, and α is the exoskeleton main assistance parameter, which serves as the waveform control parameter of equation (1) and shifts the position of the assistance peak; its value ranges from -1 to 1;
The swing-phase assistance amplitude A is determined by the rated output of the assistance component; under rated operation of the assistance component it is a known value and can be set manually.
The swing phase period T_b of the current gait cycle is obtained as follows. An MEMS (Micro-Electro-Mechanical System) attitude sensor collects the wearer's hip-joint flexion angle during walking to obtain the flexion-angle curve of the wearer's hip joint, and the swing phase period of the next gait is computed as the average of the previous three swing phase periods in that curve; this average is used as the swing phase period of the current gait cycle. The swing phase period of the current gait cycle is therefore a known value, obtained from equation (2):

T_b(k) = [T_b(k-1) + T_b(k-2) + T_b(k-3)] / 3    (2)

The swing phase period T_b of the current gait cycle is specifically calculated as follows:
MEMS attitude sensors are placed at the middle of the rear of the left and right thighs of the flexible-exoskeleton wearer, and the hip-joint flexion angle during normal walking is collected in real time to obtain the flexion-angle curve of the wearer's hip joint. The peak time is denoted t_peak and the trough time t_trough, and the hip-joint flexion angles corresponding to the peaks and troughs are recorded. The current gait cycle, equation (3), and the swing phase period of that gait cycle, equation (4), can then be calculated as:

T(k) = t_trough(k) - t_trough(k-1)    (3)
T_b(k) = t_peak(k) - t_trough(k)    (4)

Equation (3) shows that the current gait cycle T is calculated from the values of two adjacent trough points; equation (4) shows that the swing phase period of the gait cycle is calculated from the values of adjacent peak and trough points;
Correspondingly, the maximum hip-joint flexion angle θ_max(k) and the minimum hip-joint flexion angle θ_min(k) for the current gait cycle can be obtained; they are used for the exoskeleton state at the initial time in step (5) and the exoskeleton state at the next time in step (8).
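A minimal sketch of the gait-timing computations in equations (2) to (4), assuming the peak and trough times have already been extracted from the flexion-angle curve (all names and the sample values are illustrative):

```python
# Gait timing from hip-joint flexion-angle peak/trough times.
# troughs[k], peaks[k]: times (s) of the k-th trough and peak.

def gait_cycle(troughs: list[float], k: int) -> float:
    # Equation (3): current gait cycle from two adjacent trough points.
    return troughs[k] - troughs[k - 1]

def swing_phase_period(peaks: list[float], troughs: list[float], k: int) -> float:
    # Equation (4): swing phase period from adjacent peak and trough points.
    return peaks[k] - troughs[k]

def current_swing_phase_period(tb_history: list[float]) -> float:
    # Equation (2): average of the previous three swing phase periods,
    # used as the swing phase period of the current gait cycle.
    return sum(tb_history[-3:]) / 3.0

troughs = [0.0, 1.1, 2.2, 3.3]
peaks = [0.4, 1.5, 2.6, 3.7]
tb = [swing_phase_period(peaks, troughs, k) for k in range(1, 4)]
print(gait_cycle(troughs, 3), current_swing_phase_period(tb))
```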
The method for acquiring the flexion angle parameter curve of the hip joint of the wearer in the step (1) comprises the following steps:
(1-1) An MEMS attitude sensor acquires the hip-joint flexion angle signal of the flexible-exoskeleton wearer, converts it into a digital signal, and transmits it to a single-chip microcomputer, which forwards it to a PC (Personal Computer);
In step (1-1), the single-chip microcomputer transmits the data to the PC over a wireless link, through serial communication and a Bluetooth module.
(1-2) The hip-joint flexion angle signal is read through a serial interface in MATLAB installed on the PC, and a real-time curve of the flexion angle is drawn with the plot function.
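The patent reads the sensor stream with MATLAB's serial interface and plot function; the following Python sketch with pyserial and matplotlib is an assumed equivalent for illustration only (port name, baud rate, and one-angle-per-line format are assumptions):

```python
import serial                     # pyserial
import matplotlib.pyplot as plt

# Read hip-joint flexion angles forwarded by the single-chip
# microcomputer over the (Bluetooth) serial link.
ser = serial.Serial(port="COM3", baudrate=115200, timeout=1.0)

angles: list[float] = []
for _ in range(500):
    line = ser.readline().decode(errors="ignore").strip()
    if line:
        angles.append(float(line))  # assumed: one angle (deg) per line
ser.close()

plt.plot(angles)                  # curve of the flexion-angle parameter
plt.xlabel("sample")
plt.ylabel("hip flexion angle (deg)")
plt.show()
```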
(2) Setting parameters:
Set the walking time interval of the exoskeleton wearer to τ = 5-7 s for each walk; the interval may be increased appropriately, provided the wearer can walk at least 3 steps, so that the swing phase period of the current gait cycle can be acquired and the wearer can stand stably at the end of each walking interval; the assistance condition is judged anew after each advance. Preset the maximum number of episodes E, the batch sample size N, and the maximum number of time rounds per episode T_max;
Setting the maximum number of episodes E in step (2) means setting the number of convergences of the exoskeleton main assistance parameter α optimized with the deep reinforcement learning method: one episode corresponds to one convergence of the parameter. Setting the maximum number of time rounds per episode T_max sets the number of rounds performed in each episode, each round corresponding to one time interval: each convergence of the exoskeleton main assistance parameter α requires at most T_max rounds, and each round requires the exoskeleton wearer to walk for a time interval τ. A time index t is recorded at the start of each round: the first round starts at time t = 1, and so on, with the T_max-th round starting at time t = T_max.
(3) Design the standard configuration of the deep deterministic policy gradient method (Deep Deterministic Policy Gradient, DDPG), specifically the design of the policy network and the evaluation network. The policy network comprises an online policy network μ(s|α^μ) and a target policy network μ'(s|α^μ'); the evaluation network comprises an online evaluation network Q(s,a|α^Q) and a target evaluation network Q'(s,a|α^Q');
The design of the policy network and the evaluation network with the deep deterministic policy gradient method in step (3) specifically comprises the following steps:
(3-1) Initialize the online policy network μ(s|α^μ) and the online evaluation network Q(s,a|α^Q);
(3-2) Construct the target policy network μ'(s|α^μ') corresponding to the online policy network μ(s|α^μ) and the target evaluation network Q'(s,a|α^Q') corresponding to the online evaluation network Q(s,a|α^Q), and copy the parameters of the online policy network and the online evaluation network to the respective target network parameters, i.e. α^μ' ← α^μ and α^Q' ← α^Q. Here the exoskeleton main assistance parameter α is the parameter to be optimized with the deep reinforcement learning method, s denotes the exoskeleton state, and a denotes the exoskeleton action. Initialize the experience replay pool R.
The exoskeleton state s in step (3-2) comprises the swing-phase assistance amplitude A, the current gait cycle T, the swing phase period T_b of the current gait cycle, the hip-joint flexion angle θ of the exoskeleton wearer, the maximum hip-joint flexion angle θ_max in the current gait cycle, and the minimum hip-joint flexion angle θ_min in the current gait cycle. The exoskeleton action a is the assistance amount of the exoskeleton; the assistance direction is always positive, i.e. vertically upward.
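A sketch of the state and action structures defined in step (3-2); the field and type names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class ExoState:
    A: float          # swing-phase assistance amplitude
    T: float          # current gait cycle (s)
    T_b: float        # swing phase period of the current gait cycle (s)
    theta: float      # wearer's hip-joint flexion angle
    theta_max: float  # max hip flexion angle in the current gait cycle
    theta_min: float  # min hip flexion angle in the current gait cycle

    def as_vector(self) -> list[float]:
        return [self.A, self.T, self.T_b,
                self.theta, self.theta_max, self.theta_min]

# The action a is a single scalar: the exoskeleton's assistance amount,
# always positive (directed vertically upward).
```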
(4) Enumerate the episode number e from 1 to E, i.e. perform E convergences of the exoskeleton main assistance parameter α; the exoskeleton state at the initial time is obtained at the start of each episode;
(5) Acquiring an initial state:
When each episode in step (4) begins, the exoskeleton wearer must walk normally without assistance for a time interval τ, and the resulting exoskeleton state is taken as the exoskeleton state s_1 at the initial time t = 1. It specifically comprises the initial swing-phase assistance amplitude A_1, the initial hip-joint flexion angle θ_1 of the exoskeleton wearer, the initial gait cycle T_1, the swing phase period T_b1 of the initial gait cycle, the maximum hip-joint flexion angle θ_max,1 in the initial gait cycle, and the minimum hip-joint flexion angle θ_min,1 in the initial gait cycle;
The exoskeleton state s_1 at the initial time in step (5) is obtained through the following steps:
(5-1) Let the exoskeleton wearer walk normally for τ without assistance; MEMS attitude sensors placed at the middle of the rear of the wearer's left and right thighs collect the hip-joint flexion angle in real time, and the flexion angle at the end of the walk is taken as the initial hip-joint flexion angle θ_1 of the exoskeleton wearer;
(5-2) Collect the hip-joint flexion angle during unassisted normal walking in real time and obtain the flexion-angle curve of the wearer's hip joint through steps (1-1) and (1-2); denote the peak time t_peak and the trough time t_trough, and record the hip-joint flexion angles corresponding to the peaks and troughs;
(5-3) The last trough time occurring before the end of the unassisted walking interval τ minus the previous trough time is taken as the initial gait cycle T_1;
(5-4) The last trough time occurring before the end of the walking interval τ minus the peak time preceding that trough is taken as swing phase period I of the initial gait cycle, denoted T_b1,1;
(5-5) The second-to-last trough time minus the peak time preceding it is taken as swing phase period II of the initial gait cycle, denoted T_b1,2;
(5-6) The third-to-last trough time minus the peak time preceding it is taken as swing phase period III of the initial gait cycle, denoted T_b1,3;
(5-7) Average the three swing phase periods obtained in steps (5-4), (5-5), and (5-6) to obtain the swing phase period of the next gait cycle, which is taken as the swing phase period of the initial gait cycle, i.e.:

T_b1 = (T_b1,1 + T_b1,2 + T_b1,3) / 3    (5)

(5-8) The hip-joint flexion angle at the last trough time is taken as the minimum hip-joint flexion angle θ_min,1 in the initial gait cycle, and the hip-joint flexion angle at the last peak time as the maximum hip-joint flexion angle θ_max,1;
(5-9) The initial swing-phase assistance amplitude A_1 equals the manually set swing-phase assistance amplitude A;
(6) Enumerate the time rounds from 1 to T_max, recording the time t at the start of each round. Enumerating the time rounds means performing steps (7) to (10) T_max times within each episode, so that in each episode the exoskeleton performs T_max actions selected by the online policy network; this generates a sufficient data set for parameter training and improves the reliability of the training result. T_max is usually large enough for the optimized parameter to converge.
(7) The online policy network selects the action of the exoskeleton at time t according to:

a_t = μ(s_t|α^μ) + Noise    (6)

The Noise term widens the range of values, so that the actions the exoskeleton can select at time t cover a larger range;
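A sketch of the action selection in equation (6). The patent does not specify the noise distribution, so zero-mean Gaussian exploration noise is an assumption here (Ornstein-Uhlenbeck noise is another common DDPG choice), as is clamping the action to keep it positive:

```python
import torch

def select_action(actor, s, noise_std: float = 0.1):
    # a_t = mu(s_t | alpha_mu) + Noise   (equation (6))
    with torch.no_grad():
        a = actor(torch.as_tensor(s, dtype=torch.float32))
    a = a + noise_std * torch.randn_like(a)  # exploration noise widens the range
    return a.clamp(min=0.0)  # assistance amount is always positive (assumed clamp)
```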
(8) The exoskeleton performs the action selected in step (7); the exoskeleton wearer walks under this action for a time interval τ, which yields the scalar reward r_t fed back by the flexible exoskeleton and the exoskeleton state s_{t+1} at the next time.
The scalar reward r_t fed back by the flexible exoskeleton in step (8) has the specific form:

[Equation (7): reward as a function of the walking ratio W and the target W_tv; the original equation image is not recoverable from this extraction]

where W is the walking ratio and W_tv is the preset walking ratio of healthy elderly people.
The walking ratio in step (8) is defined as the ratio of the step length to the step frequency, in the specific form of equation (8):

[Equation (8): walking ratio W computed from D_{t+1}, N, and T_{t+1}; the original equation image is not recoverable from this extraction]

where D_{t+1} is the step length at the next time in m, N is the step frequency in steps/s, and T_{t+1} is the gait cycle at the next time in s;
The step length at the next time can be obtained from:

D_{t+1} = l(θ_max,t+1 - θ_min,t+1)    (9)

where l is the leg length of the flexible-exoskeleton wearer, θ_max,t+1 is the maximum hip-joint flexion angle in the gait cycle at the next time, and θ_min,t+1 is the minimum hip-joint flexion angle in the gait cycle at the next time.
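A sketch of the step-length computation of equation (9) together with the walking ratio and reward. The exact forms of equations (7) and (8) are not recoverable from this extraction, so the walking ratio below is computed directly as step length over step frequency and the reward as the negative deviation from the target ratio W_tv; both forms, and all numeric values, are assumptions:

```python
import math

def step_length(l: float, theta_max: float, theta_min: float) -> float:
    # Equation (9): D_{t+1} = l * (theta_max - theta_min), angles in radians.
    return l * (theta_max - theta_min)

def walking_ratio(D: float, N: float) -> float:
    # Assumed form of equation (8): walking ratio = step length / step frequency.
    return D / N

def reward(W: float, W_tv: float) -> float:
    # Assumed form of equation (7): penalize deviation of the wearer's
    # walking ratio from the target ratio of healthy elderly people.
    return -abs(W - W_tv)

l = 0.45                            # wearer's leg length in m (illustrative)
D = step_length(l, math.radians(30.0), math.radians(-10.0))
N = 2 * 60.0 / 1.1                  # steps/min (assumed unit, so that W falls
                                    # in the 0.0044-0.0055 range cited below)
print(reward(walking_ratio(D, N), W_tv=0.005))
```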
The exoskeleton state s_{t+1} at the next time in step (8) comprises the swing-phase assistance amplitude A_{t+1} at the next time, the hip-joint flexion angle θ_{t+1} of the exoskeleton wearer at the next time, the gait cycle T_{t+1} at the next time, the swing phase period T_b,t+1 of the gait cycle at the next time, the maximum hip-joint flexion angle θ_max,t+1 in the gait cycle at the next time, and the minimum hip-joint flexion angle θ_min,t+1 in the gait cycle at the next time. The exoskeleton state s_{t+1} at the next time is obtained through the following steps:
(8-1) The exoskeleton performs the action selected in step (7) while the wearer walks for the interval τ; the MEMS attitude sensor collects the wearer's hip-joint flexion angle in real time, and the flexion angle at the moment walking ends is taken as the hip-joint flexion angle θ_{t+1} at the next time;
(8-2) Collect the hip-joint flexion angle of the exoskeleton wearer during walking in real time and obtain the flexion-angle curve through steps (1-1) and (1-2); denote the peak time t_peak and the trough time t_trough, and record the hip-joint flexion angles corresponding to the peaks and troughs;
(8-3) The last trough time occurring before the end of the walking interval τ minus the previous trough time is taken as the gait cycle T_{t+1} at the next time. Meanwhile, the last trough time minus the peak time preceding it is taken as swing phase period I of the gait cycle at the next time, denoted T_b,t+1,1; the second-to-last trough time minus the peak time preceding it as swing phase period II, denoted T_b,t+1,2; and the third-to-last trough time minus the peak time preceding it as swing phase period III, denoted T_b,t+1,3. Averaging the three swing phase periods, as shown in equation (10), gives the swing phase period of the next gait cycle, which is taken as the swing phase period of the gait cycle at the next time:

T_b,t+1 = (T_b,t+1,1 + T_b,t+1,2 + T_b,t+1,3) / 3    (10)

(8-4) The hip-joint flexion angle at the last peak time is taken as the maximum hip-joint flexion angle θ_max,t+1 in the gait cycle at the next time, and the hip-joint flexion angle at the last trough time as the minimum hip-joint flexion angle θ_min,t+1.
(8-5) The swing-phase assistance amplitude A_{t+1} at the next time equals the manually set swing-phase assistance amplitude A;
(9) State transition process:
Store the exoskeleton state s_t at time t, the exoskeleton action a_t at time t obtained in step (7), the exoskeleton state s_{t+1} at the time after t obtained in step (8), and the scalar reward r_t fed back by the flexible exoskeleton in the experience replay pool R as a training data set for parameter training;
The exoskeleton state s_t at time t is obtained as follows: the state at time t is the same as the next-time state obtained by executing step (8) in time round t-1 of the current episode.
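A minimal sketch of the experience replay pool R and the state-transition storage of step (9); the container design is illustrative:

```python
import random
from collections import deque

class ReplayPool:
    def __init__(self, capacity: int = 100_000):
        self.buffer = deque(maxlen=capacity)

    def store(self, s, a, r, s_next):
        # One state-transition process (s_t, a_t, r_t, s_{t+1}).
        self.buffer.append((s, a, r, s_next))

    def sample(self, n: int):
        # Step (10): randomly sample N transitions as batch training data.
        return random.sample(self.buffer, n)
```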
(10) Randomly sample N of the state transitions stored in step (9) as batch training data for parameter training;
The parameter training in step (10) specifically comprises the following steps:
(10-1) Compute the loss of the online evaluation network, defined in mean squared error (Mean Squared Error, MSE) form as shown in equation (11), and use it to update the online evaluation network parameters:

L(α^Q) = (1/N) Σ_{i=1}^{N} [y_i - Q(s_i, a_i|α^Q)]²    (11)

where L(α^Q) is the loss value of the online evaluation network, used for training and optimization; Q(s_i, a_i|α^Q) is the evaluation value (Q value) of the online evaluation network, whose inputs are the exoskeleton state and action of the i-th state transition; and y_i is the target of the Q value, namely:

y_i = r_i + γQ'(s_{i+1}, μ'(s_{i+1}|α^μ')|α^Q')    (12)

where r_i is the scalar reward of the i-th state transition, s_{i+1} is the next exoskeleton state of the i-th state transition, and γ ∈ [0,1] is the discount factor. The term Q'(s_{i+1}, μ'(s_{i+1}|α^μ')|α^Q') is a nesting of two functions: the outer function Q' is the Q function generated by the target evaluation network, whose inputs are the next exoskeleton state and action of the i-th state transition; the next exoskeleton action is generated by the inner function μ'(s_{i+1}|α^μ'), the target policy network, whose input is the next exoskeleton state of the i-th state transition;
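A PyTorch sketch of the critic update in equations (11) and (12); the network classes, signatures, and hyperparameters are assumptions:

```python
import torch
import torch.nn.functional as F

def critic_update(batch, actor_t, critic, critic_t, critic_opt, gamma=0.99):
    s, a, r, s_next = batch  # tensors sampled from the replay pool

    with torch.no_grad():
        # Inner function: target policy network mu'(s_{i+1} | alpha_mu').
        a_next = actor_t(s_next)
        # Outer function: target evaluation network Q'(s_{i+1}, a_next | alpha_Q').
        y = r + gamma * critic_t(s_next, a_next)  # equation (12)

    q = critic(s, a)          # online Q value Q(s_i, a_i | alpha_Q)
    loss = F.mse_loss(q, y)   # equation (11), MSE form
    critic_opt.zero_grad()
    loss.backward()
    critic_opt.step()
```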
(10-2) Update the online policy network parameters as shown in equation (13):

∇_{α^μ} J ≈ (1/N) Σ_{i=1}^{N} ∇_a Q(s, a|α^Q)|_{s=s_i, a=μ(s_i)} ∇_{α^μ} μ(s|α^μ)|_{s=s_i}    (13)

where ∇_{α^μ} J is the gradient value with respect to the online policy network parameters; ∇_a Q(s, a|α^Q) is the gradient of the online evaluation network's Q value with respect to the action a, where the action is generated by the online policy network μ(s_i|α^μ); and ∇_{α^μ} μ(s|α^μ) is the gradient of the online policy network with respect to its parameters. In equation (13) the two gradients are in a multiplication relationship;
(10-3) Update the target policy network parameters and the target evaluation network parameters as shown in equation (14):

α^μ' ← σα^μ + (1 - σ)α^μ'
α^Q' ← σα^Q + (1 - σ)α^Q'    (14)

where α^μ' are the target policy network parameters, α^μ the online policy network parameters, α^Q' the target evaluation network parameters, and α^Q the online evaluation network parameters; σ is the update scaling parameter and generally takes a small value.
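A PyTorch sketch of the actor update (equation (13)) and the soft target update (equation (14)); in autograd frameworks the actor gradient is realized by maximizing Q(s, μ(s)), i.e. minimizing its negative, which is equivalent to the chained gradients of equation (13):

```python
import torch

def actor_update(s, actor, critic, actor_opt):
    # Equation (13): grad_a Q(s, a) * grad_alpha mu(s); with autograd this
    # is done by minimizing -Q(s, mu(s)) w.r.t. the actor parameters.
    loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    loss.backward()
    actor_opt.step()

def soft_update(online: torch.nn.Module, target: torch.nn.Module, sigma=0.005):
    # Equation (14): alpha' <- sigma * alpha + (1 - sigma) * alpha'.
    for p, p_t in zip(online.parameters(), target.parameters()):
        p_t.data.mul_(1.0 - sigma).add_(sigma * p.data)
```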
In summary, after step (10) is completed, the network parameters in the policy network and the evaluation network are updated once, promoting the convergence of the parameters of every network in both. The network parameters in the policy network comprise the online policy network parameters α^μ and the target policy network parameters α^μ'; the network parameters in the evaluation network comprise the online evaluation network parameters α^Q and the target evaluation network parameters α^Q'. Convergence of the parameters of every network in the policy network and the evaluation network drives the walking ratio of the exoskeleton wearer toward the preset walking ratio of healthy elderly people, at which it finally stabilizes.
(11) After steps (7) to (10) are executed, one time round is completed; enumeration of that round ends, the time round is incremented by 1, and steps (7) to (10) are executed again until the parameters of every network in the policy network and the evaluation network converge. Since the exoskeleton main assistance parameter α to be optimized with the deep reinforcement learning method equals the target policy network parameter α^μ' of the target policy network, convergence of α^μ' means convergence of the exoskeleton main assistance parameter α optimized under the current episode, at which point the walking ratio of the exoskeleton wearer is stabilized at the preset walking ratio of healthy elderly people; the current episode then ends and the next episode is carried out;
(12) After steps (5) to (11) are executed, one episode e is completed; enumeration ends and e = e + 1 is set to continue executing steps (5) to (11), until, at the end of all episodes, the target policy network parameters α^μ' all converge to the same value, i.e. the exoskeleton main assistance parameter α converges to the same value. This value is regarded as the exoskeleton main assistance parameter α optimized with the deep reinforcement learning method; with it, optimal assistance of the exoskeleton is achieved, so that the walking ratio of the exoskeleton wearer always remains stabilized at the preset walking ratio of healthy elderly people, realizing the wearer's rehabilitation exercise.
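Putting the preceding steps together, a skeleton of the overall optimization loop of steps (4) to (12), reusing the helper sketches above (ReplayPool, select_action, critic_update, actor_update, soft_update); walk_interval and initial_state are assumed callbacks standing for the wearer walking for τ under the chosen assistance and for the unassisted initial walk:

```python
import torch

def collate(batch):
    # Stack sampled transitions (s, a, r, s_next) into float tensors.
    s, a, r, s_next = zip(*batch)
    to = lambda x: torch.as_tensor(x, dtype=torch.float32)
    return to(s), to(a).unsqueeze(-1), to(r).unsqueeze(-1), to(s_next)

def optimize_alpha(E, T_max, N, actor, actor_t, critic, critic_t,
                   actor_opt, critic_opt, walk_interval, initial_state):
    pool = ReplayPool()
    for episode in range(1, E + 1):                # step (4): episodes
        s = initial_state()                        # step (5): unassisted walk
        for t in range(1, T_max + 1):              # step (6): time rounds
            a = select_action(actor, s)            # step (7), equation (6)
            r, s_next = walk_interval(a)           # step (8): wearer walks tau
            pool.store(s, float(a), r, s_next)     # step (9)
            if len(pool.buffer) >= N:
                s_b, a_b, r_b, sn_b = collate(pool.sample(N))   # step (10)
                critic_update((s_b, a_b, r_b, sn_b),
                              actor_t, critic, critic_t, critic_opt)
                actor_update(s_b, actor, critic, actor_opt)
                soft_update(actor, actor_t)        # equation (14)
                soft_update(critic, critic_t)
            s = s_next                             # step (11): next round
    # Step (12): after all episodes, the converged target policy parameters
    # yield the optimized exoskeleton main assistance parameter alpha.
```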
The working principle of the invention is as follows: the exoskeleton main assistance parameter α is optimized with a deep-reinforcement-learning-based optimization method, namely the deep deterministic policy gradient method (Deep Deterministic Policy Gradient, DDPG). A policy network and an evaluation network are built to solve the continuity-control problem of the flexible exoskeleton, and the hip-joint flexion angle of the exoskeleton wearer is collected and processed in real time to generate a data set for parameter training, realizing adaptive optimization of the exoskeleton's main assistance parameter;
The advantages of the invention are as follows: applied to the rehabilitation of patients with lower-limb dysfunction, the method realizes continuous assistance of the gait-rehabilitation flexible exoskeleton. The exoskeleton main assistance parameter is determined by an exoskeleton assistance-curve equation; the deep deterministic policy gradient method in deep reinforcement learning solves the continuity-control problem of the flexible exoskeleton; and a data set for parameter training is generated by collecting and processing the wearer's hip-joint flexion angle in real time. The main assistance parameter of the flexible exoskeleton is thus adaptively optimized, which better ensures the safety of patients with lower-limb dysfunction during rehabilitation training.
(IV) description of the drawings:
Fig. 1 is a schematic diagram of the gait cycle in the deep-reinforcement-learning-based exoskeleton main assistance parameter optimization method of the invention.
Fig. 2 is a schematic flow chart of the deep-reinforcement-learning-based exoskeleton main assistance parameter optimization method.
(V) Specific embodiments:
a more detailed description will be given below in connection with specific embodiments. It is to be understood that these examples are illustrative of the present invention and are not intended to limit the scope of the present invention. After reading the specific steps and related matters taught by the present invention, those skilled in the relevant arts can make various modifications or applications of the present invention, which equivalent forms are also within the scope of the claims appended hereto.
(1) Determine the optimization parameters from the exoskeleton assistance-curve equation, which takes the compound sinusoidal form shown in equation (1):

[Equation (1): compound sinusoidal assistance curve F_assist(A, t*, T_b, α); the original equation image is not recoverable from this extraction]

where F_assist is the real-time assistance force, A is the swing-phase assistance amplitude, t* is the time elapsed from the assistance start time to the current time, T_b is the swing phase period of the current gait cycle, and α is the exoskeleton main assistance parameter, which serves as the waveform control parameter of equation (1) and shifts the position of the assistance peak; its value ranges from -1 to 1;
the swing phase power-assisted amplitude A can be regarded as a known value under the rated work of the power-assisted components and is determined by the rated output value of the power-assisted components, and the swing phase power-assisted amplitude is a known value and can be set manually under the rated work of the power-assisted components. For example: if the direct current motor is selected as the power assisting component, the rated output torque is T Force of force =9549×p/N, in N/m. Wherein P is the rated power of the motor, N is the rated rotation speed of the motor, and the rated output force of the motor can be obtained according to actual conditions, and the unit is N.
The swing phase period T_b of the current gait cycle is obtained by collecting the wearer's hip-joint flexion angle during walking with the MEMS attitude sensor to obtain the flexion-angle curve of the wearer's hip joint, and averaging the previous three swing phase periods to obtain the swing phase period of the next gait, which is used as the swing phase period of the current gait cycle. The swing phase period T_b of the current gait cycle can therefore be regarded as a known value, obtained from equation (2):

T_b(k) = [T_b(k-1) + T_b(k-2) + T_b(k-3)] / 3    (2)

The method is specifically as follows: MEMS attitude sensors are placed at the middle of the rear of the left and right thighs of the flexible-exoskeleton wearer, and the hip-joint flexion angle during normal walking is collected in real time to obtain the flexion-angle curve of the wearer's hip joint. As shown in Fig. 1, the peak time is denoted t_peak and the trough time t_trough, and the hip-joint flexion angles corresponding to the peaks and troughs are recorded. The current gait cycle, equation (3), and the swing phase period of that gait cycle, equation (4), can then be calculated as:

T(k) = t_trough(k) - t_trough(k-1)    (3)
T_b(k) = t_peak(k) - t_trough(k)    (4)

Equation (3) shows that the current gait cycle T is calculated from the values of two adjacent trough points; equation (4) shows that the swing phase period of the gait cycle is calculated from the values of adjacent peak and trough points;
Correspondingly, the maximum hip-joint flexion angle θ_max(k) and the minimum hip-joint flexion angle θ_min(k) for the current gait cycle in Fig. 2 can be obtained; they are used for the exoskeleton state at the initial time in step (5) and the exoskeleton state at the next time in step (8).
The method for acquiring the flexion-angle curve of the wearer's hip joint comprises the following steps:
(1-1) An MEMS attitude sensor acquires the hip-joint flexion angle signal of the flexible-exoskeleton wearer, converts it into a digital signal, and transmits it to a single-chip microcomputer, which forwards it to a PC (Personal Computer);
In step (1-1), the single-chip microcomputer transmits the data to the PC over a wireless link, through serial communication and a Bluetooth module.
(1-2) The hip-joint flexion angle signal is read through a serial interface in MATLAB installed on the PC, and a real-time curve of the flexion angle is drawn with the plot function.
In step (1), under rated operation of the assistance component, the swing-phase assistance amplitude A can be regarded as a known value. The hip-joint flexion angle of the wearer during walking is collected by the MEMS attitude sensor to obtain the flexion-angle curve of the wearer's hip joint, and the average of the previous three swing phase periods in that curve is used as the swing phase period of the current gait cycle, so the swing phase period T_b of the current gait cycle can also be regarded as a known value. The invention therefore optimizes only the waveform control parameter α, using the deep-reinforcement-learning-based method.
Training optimization of the exoskeleton main assistance parameter is realized with the deep deterministic policy gradient method (DDPG) in deep reinforcement learning, and the exoskeleton main assistance parameter is determined to be α. The specific flow of the deep-reinforcement-learning-based assistance-parameter optimization is shown in the flow chart of Fig. 2.
(2) Set the parameters. The walking time interval of the exoskeleton wearer is set to τ = 5-7 s for each walk; the interval may be increased appropriately, provided the wearer can walk at least 3 steps, so that the swing phase period of the current gait cycle can be acquired and the wearer can stand stably at the end of each walking interval; the exoskeleton judges the assistance condition anew after each advance. Preset the maximum number of episodes E, the batch sample size N, and the maximum number of time rounds per episode T_max. One episode corresponds to one convergence of the parameter; setting the maximum number of time rounds per episode T_max sets the number of rounds performed in each episode, each round corresponding to one time interval: each convergence of the exoskeleton main assistance parameter α requires at most T_max rounds, and each round requires the wearer to walk for a time interval τ. A time index is recorded at the start of each round: the first round starts at time t = 1, and so on, with the T_max-th round starting at time t = T_max.
(3) Design the standard configuration of the deep deterministic policy gradient method (DDPG), comprising the policy network and the evaluation network shown in Fig. 2. The policy network comprises an online policy network and a target policy network; the evaluation network comprises an online evaluation network and a target evaluation network. Initialize the online policy network μ(s|α^μ) and the online evaluation network Q(s,a|α^Q); construct the target policy network μ'(s|α^μ') and the target evaluation network Q'(s,a|α^Q'); and copy the parameters of the online policy network and the online evaluation network to the respective target network parameters, i.e. α^μ' ← α^μ and α^Q' ← α^Q. Here α is the parameter to be optimized with the deep reinforcement learning method, s denotes the exoskeleton state, and a denotes the exoskeleton action. Initialize the experience replay pool R;
Specifically, the exoskeleton state s comprises the swing-phase assistance amplitude A, the current gait cycle T, the swing phase period T_b of the current gait cycle, the hip-joint flexion angle θ of the exoskeleton wearer, the maximum hip-joint flexion angle θ_max in the current gait cycle, and the minimum hip-joint flexion angle θ_min in the current gait cycle. The exoskeleton action a is the assistance amount of the exoskeleton; the assistance direction is always positive, i.e. vertically upward. The swing-phase assistance amplitude is determined by the rated output of the assistance component and can be set manually.
(4) Enumerate the episode number e from 1 to E, i.e. perform E convergences of the exoskeleton main assistance parameter α; the exoskeleton state at the initial time is obtained at the start of each episode;
(5) An initial state is acquired.
When each episode in step (4) begins, the exoskeleton wearer must walk normally without assistance for a time interval τ, and the resulting exoskeleton state is taken as the exoskeleton state s_1 at the initial time t = 1. It specifically comprises the initial swing-phase assistance amplitude A_1, the initial hip-joint flexion angle θ_1 of the exoskeleton wearer, the initial gait cycle T_1, the swing phase period T_b1 of the initial gait cycle, the maximum hip-joint flexion angle θ_max,1 in the initial gait cycle, and the minimum hip-joint flexion angle θ_min,1 in the initial gait cycle;
The exoskeleton state s_1 at the initial time in step (5) is obtained through the following steps:
(5-1) Let the exoskeleton wearer walk normally for τ without assistance; MEMS attitude sensors placed at the middle of the rear of the wearer's left and right thighs collect the hip-joint flexion angle in real time, and the flexion angle at the end of the walk is taken as the initial hip-joint flexion angle θ_1 of the exoskeleton wearer;
(5-2) Collect the hip-joint flexion angle during unassisted normal walking in real time and obtain the flexion-angle curve of the wearer's hip joint through steps (1-1) and (1-2); as shown in Fig. 1, denote the peak time t_peak and the trough time t_trough, and record the hip-joint flexion angles corresponding to the peaks and troughs;
(5-3) The last trough time occurring before the end of the unassisted walking interval τ minus the previous trough time is taken as the initial gait cycle T_1;
(5-4) The last trough time occurring before the end of the walking interval τ minus the peak time preceding that trough is taken as swing phase period I of the initial gait cycle, denoted T_b1,1;
(5-5) The second-to-last trough time minus the peak time preceding it is taken as swing phase period II of the initial gait cycle, denoted T_b1,2;
(5-6) The third-to-last trough time minus the peak time preceding it is taken as swing phase period III of the initial gait cycle, denoted T_b1,3;
(5-7) Average the three swing phase periods obtained in steps (5-4), (5-5), and (5-6) to obtain the swing phase period of the next gait cycle, which is taken as the swing phase period of the initial gait cycle, i.e.:

T_b1 = (T_b1,1 + T_b1,2 + T_b1,3) / 3    (5)

(5-8) The hip-joint flexion angle at the last trough time is taken as the minimum hip-joint flexion angle θ_min,1 in the initial gait cycle, and the hip-joint flexion angle at the last peak time as the maximum hip-joint flexion angle θ_max,1;
(5-9) The initial swing-phase assistance amplitude A_1 equals the manually set swing-phase assistance amplitude A;
(6) Enumerate the time rounds from 1 to T_max, recording the time t at the start of each round.
(7) The online policy network selects the action of the exoskeleton at time t, as in Fig. 2, according to:

a_t = μ(s_t|α^μ) + Noise

The Noise term widens the range of values, so that the actions the exoskeleton can select at time t cover a larger range;
(8) The exoskeleton performs the action selected in step (7); the exoskeleton wearer walks under this action for a time interval τ, which yields the scalar reward r_t fed back by the flexible exoskeleton, as shown in Fig. 2, and the exoskeleton state s_{t+1} at the next time.
The scalar reward r_t fed back by the flexible exoskeleton has the specific form:

[Equation (7): reward as a function of the walking ratio W and the target W_tv; the original equation image is not recoverable from this extraction]

where W is the walking ratio and W_tv is the preset walking ratio of healthy elderly people.
Previous studies indicate that the walking ratio can be used to describe the gait pattern: for a particular subject it does not vary significantly with physical ability, walking stability, degree of concentration, and so on, and it does not differ significantly between different healthy individuals. The walking ratio of normal gait in elderly people over 60 is usually between 0.0044 and 0.0055.
The walking ratio is defined as the ratio of the step length to the step frequency, in the specific form:

[Equation (8): walking ratio W computed from D_{t+1}, N, and T_{t+1}; the original equation image is not recoverable from this extraction]

where D_{t+1} is the step length at the next time in m, N is the step frequency in steps/s, and T_{t+1} is the gait cycle at the next time in s;
The step length at the next time can be obtained from:

D_{t+1} = l(θ_max,t+1 - θ_min,t+1)

where l is the leg length of the flexible-exoskeleton wearer, θ_max,t+1 is the maximum hip-joint flexion angle in the gait cycle at the next time, and θ_min,t+1 is the minimum hip-joint flexion angle in the gait cycle at the next time.
Exoskeleton state s at next moment t+1 Comprises a swing phase power-assisted amplitude A at the next moment t+1 The next moment in time the angle of flexion theta of the hip joint of the exoskeleton wearer t+1 Gait at next momentPeriod T t+1 Swing phase period T of next moment gait period bt+1 Maximum flexion angle theta of hip joint in next moment gait cycle max T+1, minimum flexion angle θ of hip joint at next moment gait cycle min T+1; the exoskeleton state s at the next moment t+1 Obtained by the following steps:
(8-1) the exoskeleton performs the action selected in step (7), and the exoskeleton wearer walks for the time interval τ; the hip joint flexion angle parameters of the exoskeleton wearer during walking are acquired in real time by the MEMS attitude sensor, and the flexion angle of the hip joint at the moment walking ends is taken as the flexion angle θ_{t+1} of the hip joint of the exoskeleton wearer at the next moment;
(8-2) the hip joint flexion angle parameters of the exoskeleton wearer during walking are acquired in real time, and the flexion angle parameter curve of the hip joint is obtained through steps (1-1) and (1-2); the peak moments are recorded as t_peak and the trough moments as t_trough, together with the hip joint flexion angles corresponding to the peaks and troughs;
(8-3) the last trough moment occurring before the end of the walking interval τ minus the previous trough moment is taken as the gait cycle T_{t+1} at the next moment. Meanwhile, the last trough moment minus its previous peak moment is taken as swing phase period I of the gait cycle at the next moment, denoted T_{bt+1,1}; the next-to-last trough moment minus its previous peak moment is swing phase period II, denoted T_{bt+1,2}; and the third-to-last trough moment minus its previous peak moment is swing phase period III, denoted T_{bt+1,3}. Averaging the three swing phase periods, as shown in formula (10), gives the swing phase period of the next gait cycle:

T_{b,t+1} = (T_{bt+1,1} + T_{bt+1,2} + T_{bt+1,3}) / 3 (10)
(8-4) the hip joint flexion angle corresponding to the last peak moment is taken as the maximum flexion angle θ_{max,t+1} of the hip joint in the gait cycle at the next moment, and the hip joint flexion angle corresponding to the last trough moment as the minimum flexion angle θ_{min,t+1};
(8-5) the swing phase assistance amplitude A_{t+1} at the next moment is equal to the manually set swing phase assistance amplitude A;
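Steps (8-2) to (8-4) amount to peak/trough detection on the sampled hip-flexion curve. A sketch using scipy's find_peaks is shown below, assuming uniformly time-stamped samples covering at least three full strides; the prominence threshold is an illustrative assumption:

    import numpy as np
    from scipy.signal import find_peaks

    def gait_parameters(theta, t):
        """Extract T_{t+1}, T_{b,t+1}, theta_max,t+1, theta_min,t+1 from a hip-flexion curve."""
        peaks, _ = find_peaks(theta, prominence=1.0)     # wave-crest indices
        troughs, _ = find_peaks(-theta, prominence=1.0)  # wave-trough indices
        T = t[troughs[-1]] - t[troughs[-2]]              # gait cycle: last trough minus previous trough
        # swing phase periods I-III: each of the last three troughs minus its preceding peak
        Tb = np.mean([t[tr] - t[peaks[peaks < tr][-1]] for tr in troughs[-3:]])
        return T, Tb, theta[peaks[-1]], theta[troughs[-1]]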
(9) State transition process: the state s_t of the exoskeleton at moment t, the action a_t of the exoskeleton at moment t obtained in step (7), the state s_{t+1} of the exoskeleton at the moment after t obtained in step (8), and the scalar reward r_t fed back by the flexible exoskeleton are stored as a training data set in the experience replay pool R for parameter training;
the state s_t of the exoskeleton at moment t is obtained as follows: it is the same as the next-moment exoskeleton state obtained by executing step (8) in time round t−1 of the current episode. For example, the exoskeleton state s_t at moment t = 2 is the same as the next-moment exoskeleton state s_{t+1} obtained by step (8) in time round 1 of the current episode. The state s_t comprises the swing phase assistance amplitude A_t, the hip joint flexion angle θ_t of the exoskeleton wearer at moment t, the gait cycle T_t at moment t, the swing phase period T_{bt} of the gait cycle at moment t, and the maximum flexion angle θ_{max,t} and minimum flexion angle θ_{min,t} of the hip joint in the gait cycle at moment t. Specifically, at moment t = 1 the state of the exoskeleton is obtained by step (5).
(10) N state transition processes from step (9) are randomly sampled, as shown in FIG. 2, and the N tuples (s_i, a_i, r_i, s_{i+1}) are used as a batch of training data for parameter training; steps (10-1) to (10-3) below are the specific process of parameter training. Here s_i denotes the exoskeleton state of the i-th state transition process, a_i the exoskeleton action, r_i the scalar reward, and s_{i+1} the next exoskeleton state of the i-th state transition process.
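The experience replay pool R and this batch sampling can be sketched as a simple bounded buffer; the capacity is an illustrative assumption:

    import random
    from collections import deque

    class ReplayPool:
        """Experience replay pool R of (s_i, a_i, r_i, s_{i+1}) state transitions."""
        def __init__(self, capacity=100_000):
            self.buffer = deque(maxlen=capacity)  # oldest transitions are discarded first

        def store(self, s, a, r, s_next):
            self.buffer.append((s, a, r, s_next))

        def sample(self, n):
            """Uniformly sample N transitions as one batch of training data."""
            return random.sample(self.buffer, n)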
(10-1) The loss of the online evaluation network is calculated; the loss is defined in mean square error (Mean Squared Error, MSE) form, as shown in equation (11), and is used to update the online evaluation network parameters:

L(α_Q) = (1/N) Σ_i [y_i − Q(s_i, a_i | α_Q)]² (11)

where L(α_Q) is the loss function value of the online evaluation network, used for training and optimization; Q(s_i, a_i | α_Q) is the evaluation value (Q value) of the online evaluation network, whose input is the state and action of the exoskeleton in the i-th state transition process; and y_i is the target of the Q value, namely:

y_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1} | α_{μ'}) | α_{Q'}) (12)

where r_i is the scalar reward of the i-th state transition process; s_{i+1} is the next exoskeleton state of the i-th state transition process; and γ is the discount factor, γ ∈ [0,1]. The term Q'(s_{i+1}, μ'(s_{i+1} | α_{μ'}) | α_{Q'}) is a nesting of two functions: the outer function Q' is the Q function generated by the target evaluation network, whose inputs are the next exoskeleton state and action of the i-th state transition process, and the inner function μ'(s_{i+1} | α_{μ'}) is generated by the target policy network and produces the next exoskeleton action from the next exoskeleton state of the i-th state transition process;
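A PyTorch sketch of this critic step in equations (11)-(12) follows; critic, critic_target, and actor_target stand for the online evaluation, target evaluation, and target policy networks, and the batch is assumed to hold stacked tensors:

    import torch
    import torch.nn.functional as F

    def critic_update(critic, critic_target, actor_target, opt, batch, gamma=0.99):
        """Minimize L(alpha_Q), eq. (11), with the target y_i of eq. (12)."""
        s, a, r, s_next = batch                              # each of shape (N, ...)
        with torch.no_grad():
            a_next = actor_target(s_next)                    # mu'(s_{i+1} | alpha_mu')
            y = r + gamma * critic_target(s_next, a_next)    # y_i = r_i + gamma * Q'(...)
        loss = F.mse_loss(critic(s, a), y)                   # mean square error over the batch
        opt.zero_grad()
        loss.backward()
        opt.step()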
(10-2) The online policy network parameters are updated as shown in equation (13):

∇_{α_μ} J ≈ (1/N) Σ_i ∇_a Q(s, a | α_Q)|_{s=s_i, a=μ(s_i)} · ∇_{α_μ} μ(s | α_μ)|_{s=s_i} (13)

where ∇_{α_μ} J is the gradient value with respect to the online policy network parameters; ∇_a Q(s, a | α_Q) is the gradient of the Q value of the online evaluation network with respect to the action a, which is generated by the online policy network as a = μ(s_i | α_μ); and ∇_{α_μ} μ(s | α_μ) is the gradient of the policy with respect to the online policy network parameters. The two gradient terms in equation (13) are multiplied together;
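In autograd form, the chained gradient of equation (13) is obtained by simply maximizing the critic's evaluation of the actor's own action, i.e., minimizing −Q (a sketch under the same assumptions as the critic step above):

    def actor_update(actor, critic, opt, states):
        """Eq. (13): autograd chains grad_a Q and grad_alpha_mu mu automatically."""
        loss = -critic(states, actor(states)).mean()  # ascend the Q value of mu(s_i)
        opt.zero_grad()
        loss.backward()
        opt.step()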
(10-3) See fig. 2 for the slow update of the target policy network parameters and target evaluation network parameters, as shown in equation (14):

α_{Q'} ← σ α_Q + (1 − σ) α_{Q'}
α_{μ'} ← σ α_μ + (1 − σ) α_{μ'} (14)

where α_{μ'} denotes the target policy network parameters; α_μ the online policy network parameters; α_{Q'} the target evaluation network parameters; and α_Q the online evaluation network parameters. σ is the update scale parameter and typically takes a very small value, e.g. 0.001; that is, updating the target policy network parameters and the target evaluation network parameters is a slow process that largely preserves their current values.
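The slow (soft) update of equation (14) can be sketched in a few lines, with sigma = 0.001 as suggested in the text:

    def soft_update(target_net, online_net, sigma=0.001):
        """Eq. (14): alpha' <- sigma * alpha + (1 - sigma) * alpha'."""
        for p_t, p in zip(target_net.parameters(), online_net.parameters()):
            p_t.data.mul_(1.0 - sigma).add_(sigma * p.data)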
In summary, after step (10) is completed, the network parameters in the policy network and the evaluation network have been updated once, promoting the convergence of the parameters of each network. The network parameters in the policy network comprise the online policy network parameters α_μ and the target policy network parameters α_{μ'}; the network parameters in the evaluation network comprise the online evaluation network parameters α_Q and the target evaluation network parameters α_{Q'}. Finally, parameter convergence of each network in the policy network and the evaluation network is achieved; that is, the walking ratio of the exoskeleton wearer is driven toward, and finally stabilized at, the set walking ratio of healthy elderly people.
(11) After steps (7) to (10) have been executed, one time round is completed and its enumeration ends; the time round is incremented by 1 and steps (7) to (10) are executed again, until the parameters of each network in the policy network and the evaluation network converge. Since the exoskeleton main assistance parameter α to be optimized by the deep reinforcement learning method equals the target policy network parameter α_{μ'} of the target policy network, convergence of α_{μ'} means convergence of α within this episode; the walking ratio of the exoskeleton wearer is then stabilized at the set walking ratio of healthy elderly people, the current episode ends, and the next episode begins;
(12) After steps (5) to (11) have been executed, one enumeration of the episode number e is completed; let e = e + 1 and continue executing steps (5) to (11). When all episodes have ended, the target policy network parameters α_{μ'} of the target policy network all converge to the same value, i.e., the exoskeleton main assistance parameter α converges to the same value. This value is then regarded as the exoskeleton main assistance parameter α optimized by the deep reinforcement learning method, and can be used to realize optimal assistance of the exoskeleton, so that the walking ratio of the exoskeleton wearer remains stabilized at the set walking ratio of healthy elderly people, realizing rehabilitation exercise of the exoskeleton wearer.
Within one episode, the exoskeleton main assistance parameter α to be optimized may converge early: when α converges to a certain value and no longer changes while the current time round number is still less than T_max, the enumeration of that episode ends and e = e + 1.
In a continuous action space, the action is a continuous floating-point number (e.g., the exoskeleton assistance controlled by the exoskeleton assistance curve equation lies in [0, A], and includes not only the magnitude of the force but also its direction). In step (7) the online policy network therefore employs a deterministic policy μ(s | α_μ) to select the action: the output of the deterministic policy is a specific floating-point number representing a specific action, which suits the continuous action space and can be used to solve the continuity control problem of the flexible exoskeleton.
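A deterministic actor suited to this bounded continuous action space can squash its output into [0, A] with a scaled sigmoid; the six-dimensional state matches the state described above, while the hidden layer sizes are illustrative:

    import torch
    import torch.nn as nn

    class Actor(nn.Module):
        """Deterministic policy mu(s | alpha_mu) with continuous output in [0, A]."""
        def __init__(self, state_dim=6, a_max=1.0, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1), nn.Sigmoid(),  # squash to (0, 1)
            )
            self.a_max = a_max

        def forward(self, s):
            return self.a_max * self.net(s)  # assistance magnitude in [0, A]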
Finally, when all episodes have ended, the target policy network parameters of the target policy network converge to the same value, i.e., the exoskeleton main assistance parameter converges to the same value; the main assistance parameter is thereby optimized by the deep reinforcement learning method, the optimal assistance of the exoskeleton can be realized with it, the walking ratio of the exoskeleton wearer is stabilized at the set walking ratio of healthy elderly people, and rehabilitation exercise of the exoskeleton wearer is realized.
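Putting steps (4) to (12) together, the outer training loop can be sketched as follows, reusing the helper sketches above; env (the wearer-plus-exoskeleton interaction), its reset/step interface, and the to_tensors batching helper are hypothetical stand-ins:

    def train(actor, critic, actor_t, critic_t, opt_a, opt_c, env, pool,
              episodes=50, t_max=200, batch_n=64):
        for e in range(episodes):                    # step (4): enumerate episodes
            s = env.reset()                          # step (5): unassisted walk gives s_1
            for t in range(t_max):                   # step (6): enumerate time rounds
                a = select_action(actor, s)          # step (7): noisy deterministic action
                s_next, r = env.step(a)              # step (8): walk, observe r_t and s_{t+1}
                pool.store(s, a, r, s_next)          # step (9): store state transition in R
                if len(pool.buffer) >= batch_n:      # step (10): parameter training
                    batch = to_tensors(pool.sample(batch_n))  # to_tensors: assumed helper
                    critic_update(critic, critic_t, actor_t, opt_c, batch)
                    actor_update(actor, critic, opt_a, batch[0])
                    soft_update(actor_t, actor)      # step (10-3): slow target updates
                    soft_update(critic_t, critic)
                s = s_next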

Claims (10)

1. An exoskeleton main assistance parameter optimization method based on deep reinforcement learning, characterized by comprising the following steps:
(1) Determining optimization parameters;
determining the optimization parameter according to the exoskeleton assistance curve equation, the curve equation being of the compound sinusoidal form shown in formula (1):

[Formula (1): compound sinusoidal assistance curve giving the real-time assistance F_assist as a function of A, t*, T_b, and α; the original image of this formula is not reproduced in the text.]

where F_assist is the real-time assistance, A is the swing phase assistance amplitude, t* is the time from the moment assistance starts to the current moment, T_b is the swing phase period of the current gait cycle, and α is the exoskeleton main assistance parameter, acting as the waveform control parameter of formula (1) to change the assistance peak position, with value range −1 to 1;
(2) Setting parameters:

the walking time interval of the exoskeleton wearer is set to T = 5–7 s; the time interval may be increased appropriately to ensure that the exoskeleton wearer walks at least 3 steps, so that the swing phase period of the current gait cycle can be acquired, and the exoskeleton wearer should stand stably when the walking time interval ends, the assistance condition being judged again after each advance; the maximum episode number E, the batch sampling number N, and the maximum time rounds T_max of each episode are preset;
(3) The standard configuration in the deep deterministic policy gradient method is designed, specifically including the design of a policy network and an evaluation network; the policy network comprises an online policy network μ(s | α_μ) and a target policy network μ(s | α_{μ'}); the evaluation network comprises an online evaluation network Q(s, a | α_Q) and a target evaluation network Q(s, a | α_{Q'});
(4) The episode number e is enumerated from 1 to E, i.e., the exoskeleton main assistance parameter α is converged E times; at the beginning of each episode, the state of the exoskeleton at the initial moment is obtained;
(5) Acquiring the initial state:

when each episode in step (4) begins, the exoskeleton wearer is made to walk normally without assistance for a time interval of T, and the state of the exoskeleton is obtained as the state s_1 of the exoskeleton at the initial moment t = 1; it specifically comprises the swing phase assistance amplitude A_1 at the initial moment, the hip joint flexion angle θ_1 of the exoskeleton wearer at the initial moment, the gait cycle T_1 at the initial moment, the swing phase period T_{b1} of the gait cycle at the initial moment, and the maximum flexion angle θ_{max,1} and minimum flexion angle θ_{min,1} of the hip joint in the gait cycle at the initial moment;
(6) The time rounds are enumerated from 1 to T_max, and the moment t is recorded at the beginning of each time round; enumerating the time rounds means performing steps (7) to (10) T_max times within each episode, with the aim of having the exoskeleton perform T_max exoskeleton actions selected by the online policy network in each episode, thereby generating a data set for parameter training and improving the reliability of the training result; the larger the value of T_max, the more enumerations are performed and thus the more data are generated, enabling the optimized parameter to converge;
(7) The online policy network selects the action of the exoskeleton at moment t according to:

a_t = μ(s_t | α_μ) + Noise (6)

the Noise term expands the value range of the policy output, so that the exoskeleton can select from a wider range of actions at moment t;
(8) The exoskeleton performs the action selected in step (7), and the exoskeleton wearer walks for a time interval of T; based on the action performed by the exoskeleton, the scalar reward r_t fed back by the flexible exoskeleton and the exoskeleton state s_{t+1} at the next moment are obtained;
(9) State transition process:

the state s_t of the exoskeleton at moment t, the action a_t of the exoskeleton at moment t obtained in step (7), the state s_{t+1} of the exoskeleton at the moment after t obtained in step (8), and the scalar reward r_t fed back by the flexible exoskeleton are stored as a training data set in the experience replay pool R for parameter training;
(10) N state transition processes from step (9) are randomly sampled as batch training data for parameter training;
(11) After steps (7) to (10) have been executed, one time round is completed and its enumeration ends; the time round is incremented by 1 and steps (7) to (10) are executed again, until the parameters of each network in the policy network and the evaluation network converge; since the exoskeleton main assistance parameter α to be optimized by the deep reinforcement learning method equals the target policy network parameter α_{μ'} of the target policy network, convergence of α_{μ'} means convergence of α within this episode; the walking ratio of the exoskeleton wearer is then stabilized at the set walking ratio of healthy elderly people, the current episode ends, and the next episode begins;
(12) After steps (5) to (11) have been executed, one enumeration of the episode number e is completed; let e = e + 1 and continue executing steps (5) to (11); when all episodes have ended, the target policy network parameters α_{μ'} of the target policy network all converge to the same value, i.e., the exoskeleton main assistance parameter α converges to the same value; this value is then regarded as the exoskeleton main assistance parameter α optimized by the deep reinforcement learning method, and can be used to realize optimal assistance of the exoskeleton, so that the walking ratio of the exoskeleton wearer remains stabilized at the set walking ratio of healthy elderly people, realizing rehabilitation exercise of the exoskeleton wearer.
2. The method for optimizing the exoskeleton main assistance parameter based on deep reinforcement learning according to claim 1, wherein the swing phase assistance amplitude A is determined by the rated output value of the assistance component and, under rated operation of the assistance component, is a known value that can be set manually; the swing phase period T_b of the current gait cycle is obtained by collecting the hip joint flexion angle parameters of the wearer during walking with a MEMS attitude sensor to obtain the flexion angle parameter curve of the wearer's hip joint, and averaging the first three swing phase periods in the flexion angle parameter curve to obtain the swing phase period of the next gait, which is taken as the swing phase period of the current gait cycle; the swing phase period of the current gait cycle is thus a known value, obtained by equation (2):

T_b = (T_b(1) + T_b(2) + T_b(3)) / 3 (2)
3. The method for optimizing the exoskeleton main assistance parameter based on deep reinforcement learning according to claim 2, wherein the swing phase period T_b of the current gait cycle is specifically calculated as follows:

the MEMS attitude sensors are placed at the middle positions of the rear of the left and right thighs of the wearer of the flexible exoskeleton robot, and the hip joint flexion angle parameters during the wearer's normal walking are acquired in real time to obtain the flexion angle parameter curve of the wearer's hip joint; the peak moments are recorded as t_peak and the trough moments as t_trough, together with the hip joint flexion angles corresponding to the peaks and troughs; the current gait cycle shown in formula (3) and the swing phase period of the gait cycle shown in formula (4) can then be calculated as:

T(k) = t_trough(k) − t_trough(k−1) (3)
T_b(k) = t_peak(k) − t_trough(k) (4)

where formula (3) indicates that the current gait cycle T is calculated from the values of two adjacent trough points, and formula (4) that the swing phase period of the gait cycle is calculated from the values of adjacent peak and trough points; the maximum hip flexion angle θ_max(k) and minimum hip flexion angle θ_min(k) corresponding to the current gait cycle can further be obtained.
4. The method for optimizing the exoskeleton main assistance parameter based on deep reinforcement learning according to claim 2, wherein the method for acquiring the flexion angle parameter curve of the wearer's hip joint comprises the following steps:

(1-1) the hip joint flexion angle parameter signals of the wearer of the flexible exoskeleton robot are acquired with the MEMS attitude sensor, converted into digital signals, and transmitted to a single-chip microcomputer, which transmits them to the PC side; the data transmission between the single-chip microcomputer and the PC side uses serial communication, the single-chip microcomputer transmitting to the PC side through a Bluetooth module over a wireless network;

(1-2) the hip joint flexion angle parameter signal is obtained through a serial interface in MATLAB installed on the PC side, and the real-time curve of the hip joint flexion angle parameter is drawn with the plot function.
5. The method for optimizing the exoskeleton main assistance parameter based on deep reinforcement learning according to claim 1, wherein setting the maximum episode number E in step (2) means setting the number of convergences of the exoskeleton main assistance parameter α optimized by the deep reinforcement learning method, i.e., one episode corresponds to one convergence of the parameter; setting the maximum time rounds T_max of each episode means setting the number of rounds performed in each episode, each round corresponding to one time interval, i.e., each convergence of the exoskeleton main assistance parameter α completes at most T_max rounds, and each round requires the exoskeleton wearer to walk for a time interval of T; furthermore, at the beginning of each round one moment is recorded, the starting moment of a round being defined as moment t, so that the starting moment of the first round corresponds to t = 1, and so on, the starting moment of round T_max corresponding to t = T_max.
6. The method for optimizing the exoskeleton main assistance parameter based on deep reinforcement learning according to claim 1, wherein the design of the policy network and the evaluation network with the deep deterministic policy gradient method in step (3) specifically comprises the following steps:

(3-1) the online policy network μ(s | α_μ) and the online evaluation network Q(s, a | α_Q) are initialized;

(3-2) a target policy network μ(s | α_{μ'}) is constructed with the same structure as the online policy network μ(s | α_μ), and a target evaluation network Q(s, a | α_{Q'}) with the same structure as the online evaluation network Q(s, a | α_Q); the parameters of the online policy network and the online evaluation network are copied to the respective target network parameters, i.e., α_{μ'} ← α_μ and α_{Q'} ← α_Q, where the exoskeleton main assistance parameter α serves as the parameter to be optimized by the deep reinforcement learning method, s denotes the state of the exoskeleton, and a the action of the exoskeleton; the experience replay pool R is initialized;

the exoskeleton state s in step (3-2) comprises the swing phase assistance amplitude A, the current gait cycle T, the swing phase period T_b of the current gait cycle, the flexion angle θ of the hip joint of the exoskeleton wearer, and the maximum flexion angle θ_max and minimum flexion angle θ_min of the hip joint in the current gait cycle; the action a of the exoskeleton is the assistance amount of the exoskeleton, and the assistance direction of the exoskeleton is always positive, i.e., vertically upward.
7. The method for optimizing the exoskeleton main assistance parameter based on deep reinforcement learning according to claim 1, wherein the state s_1 of the exoskeleton at the initial moment in step (5) is specifically obtained by the following steps:

(5-1) the exoskeleton wearer is made to walk normally without assistance for a time interval of T; MEMS attitude sensors are placed at the middle positions of the rear of the left and right thighs of the wearer of the flexible exoskeleton robot, the hip joint flexion angle parameters during normal walking are collected in real time, and the flexion angle of the hip joint at the moment walking ends is taken as the flexion angle θ_1 of the hip joint of the exoskeleton wearer at the initial moment;

(5-2) the hip joint flexion angle parameters during normal unassisted walking are acquired in real time, the flexion angle parameter curve of the wearer's hip joint is obtained through steps (1-1) and (1-2), and the peak moments are recorded as t_peak and the trough moments as t_trough, together with the hip joint flexion angles corresponding to the peaks and troughs;

(5-3) the last trough moment occurring before the end of the unassisted walking interval T minus the previous trough moment is taken as the gait cycle T_1 at the initial moment;

(5-4) the last trough moment minus its previous peak moment is taken as swing phase period I of the gait cycle at the initial moment, denoted T_{b1,1};

(5-5) the next-to-last trough moment minus its previous peak moment is taken as swing phase period II of the gait cycle at the initial moment, denoted T_{b1,2};

(5-6) the third-to-last trough moment minus its previous peak moment is taken as swing phase period III of the gait cycle at the initial moment, denoted T_{b1,3};

(5-7) averaging the three swing phase periods obtained in steps (5-4) to (5-6) gives the swing phase period of the next gait cycle, which is taken as the swing phase period of the gait cycle at the initial moment, namely:

T_{b1} = (T_{b1,1} + T_{b1,2} + T_{b1,3}) / 3

(5-8) the hip joint flexion angle corresponding to the last trough moment is taken as the minimum flexion angle θ_{min,1} of the hip joint in the gait cycle at the initial moment, and the hip joint flexion angle corresponding to the last peak moment as the maximum flexion angle θ_{max,1};

(5-9) the swing phase assistance amplitude A_1 at the initial moment is equal to the manually set swing phase assistance amplitude A.
8. The method for optimizing the exoskeleton main assistance parameter based on deep reinforcement learning according to claim 1, wherein the scalar reward r_t fed back by the flexible exoskeleton in step (8) has the specific form:

[Formula (7): scalar reward r_t, a function of the walking ratio W and the set walking ratio W_tv; the original image of this formula is not reproduced in the text.]

where W is the walking ratio and W_tv is the set walking ratio of healthy elderly people;

the value of the walking ratio in step (8) is defined as the ratio of step length to step frequency, in the specific form shown in formula (8):

W = D_{t+1} / N (8)

where D_{t+1} is the step length at the next moment, in m; N is the step frequency, in steps/s; and T_{t+1} is the gait cycle at the next moment, in s;

the step length at the next moment can be obtained by:

D_{t+1} = l(θ_{max,t+1} − θ_{min,t+1}) (9)

where l is the leg length of the wearer of the flexible exoskeleton robot, θ_{max,t+1} is the maximum flexion angle of the hip joint in the gait cycle at the next moment, and θ_{min,t+1} is the minimum flexion angle of the hip joint in the gait cycle at the next moment.
9. The method for optimizing the exoskeleton main assistance parameter based on deep reinforcement learning according to claim 1, wherein the exoskeleton state s_{t+1} at the next moment in step (8) comprises the swing phase assistance amplitude A_{t+1} at the next moment, the hip joint flexion angle θ_{t+1} of the exoskeleton wearer at the next moment, the gait cycle T_{t+1} at the next moment, the swing phase period T_{b,t+1} of the gait cycle at the next moment, and the maximum flexion angle θ_{max,t+1} and minimum flexion angle θ_{min,t+1} of the hip joint in the gait cycle at the next moment; the exoskeleton state s_{t+1} at the next moment is obtained by the following steps:

(8-1) the exoskeleton performs the action selected in step (7), and the exoskeleton wearer walks for the time interval T; the hip joint flexion angle parameters of the exoskeleton wearer during walking are acquired in real time by the MEMS attitude sensor, and the flexion angle of the hip joint at the moment walking ends is taken as the flexion angle θ_{t+1} of the hip joint of the exoskeleton wearer at the next moment;

(8-2) the hip joint flexion angle parameters of the exoskeleton wearer during walking are acquired in real time, the flexion angle parameter curve of the hip joint is obtained through steps (1-1) and (1-2), and the peak moments are recorded as t_peak and the trough moments as t_trough, together with the hip joint flexion angles corresponding to the peaks and troughs;

(8-3) the last trough moment occurring before the end of the walking interval T minus the previous trough moment is taken as the gait cycle T_{t+1} at the next moment; meanwhile, the last trough moment minus its previous peak moment is taken as swing phase period I of the gait cycle at the next moment, denoted T_{bt+1,1}; the next-to-last trough moment minus its previous peak moment is swing phase period II, denoted T_{bt+1,2}; and the third-to-last trough moment minus its previous peak moment is swing phase period III, denoted T_{bt+1,3}; averaging the three swing phase periods, as shown in formula (10), gives the swing phase period of the next gait cycle:

T_{b,t+1} = (T_{bt+1,1} + T_{bt+1,2} + T_{bt+1,3}) / 3 (10)

(8-4) the hip joint flexion angle corresponding to the last peak moment is taken as the maximum flexion angle θ_{max,t+1} of the hip joint in the gait cycle at the next moment, and the hip joint flexion angle corresponding to the last trough moment as the minimum flexion angle θ_{min,t+1};

(8-5) the swing phase assistance amplitude A_{t+1} at the next moment is equal to the manually set swing phase assistance amplitude A;

the state s_t of the exoskeleton at moment t in step (9) is the same as the next-moment exoskeleton state obtained by executing step (8) in time round t−1 of the current episode.
10. The method for optimizing the exoskeleton main assistance parameter based on deep reinforcement learning according to claim 1, wherein the parameter training in step (10) specifically comprises the following steps:

(10-1) the loss of the online evaluation network is calculated; the loss is defined in mean square error form, as shown in equation (11), and is used to update the online evaluation network parameters:

L(α_Q) = (1/N) Σ_i [y_i − Q(s_i, a_i | α_Q)]² (11)

where L(α_Q) is the loss function value of the online evaluation network, used for training and optimization; Q(s_i, a_i | α_Q) is the evaluation value (Q value) of the online evaluation network, whose input is the state and action of the exoskeleton in the i-th state transition process; and y_i is the target of the Q value, namely:

y_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1} | α_{μ'}) | α_{Q'}) (12)

where r_i is the scalar reward of the i-th state transition process; s_{i+1} is the next exoskeleton state of the i-th state transition process; and γ is the discount factor, γ ∈ [0,1]; the term Q'(s_{i+1}, μ'(s_{i+1} | α_{μ'}) | α_{Q'}) is a nesting of two functions: the outer function Q' is the Q function generated by the target evaluation network, whose inputs are the next exoskeleton state and action of the i-th state transition process, and the inner function μ'(s_{i+1} | α_{μ'}) is generated by the target policy network and produces the next exoskeleton action from the next exoskeleton state of the i-th state transition process;

(10-2) the online policy network parameters are updated as shown in equation (13):

∇_{α_μ} J ≈ (1/N) Σ_i ∇_a Q(s, a | α_Q)|_{s=s_i, a=μ(s_i)} · ∇_{α_μ} μ(s | α_μ)|_{s=s_i} (13)

where ∇_{α_μ} J is the gradient value with respect to the online policy network parameters; ∇_a Q(s, a | α_Q) is the gradient of the Q value of the online evaluation network with respect to the action a, which is generated by the online policy network as a = μ(s_i | α_μ); and ∇_{α_μ} μ(s | α_μ) is the gradient of the policy with respect to the online policy network parameters; the two gradient terms in equation (13) are multiplied together;

(10-3) the target policy network parameters and the target evaluation network parameters are updated as shown in equation (14):

α_{Q'} ← σ α_Q + (1 − σ) α_{Q'}
α_{μ'} ← σ α_μ + (1 − σ) α_{μ'} (14)

where α_{μ'} denotes the target policy network parameters; α_μ the online policy network parameters; α_{Q'} the target evaluation network parameters; and α_Q the online evaluation network parameters; σ is the update scale parameter, whose small value means that updating the target policy network parameters and the target evaluation network parameters is a slow process; its value is related to the walking ratio of the exoskeleton wearer and takes a small value;

in summary, after step (10) is completed, the network parameters in the policy network and the evaluation network have been updated once, promoting the convergence of the parameters of each network, where the network parameters in the policy network comprise the online policy network parameters α_μ and the target policy network parameters α_{μ'}, and the network parameters in the evaluation network comprise the online evaluation network parameters α_Q and the target evaluation network parameters α_{Q'}; finally, parameter convergence of each network in the policy network and the evaluation network is achieved, i.e., the walking ratio of the exoskeleton wearer is driven toward, and finally stabilized at, the set walking ratio of healthy elderly people.
CN202011383180.8A 2020-12-01 2020-12-01 Exoskeleton main assistance parameter optimization method based on deep reinforcement learning Active CN112494282B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011383180.8A CN112494282B (en) 2020-12-01 2020-12-01 Exoskeleton main assistance parameter optimization method based on deep reinforcement learning

Publications (2)

Publication Number Publication Date
CN112494282A CN112494282A (en) 2021-03-16
CN112494282B (en) 2023-05-02


Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115016561B (en) * 2022-06-07 2024-04-19 深圳市英汉思动力科技有限公司 Walking auxiliary device control method and related equipment thereof
CN115256340B (en) * 2022-06-09 2024-06-25 天津理工大学 Double-power-assisted flexible lower limb exoskeleton system and control method
CN116898583B (en) * 2023-06-21 2024-04-26 北京长木谷医疗科技股份有限公司 Deep learning-based intelligent rasping control method and device for orthopedic operation robot
CN116807839B (en) * 2023-08-30 2023-11-28 山东泽普医疗科技有限公司 Exoskeleton rehabilitation robot gait algorithm and control system

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108451748A (en) * 2018-05-30 2018-08-28 中国工程物理研究院总体工程研究所 A kind of direct-drive type rehabilitation ectoskeleton and training method

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9943459B2 (en) * 2013-11-20 2018-04-17 University Of Maryland, Baltimore Method and apparatus for providing deficit-adjusted adaptive assistance during movement phases of an impaired joint
WO2016180073A1 (en) * 2015-05-11 2016-11-17 The Hong Kong Polytechnic University Exoskeleton ankle robot
CN106821680A (en) * 2017-02-27 2017-06-13 浙江工业大学 A kind of upper limb healing ectoskeleton control method based on lower limb gait
EP3881986A4 (en) * 2018-11-16 2022-01-26 NEC Corporation Load reduction device, load reduction method, and storage medium storing program
CN109549821B (en) * 2018-12-30 2021-07-09 南京航空航天大学 Exoskeleton robot power-assisted control system and method based on myoelectricity and inertial navigation signal fusion
CN111631923A (en) * 2020-06-02 2020-09-08 中国科学技术大学先进技术研究院 Neural network control system of exoskeleton robot based on intention recognition




Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant