CN112472530A - Reward function establishing method based on walking ratio trend change - Google Patents

Reward function establishing method based on walking ratio trend change

Info

Publication number
CN112472530A
Authority
CN
China
Prior art keywords
walking ratio
sequence
reward
walking
flexion angle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011387443.2A
Other languages
Chinese (zh)
Other versions
CN112472530B (en)
Inventor
孙磊
李云飞
董恩增
佟吉刚
陈鑫
曾德添
龚欣翔
李成辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin University of Technology
Original Assignee
Tianjin University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University of Technology filed Critical Tianjin University of Technology
Priority to CN202011387443.2A priority Critical patent/CN112472530B/en
Publication of CN112472530A publication Critical patent/CN112472530A/en
Application granted granted Critical
Publication of CN112472530B publication Critical patent/CN112472530B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • A HUMAN NECESSITIES
    • A61 MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61H PHYSICAL THERAPY APPARATUS, e.g. DEVICES FOR LOCATING OR STIMULATING REFLEX POINTS IN THE BODY; ARTIFICIAL RESPIRATION; MASSAGE; BATHING DEVICES FOR SPECIAL THERAPEUTIC OR HYGIENIC PURPOSES OR SPECIFIC PARTS OF THE BODY
    • A61H3/00 Appliances for aiding patients or disabled persons to walk about
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/11 Complex mathematical operations for solving equations, e.g. nonlinear equations, general mathematical optimization problems
    • A61H2201/00 Characteristics of apparatus not provided for in the preceding codes
    • A61H2201/16 Physical interface with patient
    • A61H2201/1602 Physical interface with patient kind of interface, e.g. head rest, knee support or lumbar support
    • A61H2201/165 Wearable interfaces
    • A61H2201/1657 Movement of interface, i.e. force application means
    • A61H2201/1659 Free spatial automatic movement of interface within a working area, e.g. Robot
    • A61H2201/50 Control means thereof
    • A61H2201/5058 Sensors or detectors
    • A61H2201/5097 Control means thereof wireless

Abstract

The invention discloses a method for establishing a reward function based on walking ratio trend change, which comprises the following steps: calculating the step length D of the wearer of the exoskeleton robot; calculating the gait cycle T(k); calculating the walking ratio W from the step length D and the gait cycle T(k); establishing a walking ratio sampling sequence and scoring the sampling sequences within it; and establishing the reward function model. The reward function model based on walking ratio trend change can be applied in algorithms for optimizing exoskeleton parameters, enhancing the efficiency of reinforcement learning and promoting rapid convergence of the exoskeleton parameters.

Description

Reward function establishing method based on walking ratio trend change
(I) Technical field:
the invention belongs to the technical field of robots and relates to a method for establishing a walking ratio reward function for a gait rehabilitation flexible exoskeleton robot, which can be applied to the adaptive control of the control parameters of a flexible exoskeleton based on a reinforcement learning method.
(II) Background art:
the flexible exoskeleton robot can assist elderly people with walking difficulties and strengthen the leg strength of the human body. It has wide application in rehabilitation, daily travel and other areas. Because of the large individual differences between people, the control parameters of the exoskeleton robot currently mostly need to be tuned to the motion characteristics of each wearer, which is time-consuming and labor-intensive and cannot track changes in the wearer's body.
Reinforcement learning can search for the optimal strategy through interaction with the environment and can learn autonomously. Applying reinforcement learning to the exoskeleton can therefore greatly improve the adaptability of the robot's parameters. Since the goal of reinforcement learning is to maximize the cumulative reward, the reward function plays a very important role. In supervised learning, the supervisory signal is provided by the training data; in reinforcement learning, the reward function acts as the supervisory signal, and the agent (Agent) optimizes its strategy according to the rewards.
The reward function is key to the learning efficiency of the agent. At present, reward functions mostly depend on design by human experts, and a poorly designed reward function makes it hard to solve some complex decision problems. Researchers have therefore proposed approaches such as Meta Learning and Imitation Learning, in which the agent learns to summarize a corresponding reward function from good strategies to guide the reinforcement learning process. However, imitation learning requires alternating iterations of Inverse Reinforcement Learning and Reinforcement Learning, which is a complicated process, and it depends on expert samples, making it unsuitable for settings where expert samples are lacking. Researchers have proposed remedies such as setting auxiliary tasks and introducing curiosity mechanisms, but these remain limited in generalization capability and require experts to provide the corresponding prior information for each specific task, so they cannot solve the sparse reward problem of reinforcement learning in a general sense.
How to design a reward function that promotes rapid convergence of the exoskeleton parameters, for the problem of flexible exoskeleton parameter adaptation, is a problem that urgently needs to be solved.
(III) Summary of the invention:
the invention aims to provide a method for establishing a reward function based on walking ratio trend change, which can overcome the defects of the prior art, can reflect the trend change of the walking ratio, calculates the step length and the gait cycle by utilizing the output data of an MEMS (Micro-Electro-Mechanical System) attitude sensor to obtain the walking ratio, and establishes the reward function based on the walking ratio trend change to promote the rapid convergence of flexible exoskeleton parameters and enhance the adaptivity of the parameters.
The technical scheme of the invention is as follows: a method for establishing a reward function based on walking ratio trend changes is characterized by comprising the following steps:
(1) collecting hip joint flexion angle parameter signals of the wearer of the flexible exoskeleton robot, and finding the maximum hip flexion angle θ_max and minimum flexion angle θ_min; given that the leg length of the wearer of the flexible exoskeleton robot is l, the step length D of the wearer can be obtained as:
D = l(θ_max - θ_min) (1)
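As a worked illustration, a minimal sketch of the step-length computation follows. The angle unit is not stated in the patent; the sketch assumes radians so that l·Δθ yields metres, and all numeric values are hypothetical.

```python
import math

def step_length(leg_length_m: float, theta_max: float, theta_min: float) -> float:
    """Step length per formula (1): D = l * (theta_max - theta_min).

    Angles are assumed to be in radians so the arc-length approximation
    l * delta_theta yields metres.
    """
    return leg_length_m * (theta_max - theta_min)

# Hypothetical values: leg length 0.9 m, hip flexion swinging from -10 to +25 degrees.
D = step_length(0.9, math.radians(25.0), math.radians(-10.0))
print(f"step length D = {D:.3f} m")  # ~0.550 m
```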
(2) placing a sensor at the middle of the rear of each of the left and right thighs of the wearer of the flexible exoskeleton robot, and collecting the hip joint flexion angle parameters of the wearer during normal walking in real time to obtain the flexion angle parameter curve of the wearer's hip joint; denoting the trough time as t_trough, the current gait cycle can then be calculated as:
T(k) = t_trough(k) - t_trough(k-1) (2)
i.e., the current gait cycle is calculated from two adjacent trough points;
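A minimal sketch of the gait-cycle computation of formula (2), assuming the trough timestamps have already been extracted from the hip flexion angle curve (trough detection itself is not shown):

```python
def gait_cycles(trough_times: list) -> list:
    """Gait cycle per formula (2): T(k) = t_trough(k) - t_trough(k-1)."""
    return [t1 - t0 for t0, t1 in zip(trough_times, trough_times[1:])]

# Hypothetical trough timestamps in seconds, read off the flexion angle curve.
print(gait_cycles([0.00, 1.12, 2.25, 3.35]))  # approximately [1.12, 1.13, 1.10]
```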
the method for acquiring the flexion angle parameter curve of the hip joint of the wearer in the step (2) is as follows:
(2-1) collecting the hip joint flexion angle parameter signal of the wearer of the flexible exoskeleton robot with an attitude sensor, converting it into a digital signal, and sending it to a single-chip microcomputer, which forwards it to a personal computer (PC); the single-chip microcomputer transmits the data to the PC over a wireless network via serial communication and a Bluetooth module.
(2-2) acquiring the hip joint flexion angle parameter signal through a serial port interface in MATLAB (matrix laboratory) installed on the PC, and drawing a real-time curve of the hip joint flexion angle parameter with the plot function;
the real-time curve of the hip joint flexion angle parameter can also be directly displayed by using third-party upper computer software, such as an anonymous upper computer.
The collection of the hip joint flexion angle parameter signals of the wearer of the flexible exoskeleton robot in steps (1) and (2) is realized by a MEMS attitude sensor equipped with an ADC (analog-to-digital converter) module.
(3) calculating the actual sampled walking ratio W at the end of one gait cycle from the step length D obtained in step (1) and the gait cycle T(k) obtained in step (2), as shown in formula (3):
W = D/N = D·T_step (3)
where W is the walking ratio, D is the step length in m, N is the step frequency in steps/min, and T_step is the gait cycle in min;
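A minimal sketch of the walking-ratio computation. The patent lists N in steps/min and T in min but leaves their relation implicit; the sketch assumes N = 1/T (one step per sampled cycle). If T were instead a two-step stride period, N would be 2/T. Numeric values are hypothetical.

```python
def walking_ratio(step_length_m: float, cycle_min: float) -> float:
    """Walking ratio per formula (3): W = D / N, assuming cadence N = 1 / T
    in steps/min when T is the gait cycle in minutes."""
    cadence = 1.0 / cycle_min  # steps/min
    return step_length_m / cadence

# Hypothetical: D = 0.55 m, gait cycle 0.56 s = 0.56/60 min -> N ~ 107 steps/min.
W = walking_ratio(0.55, 0.56 / 60.0)
print(f"W = {W:.4f} m/(steps/min)")  # ~0.0051, inside the 0.0044-0.0055 healthy range
```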
(4) establishing a walking ratio sampling sequence, taking one sampling point every other cycle, setting the scoring mechanism shown in formula (4) according to the convergence condition of the sequence, and scoring the sequence values in the analysis sequence;
|W_current - W_target| < |W_last - W_target| (4)
where W_current is the walking ratio at the current sampling point, W_target is the preset walking ratio of healthy elderly people, and W_last is the walking ratio at the previous sampling point;
In step (4), the sequence values in the analysis sequence are scored according to the scoring mechanism, specifically:
when W_current > W_target: if the walking ratio at the current sampling point is smaller than that at the previous sampling point, the sequence value for the current time is set to 1, otherwise to 0;
when W_current < W_target: if the walking ratio at the current sampling point is larger than that at the previous sampling point, the sequence value for the current time is set to 1, otherwise to 0;
selecting the m sequence values containing the current time point from the analysis sequence, recording the number of 1s in the analysis sequence as P and the number of 0s as Q, and calculating the reward value after the exoskeleton robot executes the previous action according to formula (5):
r = Maximum × (P - Q)/m (5)
where Maximum is an artificially set maximum reward value, P is the number of 1s in the analysis sequence, Q is the number of 0s in the analysis sequence, and m is the number of sampling points in the analysis sequence; the ratio (P - Q)/m represents the trend of the walking ratio over multiple cycles;
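A minimal sketch of the scoring mechanism (4) and reward (5), using the distance form of formula (4), which subsumes the two directional cases above; Maximum and the sample values are hypothetical.

```python
def score_sample(w_curr: float, w_prev: float, w_target: float) -> int:
    """Formula (4): score 1 if the current walking ratio is closer to the
    target than the previous sample was (convergent), else 0 (divergent)."""
    return 1 if abs(w_curr - w_target) < abs(w_prev - w_target) else 0

def reward(scores: list, maximum: float) -> float:
    """Formula (5): r = Maximum * (P - Q) / m, with P the count of 1s,
    Q the count of 0s, and m the analysis-window length."""
    m = len(scores)
    p = sum(scores)
    q = m - p
    return maximum * (p - q) / m

# Hypothetical walking-ratio samples drifting toward W_target = 0.005.
samples = [0.0072, 0.0068, 0.0063, 0.0065, 0.0058, 0.0054]
scores = [score_sample(curr, prev, 0.005)
          for prev, curr in zip(samples, samples[1:])]
print(scores)                        # [1, 1, 0, 1, 1] -> P = 4, Q = 1, m = 5
print(reward(scores, maximum=10.0))  # 10 * (4 - 1) / 5 = 6.0, a positive reward
```

Note that r spans [-Maximum, +Maximum]: an all-convergent window earns the full reward and an all-divergent window the full penalty, which is the reward-range constraint described in the working principle below.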
(5) scoring the sampling sequences in the walking ratio sampling sequence in turn to obtain P and Q, and obtaining the global walking-ratio-based reward value after the exoskeleton robot executes the previous action according to the global reward function, formula (5); when P is higher, i.e. P > Q, the walking ratio converges along the expected trend, i.e. toward the given walking ratio of healthy elderly people, and the exoskeleton robot receives a positive reward; when Q is higher, i.e. Q > P, the walking ratio diverges from the expected trend, and the exoskeleton robot receives a negative reward;
(6) applying the reward function model in a reinforcement learning algorithm for optimizing the exoskeleton parameters; when the value function shown in formula (6) is maximal, the obtained strategy is the optimal strategy, and the adjustment of the walking ratio can be realized, so that the exoskeleton robot assists elderly people in walking and plays a rehabilitation role.
v_π(s) = E_π(R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... | S_t = s) (6)
where v_π(s) is the value function after taking actions under strategy π in state s; R is the reward function model described above, and R_{t+1} is the reward at time t+1; γ is a reward discount factor in [0,1]; S_t is the state of the environment at time t.
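The expectation in formula (6) is over trajectories; for a single sampled trajectory the bracketed sum is the discounted return, sketched below with hypothetical rewards:

```python
def discounted_return(rewards: list, gamma: float) -> float:
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...; v_pi(s) in
    formula (6) is the expectation of this quantity under strategy pi."""
    g = 0.0
    for r in reversed(rewards):  # Horner-style backward accumulation
        g = r + gamma * g
    return g

print(discounted_return([6.0, 2.0, -2.0, 10.0], gamma=0.9))  # ~13.47
```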
The working principle of the invention is as follows: the range of the reward function is an important parameter that relates to the effectiveness of reward shaping and has a strong influence on a simple reinforcement learning algorithm at runtime. Therefore, a maximum reward value is set and the reward is determined according to the ratio, which constrains the reward range. When setting the global reward function, the goal is to constrain the overall trend of the walking ratio over multiple cycles. The trend of the walking ratio is therefore scored: if the current trend converges toward the target walking ratio, the score is set to 1; if the current trend diverges, it is set to 0. Recording the number of 1s as P and the number of 0s as Q, the ratio (P - Q)/m represents the trend of the walking ratio over multiple cycles: P > Q indicates convergence and gives a positive reward, while Q > P indicates divergence and gives a negative reward. The magnitude of the ratio indicates the degree of divergence or convergence; better convergence yields a larger reward value, and overall divergence yields a smaller one. The reward function is thus determined as shown in formula (5).
The problem of making the walking ratio over multiple cycles converge according to the expected trend is solved through a globally constrained reward function. The purpose of this reward function is to solve the problem of exoskeleton parameter adaptation. Whether the walking ratio reaches the walking ratio of healthy elderly people is used as the criterion for judging the adaptation of the exoskeleton parameters. With this reward function, after the agent executes the previous action, the value of the current walking ratio is calculated and the reward is obtained from the reward function. The agent, the intelligent part of the robot, is placed in the environment to explore and learn. By accumulating the maximum reward, the agent adjusts its next action and outputs parameters better suited to elderly walking; the action with the maximum reward corresponds to the parameters that keep the walking ratio at the healthy elderly walking ratio, which benefits the adaptive optimization of the exoskeleton parameters.
The method is mainly used in a parameter optimization algorithm for the exoskeleton robot. The criterion for the reasonableness of the current exoskeleton parameters is whether the gait information matches that of healthy elderly people. To judge gait in real time, this project adopts the concept of the walk ratio to describe the motion state of the human body, defined as the ratio of step length (m) to step frequency (steps/min). Previous studies have shown that the walking ratio can be used to describe the gait pattern, and that for a particular subject it does not vary significantly with physical performance, walking stability, concentration, etc. The walking ratio shows no significant difference across healthy individuals, and the walking ratio of normal gait in elderly people aged 60 or older lies between 0.0044 and 0.0055.
The invention has the advantages that: the reward value is determined according to the variation trend of the walking ratio, and the walking ratio is globally constrained over multiple cycles, which solves the problem in reinforcement learning that the behaviors of the flexible exoskeleton robot need to be constrained and scored by a reward function; the reward mechanism is simple and easy to implement. While improving the learning efficiency of the agent, the sparse reward problem is avoided, i.e., there is no long period during learning in which the agent obtains no reward. Blind exploration by the algorithm is effectively avoided, the reinforcement learning efficiency of the flexible exoskeleton robot is improved, the robustness of the exoskeleton robot is enhanced, convergence of the exoskeleton parameters according to the expected trend is ensured, and the adaptability of the exoskeleton parameters is improved.
(IV) Description of the drawings:
fig. 1 is a schematic diagram illustrating an analysis sequence scoring principle of a walking trending reward mechanism in a method for establishing a reward function based on walking ratio trend changes according to the present invention.
Fig. 2 is a schematic view illustrating a gait cycle calculation principle in a method for establishing a reward function based on a walking ratio trend change according to the present invention.
Fig. 3 is a graph schematically showing the time-dependent change of the hip joint flexion angle in the method for establishing the reward function based on the walking ratio trend change according to the present invention.
(V) Specific embodiment:
example (b): a method for establishing a reward function based on walking ratio trend changes is characterized by comprising the following steps:
(1) collecting hip joint flexion angle parameter signals of the wearer of the flexible exoskeleton robot with the MEMS attitude sensor, and finding the maximum hip flexion angle θ_max and minimum flexion angle θ_min; given that the leg length of the wearer of the flexible exoskeleton robot is l, the step length D of the wearer can be obtained as:
D = l(θ_max - θ_min) (1)
(2) placing the MEMS attitude sensors at the middle of the rear of the left and right thighs of the wearer of the flexible exoskeleton robot, and collecting the hip joint flexion angle parameters of the wearer during normal walking in real time to obtain the flexion angle parameter curve of the wearer's hip joint; as shown in fig. 2, denoting the trough time as t_trough, the current gait cycle can then be calculated as:
T(k) = t_trough(k) - t_trough(k-1) (2)
i.e., the current gait cycle is calculated from two adjacent trough points;
the method for acquiring the flexion angle parameter curve of the hip joint of the wearer specifically comprises the following steps:
(2-1) converting the hip joint flexion angle parameter signal of the wearer of the flexible exoskeleton robot into a digital signal and sending it to a single-chip microcomputer, which transmits it to the PC over a wireless network via serial communication and a Bluetooth module;
(2-2) acquiring the hip joint flexion angle parameter signal through a serial port interface in MATLAB (matrix laboratory) installed on the PC, and drawing a real-time curve of the hip joint flexion angle parameter with the plot function; the curve can also be displayed directly with third-party host computer software, such as the anonymous host computer.
(3) calculating the actual sampled walking ratio W at the end of one gait cycle from the step length D obtained in step (1) and the gait cycle T(k) obtained in step (2), as shown in formula (3):
W = D/N = D·T_step (3)
where W is the walking ratio, D is the step length in m, N is the step frequency in steps/min, and T_step is the gait cycle in min;
(4) establishing a walking ratio sampling sequence, taking one sampling point every other cycle, setting the scoring mechanism shown in formula (4) according to the convergence condition of the sequence as shown in fig. 1, and scoring the sequence values in the analysis sequence;
|W_current - W_target| < |W_last - W_target| (4)
where W_current is the walking ratio at the current sampling point, W_target is the preset walking ratio of healthy elderly people, and W_last is the walking ratio at the previous sampling point;
when W_current > W_target: if the walking ratio at the current sampling point is smaller than that at the previous sampling point, the sequence value for the current time is set to 1, otherwise to 0;
when W_current < W_target: if the walking ratio at the current sampling point is larger than that at the previous sampling point, the sequence value for the current time is set to 1, otherwise to 0;
selecting the m sequence values containing the current time point from the analysis sequence, recording the number of 1s in the analysis sequence as P and the number of 0s as Q (the specific working mode is shown in fig. 1), and calculating the reward value after the exoskeleton robot executes the previous action according to formula (5):
r = Maximum × (P - Q)/m (5)
where Maximum is an artificially set maximum reward value, P is the number of 1s in the analysis sequence, Q is the number of 0s in the analysis sequence, and m is the number of sampling points in the analysis sequence; the ratio (P - Q)/m represents the trend of the walking ratio over multiple cycles;
(5) scoring the sampling sequences in the walking ratio sampling sequence in turn to obtain P and Q, and obtaining the global walking-ratio-based reward value after the exoskeleton robot executes the previous action according to the global reward function, formula (5); when P is higher, i.e. P > Q, the walking ratio converges along the expected trend, i.e. toward the given walking ratio of healthy elderly people, and the exoskeleton robot receives a positive reward; when Q is higher, i.e. Q > P, the walking ratio diverges from the expected trend, and the exoskeleton robot receives a negative reward;
(6) applying the reward function model in a reinforcement learning algorithm for optimizing the exoskeleton parameters; when the value function shown in formula (6) is maximal, the obtained strategy is the optimal strategy, and the adjustment of the walking ratio can be realized, so that the exoskeleton robot assists elderly people in walking and plays a rehabilitation role.
v_π(s) = E_π(R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... | S_t = s) (6)
where v_π(s) is the value function after taking actions under strategy π in state s; R is the reward function model described above, and R_{t+1} is the reward at time t+1; γ is a reward discount factor in [0,1]; S_t is the state of the environment at time t.
The following examples are given for illustrative purposes. It should be understood that these examples are for illustration only and are not intended to limit the scope of the invention. After reading the detailed steps and associated content of the invention, a skilled person may make various modifications or applications of the invention, and such equivalents also fall within the scope of the claims appended to this application.
For example, in an algorithm that optimizes the exoskeleton's power-assist parameters, the actor selects a_t according to the action strategy under the given initial parameters and gives a_t to the flexible exoskeleton to execute.
Here a_t is the action selected by the Agent at time t; after the environment executes the action, the environment state transitions from s_t to s_{t+1}; s_t is the state the Agent receives from the flexible exoskeleton at time t; s_{t+1} is the next state, and r_t is the scalar reward received as feedback from the flexible exoskeleton;
r_t is the reward function in the present invention, expressed as
r_t = Maximum × (P - Q)/m
where Maximum is an artificially set maximum reward value, P is the number of 1s in the analysis sequence, Q is the number of 0s in the analysis sequence, and m is the number of sampling points in the analysis sequence; the ratio (P - Q)/m represents the trend of the walking ratio over multiple cycles;
the derivation of Q and P in the reward function is shown in the scoring mechanism of fig. 1, which establishes a sequence of step ratio samples, taking one sample at every other cycle:
when W_current > W_target: if the walking ratio at the current sampling point is smaller than that at the previous sampling point, the sequence value for the current time is set to 1, otherwise to 0;
when W_current < W_target: if the walking ratio at the current sampling point is larger than that at the previous sampling point, the sequence value for the current time is set to 1, otherwise to 0;
then the m sequence values containing the current time point are selected; the number of 1s in the analysis sequence is P, and the number of 0s is Q.
The walking ratio W is calculated by the formula:
W = D/N = D·T_step
where W is the walking ratio, D is the step length in m, N is the step frequency in steps/min, and T_step is the gait cycle in min;
In this formula, the step length D is calculated as D = l(θ_max - θ_min),
where θ_max is the maximum flexion angle of the hip joint, θ_min is the minimum flexion angle of the hip joint, and l is the leg length of the wearer of the flexible exoskeleton robot. The measurement of the maximum flexion angle θ_max and the minimum flexion angle θ_min of the hip joint is realized by the MEMS attitude sensor.
As shown in fig. 2, one gait cycle contains two troughs and one crest; denoting the trough time as t_trough, the gait cycle T_step is given by T(k) = t_trough(k) - t_trough(k-1).
The flexible exoskeleton executes a_t and returns the r_t and s_{t+1} obtained from this exploration.
The actor stores the state-transition tuple (s_t, a_t, r_t, s_{t+1}) in an experience pool; the state and action parameters at the current time, re-observed through a long short-term memory (LSTM) network, are then used as the data set for training the online network. At the same time, the (s_t, a_t, r_t, s_{t+1}) tuples obtained by exploration are fed into the reward function, the aim being to apply the reward constraint, provide reference data for the online strategy network and the online Q network, and promote rapid convergence of the flexible exoskeleton parameters.
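A minimal sketch of the experience pool described above; a generic replay buffer under common reinforcement-learning conventions, not the patent's exact implementation (the LSTM re-observation step is omitted):

```python
import random
from collections import deque

class ExperiencePool:
    """Stores (s_t, a_t, r_t, s_t+1) transitions and serves random batches
    as reference data for training the online strategy/Q networks."""

    def __init__(self, capacity: int = 10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions drop out first

    def store(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size: int):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

# Usage: after the exoskeleton executes a_t and the walking-ratio reward r_t
# is computed from formula (5), the transition is stored for later training.
pool = ExperiencePool()
pool.store(s=[0.0051], a=[0.3], r=6.0, s_next=[0.0049])
batch = pool.sample(32)
```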

Claims (7)

1. A method for establishing a reward function based on walking ratio trend changes is characterized by comprising the following steps:
(1) collecting hip joint flexion angle parameter signals of the wearer of the flexible exoskeleton robot, and finding the maximum hip flexion angle θ_max and minimum flexion angle θ_min; given that the leg length of the wearer of the flexible exoskeleton robot is l, the step length D of the wearer can be obtained as:
D = l(θ_max - θ_min) (1)
(2) placing a sensor at the middle of the rear of each of the left and right thighs of the wearer of the flexible exoskeleton robot, and collecting the hip joint flexion angle parameters of the wearer during normal walking in real time to obtain the flexion angle parameter curve of the wearer's hip joint; denoting the trough time as t_trough, the current gait cycle can then be calculated as:
T(k) = t_trough(k) - t_trough(k-1) (2)
i.e., the current gait cycle is calculated from two adjacent trough points;
(3) calculating the actual sampled walking ratio W at the end of one gait cycle from the step length D obtained in step (1) and the gait cycle T(k) obtained in step (2), as shown in formula (3):
W = D/N = D·T_step (3)
where W is the walking ratio, D is the step length in m, N is the step frequency in steps/min, and T_step is the gait cycle in min;
(4) establishing a walking ratio sampling sequence, taking one sampling point every other cycle, setting the scoring mechanism shown in formula (4) according to the convergence condition of the sequence, and scoring the sequence values in the analysis sequence;
|W_current - W_target| < |W_last - W_target| (4)
where W_current is the walking ratio at the current sampling point, W_target is the preset walking ratio of healthy elderly people, and W_last is the walking ratio at the previous sampling point;
(5) scoring the sampling sequences in the walking ratio sampling sequence in turn to obtain P and Q, and obtaining the global walking-ratio-based reward value after the exoskeleton robot executes the previous action according to the global reward function, formula (5); when P is higher, i.e. P > Q, the walking ratio converges along the expected trend, i.e. toward the given walking ratio of healthy elderly people, and the exoskeleton robot receives a positive reward; when Q is higher, i.e. Q > P, the walking ratio diverges from the expected trend, and the exoskeleton robot receives a negative reward;
(6) applying the reward function model in a reinforcement learning algorithm for optimizing the exoskeleton parameters; when the value function shown in formula (6) is maximal, the obtained strategy is the optimal strategy, i.e., the adjustment of the walking ratio can be realized, so that the exoskeleton robot assists elderly people in walking and plays a rehabilitation role;
v_π(s) = E_π(R_{t+1} + γR_{t+2} + γ²R_{t+3} + ... | S_t = s) (6)
where v_π(s) is the value function after taking actions under strategy π in state s; R is the reward function model described above, and R_{t+1} is the reward at time t+1; γ is a reward discount factor in [0,1]; S_t is the state of the environment at time t.
2. The method for establishing a reward function based on the trend change of the walking ratio as claimed in claim 1, wherein the curve of the flexion angle parameter of the hip joint of the wearer in the step (2) is obtained by:
(2-1) collecting the hip joint flexion angle parameter signal of the wearer of the flexible exoskeleton robot with an attitude sensor, converting it into a digital signal, sending it to a single-chip microcomputer, and then sending it to a PC (personal computer);
(2-2) acquiring the hip joint flexion angle parameter signal through a serial port interface in MATLAB installed on the PC, and drawing a real-time curve of the hip joint flexion angle parameter with the plot function.
3. The method according to claim 2, wherein in step (2-1) the data transmission between the single-chip microcomputer and the PC is performed by the single-chip microcomputer transmitting the data to the PC over a wireless network via serial communication and a Bluetooth module.
4. The method as claimed in claim 2, wherein the real-time curve of the hip joint flexion angle parameter in step (2-2) can also be displayed directly using third-party host computer software.
5. The method as claimed in claim 4, wherein the third-party host computer software is the anonymous host computer.
6. The method for establishing a reward function based on walking ratio trend changes as claimed in claim 1, wherein the collection of the hip joint flexion angle parameter signals of the wearer of the flexible exoskeleton robot in steps (1) and (2) is implemented by a MEMS attitude sensor equipped with an ADC conversion module.
7. The method for establishing a reward function based on walking ratio trend changes according to claim 1, wherein the scoring of the sequence values in the analysis sequence according to the scoring mechanism in step (4) specifically comprises:
when W_current > W_target: if the walking ratio at the current sampling point is smaller than that at the previous sampling point, the sequence value for the current time is set to 1, otherwise to 0;
when W_current < W_target: if the walking ratio at the current sampling point is larger than that at the previous sampling point, the sequence value for the current time is set to 1, otherwise to 0;
selecting the m sequence values containing the current time point from the analysis sequence, recording the number of 1s in the analysis sequence as P and the number of 0s as Q, and calculating the reward value after the exoskeleton robot executes the previous action according to formula (5):
r = Maximum × (P - Q)/m (5)
where Maximum is an artificially set maximum reward value, P is the number of 1s in the analysis sequence, Q is the number of 0s in the analysis sequence, and m is the number of sampling points in the analysis sequence; the ratio (P - Q)/m represents the trend of the walking ratio over multiple cycles.
CN202011387443.2A 2020-12-01 2020-12-01 Reward function establishing method based on walking ratio trend change Active CN112472530B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011387443.2A CN112472530B (en) 2020-12-01 2020-12-01 Reward function establishing method based on walking ratio trend change

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011387443.2A CN112472530B (en) 2020-12-01 2020-12-01 Reward function establishing method based on walking ratio trend change

Publications (2)

Publication Number Publication Date
CN112472530A true CN112472530A (en) 2021-03-12
CN112472530B CN112472530B (en) 2023-02-03

Family

ID=74938781

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011387443.2A Active CN112472530B (en) 2020-12-01 2020-12-01 Reward function establishing method based on walking ratio trend change

Country Status (1)

Country Link
CN (1) CN112472530B (en)


Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007063633A1 (en) * 2005-11-30 2007-06-07 Japan Science And Technology Agency Phase reaction curve learning method and device, periodic motion control method and device, and walking control device
CN109199783A (en) * 2017-07-04 2019-01-15 中国科学院沈阳自动化研究所 A kind of control method controlling rehabilitation of anklebone equipment rigidity using sEMG
US20190083002A1 (en) * 2017-09-20 2019-03-21 Samsung Electronics Co., Ltd. Method and apparatus for updating personalized gait policy
US20200323726A1 (en) * 2018-02-08 2020-10-15 Parker-Hannifin Corporation Advanced gait control system and methods enabling continued walking motion of a powered exoskeleton device
CN108785997A (en) * 2018-05-30 2018-11-13 燕山大学 A kind of lower limb rehabilitation robot Shared control method based on change admittance
US20200272905A1 (en) * 2019-02-26 2020-08-27 GE Precision Healthcare LLC Artificial neural network compression via iterative hybrid reinforcement learning approach
CN110764416A (en) * 2019-11-11 2020-02-07 河海大学 Humanoid robot gait optimization control method based on deep Q network
CN110812131A (en) * 2019-11-28 2020-02-21 深圳市迈步机器人科技有限公司 Gait control method and control system of exoskeleton robot and exoskeleton robot
CN111604890A (en) * 2019-12-30 2020-09-01 合肥工业大学 Motion control method suitable for exoskeleton robot
CN111515938A (en) * 2020-05-28 2020-08-11 河北工业大学 Lower limb exoskeleton walking trajectory tracking method based on inheritance type iterative learning control
CN111546349A (en) * 2020-06-28 2020-08-18 常州工学院 New deep reinforcement learning method for humanoid robot gait planning

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
孟凡成: "Research on reinforcement learning control methods for upper-limb rehabilitation robots" (上肢康复机器人的增强学习控制方法研究), China Doctoral Dissertations Full-text Database, Information Science and Technology Series *

Also Published As

Publication number Publication date
CN112472530B (en) 2023-02-03

Similar Documents

Publication Publication Date Title
Bao et al. A CNN-LSTM hybrid model for wrist kinematics estimation using surface electromyography
CN107378944B (en) Multidimensional surface electromyographic signal artificial hand control method based on principal component analysis method
CN109262618B (en) Muscle cooperation-based upper limb multi-joint synchronous proportional myoelectric control method and system
EP3743901A1 (en) Real-time processing of handstate representation model estimates
CN110675933B (en) Finger mirror image rehabilitation training system
CN110232412B (en) Human gait prediction method based on multi-mode deep learning
CN109106351B (en) Human body fatigue detection and slow release system and method based on Internet of things perception
CN109223453B (en) Power-assisted exoskeleton device based on regular walking gait learning
Xu et al. A prosthetic arm based on EMG pattern recognition
CN112494282A (en) Exoskeleton main power parameter optimization method based on deep reinforcement learning
Rai et al. Mode-free control of prosthetic lower limbs
CN101695444A (en) Foot acceleration information acquisition system and acceleration information acquisition method thereof
CN112472530B (en) Reward function establishing method based on walking ratio trend change
CN113262088B (en) Multi-degree-of-freedom hybrid control artificial hand with force feedback and control method
CN109765906A (en) A kind of intelligent ship tracking method based on Compound Orthogonal Neural Network PREDICTIVE CONTROL
CN117472183A (en) Personalized dynamic rehabilitation man-machine interaction method and related equipment
CN110109904B (en) Environment-friendly big data oriented water quality soft measurement method
CN111062247A (en) Human body movement intention prediction method oriented to exoskeleton control
Mishra et al. Error minimization and energy conservation by predicting data in wireless body sensor networks using artificial neural network and analysis of error
Shi et al. Wearable device monitoring exercise energy consumption based on Internet of things
CN109431510A (en) A kind of flexible gait monitoring device calculated based on artificial intelligence
CN111403019B (en) Method for establishing ankle joint artificial limb model, model-free control method and verification method
CN114118371A (en) Intelligent agent deep reinforcement learning method and computer readable medium
Zhang et al. The prediction of heart rate during running using Bayesian combined predictor
CN117621051A (en) Active human-computer cooperation method based on human body multi-mode information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant