CN116975695B - Limb movement recognition system based on multi-agent reinforcement learning - Google Patents

Limb movement recognition system based on multi-agent reinforcement learning

Info

Publication number
CN116975695B
CN116975695B
Authority
CN
China
Prior art keywords
action
auxiliary
function
agent
sliding window
Prior art date
Legal status
Active
Application number
CN202311104223.8A
Other languages
Chinese (zh)
Other versions
CN116975695A (en)
Inventor
姬冰
潘庆涛
王昊
Current Assignee
Shandong University
Original Assignee
Shandong University
Priority date
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202311104223.8A priority Critical patent/CN116975695B/en
Publication of CN116975695A publication Critical patent/CN116975695A/en
Application granted granted Critical
Publication of CN116975695B publication Critical patent/CN116975695B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/092Reinforcement learning
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Abstract

The invention provides a limb movement recognition system based on multi-agent reinforcement learning. A teacher auxiliary network is introduced to perceive potential movement signals outside the sliding window, and the auxiliary actions output by the teacher auxiliary network together with the strategy actions output by the agents guide the adjustment of the sliding window in the search for the optimal movement signals. This avoids the undirected exploration that arises when existing agents learn only the state of the current sliding window.

Description

Limb movement recognition system based on multi-agent reinforcement learning
Technical Field
The invention belongs to the technical field of motion recognition, and particularly relates to a limb motion recognition system based on multi-agent reinforcement learning.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
In intensive care units, patients may experience delirium and agitation, which can have adverse consequences if not detected in time. A camera in the intensive care unit can monitor the patient's movements in real time, and because such episodes manifest mainly in the limbs, analysing the patient's limb movements to identify delirium is a promising approach. However, since the patient's limb movements are highly irregular and delirium appears randomly in each limb, it is usually difficult to capture the motion segment that best reflects the delirious movement, and a technique is needed to extract the most valuable motion segment from the original motion sequence. At the same time, the similarity between delirious and normal movements, together with the large differences among different delirious movements, strongly interferes with delirium movement recognition.
Multi-agent reinforcement learning can serve as an effective paradigm for extracting optimal motion segments from the patient's limb movements. Specifically, a sliding window is initialized on the motion signal of each limb, each agent learns the motion state corresponding to the sliding window in its limb motion signal, and the agent outputs actions according to the learned state; these actions adjust the sliding window until the optimal motion segment is found. Because this work uses the motion signals of the patient's four limbs, four agents jointly extract the optimal motion segments of the limbs. However, existing multi-agent reinforcement learning algorithms suffer from two limitations: (1) they learn only the state of the current sliding window, so the agents output random actions to explore the state space; this exploration is undirected, and it is difficult to accurately find the optimal state relevant to the task. (2) Multi-agent reinforcement learning uses a team reward to optimize the policy of each agent, which makes it hard to quantify the contribution of each agent; this leads to a lack of collaboration among the agents, which is critical to team success.
In addition, multi-domain-feature-driven ensemble learning helps to learn fine-grained features that are difficult to mine from time-domain motion signals, because ensemble learning can integrate complementary classifiers learned from multi-domain features to jointly characterize the time-domain motion signal, thereby learning fine-grained features from multiple knowledge perspectives and improving the discrimination of motions that are otherwise hard to classify. However, ensemble learning performance depends heavily on the accuracy and the diversity of the classifiers, which requires a careful trade-off between these two conflicting objectives.
Disclosure of Invention
In order to overcome the above defects in the prior art, the invention provides a limb movement recognition system based on multi-agent reinforcement learning into which a teacher auxiliary network is introduced. The teacher auxiliary network pre-senses the direction of the optimal movement segment, so that the agent takes the teacher network's auxiliary action into account during exploration and thereby avoids undirected exploration.
To achieve the above object, a first aspect of the present invention provides a limb movement recognition system based on multi-agent reinforcement learning, comprising:
an acquisition module configured to: acquiring original motion signals of different limbs;
an agent module configured to: constructing agent individuals corresponding to different limbs, and learning the original movement signals of the limbs within the sliding window based on a reinforcement learning algorithm by utilizing the agent individuals, to obtain strategy actions;
a teacher assist module configured to: respectively learning original motion signals of different limbs outside the sliding window by using a teacher auxiliary network to obtain auxiliary actions;
an exploration module configured to: adjusting the sliding window based on the obtained strategy actions and auxiliary actions of different limbs to obtain optimal motion signals of different limbs;
a multi-domain feature module configured to: respectively encoding according to the original motion signals and the optimal motion signals of different limbs to obtain multi-domain characteristics;
an identification module configured to: and inputting the obtained multi-domain features into a trained classifier to obtain a recognition result.
A second aspect of the present invention provides a computer apparatus comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory in communication over the bus when the computer device is running, the machine-readable instructions when executed by the processor performing the steps of:
acquiring original motion signals of different limbs;
constructing agent individuals corresponding to different limbs, and learning the original movement signals of the limbs within the sliding window based on a reinforcement learning algorithm by utilizing the agent individuals, to obtain strategy actions;
respectively learning original motion signals of different limbs outside the sliding window by using a teacher auxiliary network to obtain auxiliary actions;
adjusting the sliding window based on the obtained strategy actions and auxiliary actions of different limbs to obtain optimal motion signals of different limbs;
respectively encoding according to the original motion signals and the optimal motion signals of different limbs to obtain multi-domain characteristics;
and inputting the obtained multi-domain features into a trained classifier to obtain a recognition result.
A third aspect of the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of:
acquiring original motion signals of different limbs;
constructing agent individuals corresponding to different limbs, and learning the original movement signals of the limbs within the sliding window based on a reinforcement learning algorithm by utilizing the agent individuals, to obtain strategy actions;
respectively learning original motion signals of different limbs outside the sliding window by using a teacher auxiliary network to obtain auxiliary actions;
adjusting the sliding window based on the obtained strategy actions and auxiliary actions of different limbs to obtain optimal motion signals of different limbs;
respectively encoding according to the original motion signals and the optimal motion signals of different limbs to obtain multi-domain characteristics;
and inputting the obtained multi-domain features into a trained classifier to obtain a recognition result.
The one or more technical solutions above have the following beneficial effects:
in the invention, a teacher auxiliary network is introduced to perceive potential motion signals outside the sliding window; the auxiliary actions output by the teacher auxiliary network and the strategy actions output by the agents jointly guide the adjustment of the sliding window in the search for the optimal motion signals, avoiding the undirected exploration that arises when existing agents learn only the state of the current sliding window.
In the invention, a team-intrinsic reward mechanism is proposed to quantify the contributions of the multiple agents in the current-potential dual-view guided multi-agent reinforcement learning, and a diversified reward distribution is constructed for the team-individual collaborative representation, so that the agents are motivated to achieve excellent team collaboration through contribution-related diversified rewards.
In the invention, the classifiers are trained using the extracted multi-domain features, namely time-domain features (global-local features), frequency-domain features (discrete wavelets) and image-domain features (recurrence plots and Gramian angular fields), so that the original motion can be jointly represented from the perspectives of multiple domains, and fine-grained features can be learned by integrating the optimal classifier combination, thereby better distinguishing motions of different categories.
Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
FIG. 1 is a flow chart of limb movement recognition based on multi-agent reinforcement learning in accordance with an embodiment of the present invention;
FIG. 2 is an overall frame diagram of a limb movement recognition system based on multi-agent reinforcement learning in accordance with an embodiment of the present invention;
FIG. 3 is a diagram of a teacher-aided strategy algorithm in multi-agent reinforcement learning in accordance with an embodiment of the present invention;
FIG. 4 is a diagram of the team-intrinsic reward mechanism in multi-agent reinforcement learning in accordance with the first embodiment of the invention.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of exemplary embodiments according to the present invention.
Embodiments of the invention and features of the embodiments may be combined with each other without conflict.
Example 1
The embodiment discloses a limb movement recognition system based on multi-agent reinforcement learning, comprising:
an acquisition module configured to: acquiring original motion signals of different limbs;
an agent module configured to: constructing agent individuals corresponding to different limbs, and learning the original movement signals of the limbs within the sliding window based on a reinforcement learning algorithm by utilizing the agent individuals, to obtain strategy actions;
a teacher assist module configured to: respectively learning original motion signals of different limbs outside the sliding window by using a teacher auxiliary network to obtain auxiliary actions;
an exploration module configured to: adjusting the sliding window based on the obtained strategy actions and auxiliary actions of different limbs to obtain optimal motion signals of different limbs;
a multi-domain feature module configured to: respectively encoding according to the original motion signals and the optimal motion signals of different limbs to obtain multi-domain characteristics;
an identification module configured to: and inputting the obtained multi-domain features into a trained classifier to obtain a recognition result.
The general inventive concept of this embodiment: in the current-potential dual-view guided multi-agent reinforcement learning, the current view is embedded into the strategy network while the potential view is embedded into the teacher auxiliary network designed in this embodiment, which pre-senses the direction of the optimal movement segment. Because of this efficient state-prior exploration, the agent considers the representation of the potential state when making decisions, thereby avoiding undirected exploration. To facilitate collaboration among the multiple agents, the present embodiment further provides a team-intrinsic reward mechanism to quantify the contributions of the agents in the current-potential dual-view guided multi-agent reinforcement learning, building a diversified reward distribution for the team-individual collaborative representation and thereby motivating the agents to achieve superior team collaboration through contribution-related diversified rewards.
The multi-domain-feature-driven self-contrast ensemble learning compares and selects the optimal classifier combination from the candidate classifier combinations by self-evaluating the ensemble performance of the candidate combinations, thereby achieving a full trade-off between classifier diversity and performance. The classifiers are learned from multi-domain features, including time-domain features (the original motion), frequency-domain features (discrete wavelets) and image-domain features (recurrence plots and Gramian angular fields), which helps to jointly characterize the original motion from a multi-domain perspective. Fine-grained features are learned by integrating the optimal classifier combination, so as to better distinguish between different classes of motion.
In this embodiment, the limb movement recognition based on multi-agent reinforcement learning specifically includes:
step 1: and identifying skeleton points of the patient in the video and calculating motion signals of the skeleton points of the limbs.
Step 2: the sliding windows are randomly initialized on the limb movement signals of the patient, respectively.
Specifically, sliding windows are randomly placed on the motion signals of the four limbs respectively, and the optimal motion segments are found through subsequent reinforcement learning.
Step 3: and training four intelligent agents to respectively learn the current motion states corresponding to the sliding windows in the four-limb motion signals, and outputting strategy actions.
Specifically, each agent is a neural network formed by fully connected layers; it learns the current motion state of its corresponding sliding window and outputs strategy actions that adjust the sliding window to explore the optimal motion segment. A minimal sketch of such an agent is given below.
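The following is an illustrative sketch only: the patent specifies a fully connected network per limb but not its layer sizes, so the window length, hidden width and the three-way action head below are assumptions.

```python
import torch
import torch.nn as nn

class LimbAgent(nn.Module):
    """One policy agent per limb, reading the signal inside its own sliding window."""
    def __init__(self, window_len: int = 128, hidden: int = 64, n_actions: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(window_len, hidden),
            nn.ReLU(),
            nn.Linear(hidden, n_actions),   # e.g. shift-left / no-shift / shift-right
        )

    def forward(self, window_signal: torch.Tensor) -> torch.Tensor:
        # Returns the strategy-action probabilities for the current window state.
        return torch.softmax(self.net(window_signal), dim=-1)

# Four agents, one per limb.
agents = [LimbAgent() for _ in range(4)]
```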
Step 4: and training four teacher auxiliary networks to respectively learn potential motion states except the sliding window in the four-limb motion signals and output auxiliary actions.
It is noted that learning only the current motion state may make the agent's output actions random. Therefore, a potential-view pre-sensing strategy is introduced: a teacher auxiliary network is constructed to learn the potential views around the current motion state, so as to pre-sense the direction of the optimal motion segment and output auxiliary actions that guide the agent to output reliable actions, thereby finding the optimal motion segment more accurately.
As shown in FIG. 3, the teacher auxiliary network is a neural network constructed from fully connected layers, and a state loss and a state transition loss are constructed to train it. The state loss reflects the degree of fluctuation of the explored potential view, and the state loss function $L_{state}$ is:

$L_{state} = f_{std}(x_1, x_2, \ldots, x_k)$  (1)

where $f_{std}(\cdot)$ is a standard deviation function and $x_k$ is the value of the explored potential view at each time point.

The state transition loss $L_{trans}$ reflects the difference between the currently explored potential view $S_t$ and the previously explored potential view $S_{t-1}$, so as to promote wide exploration of the optimal motion segment.

The purpose of the teacher auxiliary network is to observe the potential views to pre-sense the direction of the optimal motion segment, so the optimization goal is to jointly maximize $L_{state}$ and $L_{trans}$. The joint loss can be formulated as:

$L_{joint} = -(\lambda_1 L_{state} + \lambda_2 L_{trans})$

where $\lambda_1$ and $\lambda_2$ are the weights of $L_{state}$ and $L_{trans}$, respectively. A gradient descent algorithm is used to minimize $L_{joint}$, i.e., to maximize $L_{state}$ and $L_{trans}$.
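A rough sketch of these losses is given below. The standard-deviation form of the state loss follows equation (1); the mean-absolute-difference form of the transition loss and the weighted negative sum for the joint objective are assumptions consistent with the stated goal of maximizing both terms by minimizing the joint loss.

```python
import torch

def state_loss(potential_view: torch.Tensor) -> torch.Tensor:
    # L_state = f_std(x_1, ..., x_k): fluctuation of the explored potential view.
    return potential_view.std()

def transition_loss(view_t: torch.Tensor, view_t_minus_1: torch.Tensor) -> torch.Tensor:
    # Difference between the current and the previous potential view (assumed L1 form).
    return (view_t - view_t_minus_1).abs().mean()

def joint_loss(view_t, view_t_minus_1, lam1: float = 1.0, lam2: float = 1.0):
    # Minimizing this joint loss maximizes both terms, as described in the text.
    return -(lam1 * state_loss(view_t) + lam2 * transition_loss(view_t, view_t_minus_1))
```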
Based on the results of each iteration of the teacher auxiliary network, a Gaussian distribution of each action category is constructed, the most appropriate action is selected as the auxiliary action, and a constraint strategy is used to evaluate the reliability of the auxiliary action.
Specifically, the input of the teacher auxiliary network is the potential view (the original motion signal outside the sliding window), and its output is the prediction probability of three action categories, namely left shift, right shift and no shift; this constitutes one iteration result of the teacher auxiliary network. The action with the highest probability is applied to the sliding window of the potential view, i.e., it adjusts the position of the potential view.
For the Gaussian distributions: as described above, each iteration result is the prediction probability of the three actions, so after n iterations of the teacher auxiliary network, n groups of action probabilities are generated, i.e., an n×3 array, where 3 is the number of action categories (left shift, right shift and no shift). For each action category, the n predicted probabilities (an n×1 vector) are used to construct a Gaussian distribution, so that three Gaussian distributions are obtained after n iterations, one per action category. The Gaussian distribution of each action category reflects the mean and variance of that action: the mean reflects the overall prediction probability of the action, and the variance reflects its uncertainty, because the larger the variance, the larger the variation of the action. Based on the means and variances of the three Gaussian distributions, the most appropriate action is selected as the auxiliary action to guide the agent's decision.
When the teacher auxiliary network iterates for the first time, a sliding window of the potential view is initialized on the original motion signal; in this first iteration, the sliding window of the potential view corresponds to the sliding window of the current view. The motion signal within this window is fed into the teacher auxiliary network, which outputs the probabilities of the three action categories through a softmax layer, and one of the three actions is then randomly selected to adjust the sliding window of the potential view. In the second iteration, the input to the teacher auxiliary network is the signal in the adjusted sliding window of the potential view, i.e., the potential view adjusted by the action output in the first iteration. Thus the input of each iteration is the signal within a different potential-view sliding window: the input of the current iteration has been adjusted by the action output by the teacher auxiliary network in the previous iteration. The action probabilities output in every iteration are retained, yielding n groups of action probabilities.
The Gaussian distributions are constructed after the n iterations are completed, from the n groups of action probabilities. Specifically, there are three actions when adjusting the sliding window: left shift, no shift and right shift. In each iteration these three actions have different probabilities. After n iterations, the n left-shift probabilities construct one Gaussian distribution, the n no-shift probabilities construct another, and the n right-shift probabilities construct the third. A sketch of this construction and of the auxiliary-action selection is given below.
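A minimal sketch of building one Gaussian per action class from the n×3 probability history described above. The per-class mean and standard deviation follow the text; the selection rule (highest mean, with the standard deviation kept for the later reliability check) is an assumption.

```python
import numpy as np

def action_gaussians(prob_history: np.ndarray):
    """prob_history: shape (n, 3), one row of (left, none, right) probabilities per iteration."""
    mu = prob_history.mean(axis=0)      # overall prediction probability of each action class
    sigma = prob_history.std(axis=0)    # uncertainty of each action class
    return mu, sigma

def pick_auxiliary_action(prob_history: np.ndarray):
    mu, sigma = action_gaussians(prob_history)
    j = int(np.argmax(mu))              # assumed rule: most probable class on average
    return j, mu[j], sigma[j]
```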
Step 5: based on the strategy action and the auxiliary action, an action constraint strategy is constructed, a guiding action is obtained to adjust a sliding window, and the optimal fragment exploration is carried out.
The auxiliary action output by the teacher auxiliary network is not necessarily appropriate, and further constraints need to be placed on its guidance of the agent; that is, the agent will give up the teacher network's advice when the auxiliary action output by the teacher auxiliary network is not sufficiently confident.
Specifically, the constraint function $f_{cons}$ compares the auxiliary action with the policy action from the policy network (i.e., the agent) to determine the reliability of the auxiliary action, thereby constructing a guiding action for the agent's decision. The auxiliary action $a_{auxiliary}$ is selected from the Gaussian distributions of the action categories, where $j$ is the action index, $\mu_j$ is the expected value of action $j$ (reflecting its overall probability), and $\sigma_j$ is the standard deviation of its Gaussian distribution (reflecting its uncertainty).

The constraint function $f_{cons}$ is defined as:

$a_{guided} = f_{cons}(a_{auxiliary}, a_{policy})$  (6)

where $a_{policy}$ is the action from the policy network (i.e., the agent), $\sigma^2_{auxiliary}$ is the variance of the auxiliary action's Gaussian distribution, and $\zeta$ is the auxiliary-action confidence value. From equation (5), when $\sigma^2_{auxiliary}$ is large, the agent does not take the auxiliary action but selects the policy action instead. This means that the agent discards the teacher network's advice when the auxiliary action output by the teacher auxiliary network is not sufficiently confident.
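A sketch of the action-constraint rule implied by the description: when the variance of the auxiliary action's Gaussian exceeds the confidence value ζ, the teacher's advice is discarded and the agent's own policy action is used. The piecewise form below is inferred from the surrounding text, not copied from the unreproduced equation (5).

```python
def constrain_action(a_auxiliary: int, a_policy: int, sigma_aux: float, zeta: float) -> int:
    """Return the guiding action a_guided = f_cons(a_auxiliary, a_policy)."""
    if sigma_aux ** 2 > zeta:   # auxiliary action not confident enough
        return a_policy
    return a_auxiliary          # confident teacher advice guides the agent
```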
Step 6: the team rewards of four agents at each exploration and the intrinsic rewards of each agent are maximized, so that the multi-agent joint exploration of the optimal segment of the limb movement signals is restrained.
In order to evaluate the effect of multi-agent reinforcement learning while searching for the optimal movement segments, the currently explored movement segments of the four limbs and the original motion signals are fed into a group network for training, and a team reward is built from the test results; the better the test results, the larger the team reward. In order to quantify the contribution of each agent in exploring the optimal segment, an individual evaluation mechanism calculates the degree of signal fluctuation of the individual state and the agent's training effect on the individual state, yielding a state reward and an individual reward; these two rewards form the intrinsic reward. When optimizing the multi-agent reinforcement learning, a gradient ascent strategy is adopted to maximize the team reward and the corresponding intrinsic reward of each agent.
Specifically, as shown in FIG. 4, the team-intrinsic rewards include the team reward generated by the group network and the intrinsic reward generated by the individual assessment mechanism, so as to reflect the overall movement state of the limbs during exploration and to assess the individual state of each limb.
The team reward is generated from the currently explored limb movement segments and the global movement signals of the limbs, and is used to evaluate the team performance of the multiple agents. The currently explored movement segments and the global movement signals are input into the group network simultaneously; the group network is trained and outputs the prediction probability on the test set, so as to evaluate the currently explored movement segments of the limbs.
The group network consists of 8 encoders and a classifier; the encoders are convolutional networks and the classifier consists of fully connected layers. The group network has 8 inputs, namely the global motion signals of the limbs and the limb movement segments being explored by the multiple agents. The 8 inputs are fed into the 8 encoders respectively, the 8 output features are concatenated, and the concatenated features are sent to the classifier, yielding the prediction probability. A minimal sketch of such a group network is given below.
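A hedged PyTorch sketch of the group network: 8 convolutional encoders (4 explored segments plus 4 global limb signals) whose features are concatenated and classified. Kernel sizes, channel counts and the feature width are assumptions not given in the text.

```python
import torch
import torch.nn as nn

class GroupNetwork(nn.Module):
    def __init__(self, n_inputs: int = 8, n_classes: int = 3, feat: int = 32):
        super().__init__()
        self.encoders = nn.ModuleList([
            nn.Sequential(
                nn.Conv1d(1, feat, kernel_size=5, padding=2),
                nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),
            )
            for _ in range(n_inputs)
        ])
        self.classifier = nn.Linear(n_inputs * feat, n_classes)

    def forward(self, signals):
        # signals: list of 8 tensors, each of shape (batch, 1, length_i)
        feats = [enc(x).flatten(1) for enc, x in zip(self.encoders, signals)]
        return torch.softmax(self.classifier(torch.cat(feats, dim=1)), dim=-1)
```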
For the test set, each sample consists of the 8 inputs mentioned above, and each sample has one label. There are 3 label categories, namely agitated movement, normal movement and movement when calm (i.e., essentially motionless).
The team reward function calculates the team reward based on the prediction probability of the group network on the test set, and is defined as:

$R_i = f_{team}(x_i)$  (8)

where $x_i$ is the sample data, i.e., the input data of the group network, $p\{y \mid x_i\}$ is the predicted probability of each label, $m$ is the number of labels, $p\{y_{true} \mid x_i\}$ is the predicted probability of the true label, and $n$ is the number of samples.
The proposed individual assessment mechanism is used to generate intrinsic rewards that quantify the contribution of each agent and facilitate collaboration among the multiple agents. The state reward function evaluates the degree of fluctuation of the individual's state to determine whether the currently explored motion segment is representative, thereby generating a state reward. Each agent also learns the individual state changed by its action to determine whether its current decision is reliable, thereby generating an individual reward. The state reward may be calculated by:

$r^{z}_{state} = f_{std}(s_z)$

where $f_{std}(\cdot)$ is the standard deviation function, $z$ is the agent index, and $s_z$ is the state of the $z$-th agent.
The individual reward is obtained from each agent's prediction of its own state. When the predicted individual-state label changes from incorrect to correct after one iteration, a strong reward $\Omega$ is given; if it changes from correct to incorrect, a strong penalty $-\Omega$ is given. Thus the final form of the individual reward is $\Omega$ when the prediction flips from incorrect to correct, $-\Omega$ when it flips from correct to incorrect, and $r_0$ otherwise, where $\Omega$ is a positive integer and $r_0$ is the reward when the predicted state label has not changed.

$r_0$ is defined as:

$r_0 = \mathrm{sgn}(p_{e,c} - p_{e-1,c})$  (12)

where $\mathrm{sgn}(\cdot)$ is the sign function, so the reward $r_0$ takes values in $\{-1, 1\}$; $c$ is the true label of the individual state, and $p_{e,c}$ represents the probability that the individual state is category $c$ at iteration $e$.
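A sketch of the individual reward described above: a strong reward for a wrong-to-correct flip of the predicted state label, a strong penalty for the opposite flip, and the sign of the change in confidence in the true class otherwise. The default value of Ω is an illustrative assumption.

```python
import numpy as np

def individual_reward(prev_pred: int, curr_pred: int, true_label: int,
                      p_prev_c: float, p_curr_c: float, omega: int = 5) -> float:
    if prev_pred != true_label and curr_pred == true_label:
        return float(omega)          # incorrect -> correct: strong reward
    if prev_pred == true_label and curr_pred != true_label:
        return float(-omega)         # correct -> incorrect: strong penalty
    # r_0 = sgn(p_{e,c} - p_{e-1,c}): did confidence in the true class rise or fall?
    return float(np.sign(p_curr_c - p_prev_c))
```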
The total reward of each agent is the sum of the team reward and its corresponding intrinsic reward, and is ultimately used to optimize the strategy of each agent:

$R^{z}_{total} = R_{team} + r^{z}_{intrinsic}$

The total reward of each agent is maximized to learn the parameters $\theta$ of its policy function. The derivative of the objective function $J(\theta)$ may be approximated as:

$\nabla_{\theta} J(\theta) \approx \frac{1}{E}\sum_{e=1}^{E} R_{total}\,\nabla_{\theta}\log \pi_{\theta}(a_e \mid s_e)$

where $e$ denotes the $e$-th step and $E$ the total number of training steps, and $a$ and $s$ denote actions and states, respectively. $\theta_z$, the parameters of agent $z$, can then be updated with a learning rate $\eta$:

$\theta_z \leftarrow \theta_z + \eta\,\nabla_{\theta_z} J(\theta_z)$
through iterative training, the total rewards of each agent are maximized to extract the optimal motion segment from the global motion signal.
Step 7: and respectively obtaining the optimal motion segment in the motion signal of each limb.
Step 8: and sending the optimal motion segment and the original motion of the limbs into a multi-channel encoder, and extracting global local features.
The optimal motion segments of the four limbs are fed into four convolutional networks respectively to obtain the local features of the motion signals. The original motion signals of the four limbs are fed into four convolutional networks respectively to obtain the global features of the motion signals. The global features and the local features are cascaded to obtain the global-local features, which characterize both the time dependence and the locally varying nature of the motion signal. A sketch of this multi-channel encoding is given below.
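A sketch of the multi-channel encoder of step 8: one convolutional encoder per limb for the optimal segment (local features) and one per limb for the original signal (global features), concatenated into the global-local feature. The encoder shapes are assumptions.

```python
import torch
import torch.nn as nn

def conv_encoder(feat: int = 16) -> nn.Module:
    return nn.Sequential(
        nn.Conv1d(1, feat, kernel_size=5, padding=2),
        nn.ReLU(),
        nn.AdaptiveAvgPool1d(1),
        nn.Flatten(),
    )

local_encoders = nn.ModuleList([conv_encoder() for _ in range(4)])
global_encoders = nn.ModuleList([conv_encoder() for _ in range(4)])

def global_local_features(segments, originals):
    # segments / originals: lists of 4 tensors, each of shape (batch, 1, length)
    local = [enc(x) for enc, x in zip(local_encoders, segments)]
    global_ = [enc(x) for enc, x in zip(global_encoders, originals)]
    return torch.cat(local + global_, dim=1)   # cascaded global-local feature
```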
Step 9: the coded global local features are discrete wavelets, recursive diagrams and gladhand corner fields, constructing multi-domain features.
The global-local features in the time domain are encoded into frequency-domain features (discrete wavelets) and image-domain features (recurrence plots and Gramian angular fields), which facilitates learning the features of the motion signal from a multi-domain perspective. A sketch of these encodings is given below.
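A minimal numpy/PyWavelets sketch of the three encodings named in step 9. The wavelet family, the recurrence threshold and the GAF variant (summation field) are assumptions; the patent only names the transforms.

```python
import numpy as np
import pywt

def discrete_wavelet(x: np.ndarray):
    # One-level discrete wavelet transform -> (approximation, detail) coefficients.
    return pywt.dwt(x, "db4")

def recurrence_plot(x: np.ndarray, eps: float = 0.1) -> np.ndarray:
    # Binary recurrence matrix from pairwise distances between time points.
    d = np.abs(x[:, None] - x[None, :])
    return (d < eps).astype(float)

def gramian_angular_field(x: np.ndarray) -> np.ndarray:
    # Rescale to [-1, 1], map to angles, and build the Gramian angular summation field.
    x_scaled = 2 * (x - x.min()) / (x.max() - x.min() + 1e-8) - 1
    phi = np.arccos(np.clip(x_scaled, -1, 1))
    return np.cos(phi[:, None] + phi[None, :])
```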
Step 10: four basic classifiers are trained based on global local features, discrete wavelets, a recursive graph, and a glamer angle field, respectively.
The global-local features are fed directly into a fully connected layer to obtain a classification result. The discrete wavelets, the recurrence plots and the Gramian angular fields are fed into three convolutional networks respectively to obtain three further groups of classification results; thus, classification results of four basic classifiers corresponding to the four domain features are obtained.
Step 11: the weights of the global local features, the discrete wavelets, the recursive diagram and the gram angle field are updated according to the performance of each classifier.
The weight of each sample in each domain is updated according to the classification of each sample in each domain. When samples are misclassified in the base classifier, their weights will increase, while the weights of correctly classified samples will decrease. Accordingly, the retraining process focuses more on those samples that were misclassified.
Specifically, the weight of each sample in each feature domain is adjusted based on the prediction results of the underlying classifier trained by the multi-domain features. The updated multi-domain features are trained, so that the retraining process focuses more on the erroneously classified samples to obtain an amplified classifier, and the diversity of the classifier is increased.
Step 12: four new classifiers are trained based on the multi-domain features of the updated weights to obtain four amplified classifiers to increase the diversity of the classifiers.
For a training dataset $T = \{(F_1, y_1), (F_2, y_2), \ldots, (F_n, y_n)\}$ with $n$ sample features and three categories, where $F_i$ ($i = 1, 2, \ldots, n$) is the feature of the $i$-th sample and $y_i$ represents one of the three categories of motion (agitation, normal motion, and calm motion, respectively), the weights of the training dataset may be initialized uniformly, e.g., $w_i = 1/n$.

Here $F_i$ refers to the global-local feature, i.e., the concatenated feature extracted by the group network from the 8 inputs formed by the optimal motion segments of the limbs and the global motion signals of the limbs.
The voting weight $a$ of the base classifier $T(x)$ is then calculated. The voting weight increases as the error rate decreases; that is, a superior base classifier is assigned a higher voting weight.

The weights of the training dataset are then updated. In the weight update formula, $T(F_i)$ refers to the predicted category of the $i$-th sample's feature and $y_i$ is the true label of the $i$-th sample. When $T(F_i) = y_i$, i.e., the prediction for the $i$-th sample is correct, the sample weight decreases; conversely, the sample weight increases.
When samples are misclassified in the base classifier, their weights will increase, while the weights of correctly classified samples will decrease. Accordingly, the retraining process focuses more on those samples that were misclassified.
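A hedged sketch of the sample-weight update of steps 11-12. The patent describes the behaviour (misclassified samples gain weight, correctly classified samples lose weight) but the exact formula is not reproduced here; the exponential, AdaBoost-style update below is a stand-in with that behaviour.

```python
import numpy as np

def update_sample_weights(weights: np.ndarray, preds: np.ndarray, labels: np.ndarray,
                          alpha: float) -> np.ndarray:
    """Decrease weights of correctly classified samples, increase weights of misclassified ones."""
    correct = (preds == labels)
    new_w = weights * np.exp(np.where(correct, -alpha, alpha))
    return new_w / new_w.sum()   # renormalize so the weights remain a distribution
```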
Step 13: and circularly screening the candidate basic classifier combinations until the integration performance of the final screened basic classifier combinations is greater than the threshold value, comparing all candidate basic classifier combinations and selecting the optimal basic classifier combination.
The classifier screening threshold is calculated based on the performance of each classifier of the current base classifier combination, classifiers with performance below the threshold are discarded, classifiers with performance above the threshold are retained, the retained classifier combination is considered as a candidate classifier combination, and the threshold is updated based on the retained classifier combination until each classifier performance of the final screened classifier combination is above its corresponding threshold. And comparing all candidate classifier combinations, and selecting the classifier combination with the optimal integration performance.
The self-contrast integrated learning process is as follows: for the basic classifier, an initial threshold is calculated according to the performances of all the basic classifiers, the initial threshold is used for screening high-performance classifiers, the screened classifiers form a candidate classifier combination 1, the threshold of the candidate classifier combination 1 is calculated, and the screened classifiers form a candidate classifier combination 2 until all the classifiers in the final candidate classifier combination are larger than the corresponding threshold. And comparing all candidate classifier combinations, and selecting the classifier combination with the optimal integration performance. The screening process of the optimal amplification classifier combination is the same as the basic classifier.
Specifically, for a classifier combination having $p$ members $[c_1, c_2, \ldots, c_p]$, the accuracy of each classifier is taken as its corresponding weight, giving a weighted classifier combination, and the ensemble performance $E_p$ and the screening threshold $\gamma_p$ of the $p$ classifiers are calculated. The classifiers whose accuracy is higher than the screening threshold are selected from the $p$ classifiers, the ensemble accuracy of the selected classifiers is calculated, and the screening is iterated until the accuracy of every classifier in the last screened combination is higher than its corresponding screening threshold.

In the corresponding formulas, $f_j(F_i^j)$ is the prediction of the $j$-th classifier for the $i$-th sample feature, $w_j$ is the weight of the $j$-th classifier, and $\Phi(F_i^j)$ represents the ensemble prediction of the $p$ classifiers; the ensemble performance is measured by the number of correctly predicted samples in the ensemble prediction results.
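A sketch of the self-contrast screening loop of step 13: repeatedly drop classifiers whose accuracy falls below the current combination's screening threshold, keep each surviving combination as a candidate, and finally return the candidate with the best ensemble performance. The threshold rule (mean accuracy of the current combination) and the ensemble scorer are assumptions.

```python
def self_contrast_screen(classifiers, accuracy, ensemble_score):
    """classifiers: list of classifier ids; accuracy(c): per-classifier accuracy;
    ensemble_score(combo): ensemble accuracy of a combination."""
    candidates = []
    combo = list(classifiers)
    while combo:
        threshold = sum(accuracy(c) for c in combo) / len(combo)   # assumed screening threshold
        kept = [c for c in combo if accuracy(c) >= threshold]
        if kept:
            candidates.append(kept)
        if len(kept) == len(combo):   # every member already above the threshold: stop
            break
        combo = kept
    return max(candidates, key=ensemble_score)   # optimal combination by ensemble performance
```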
Step 14: and circularly screening the candidate amplification classifier until the integration performance of the finally screened amplification classifier combination is greater than the threshold value, comparing all candidate amplification classifier combinations and selecting the optimal amplification classifier combination.
Step 15: and integrating the optimal basic classifier combination and the optimal amplification classifier combination to obtain a classification result.
The method effectively alleviates the negative effects caused, under the traditional multi-agent reinforcement learning framework, by the agents' random decisions and by the single team-reward distribution mechanism, as well as by neglecting the trade-off between classifier diversity and performance in ensemble learning.
Compared with traditional multi-agent reinforcement learning, the method introduces a teacher auxiliary strategy that learns the potential view to pre-sense the direction of the optimal state (the optimal motion segment) in the state space, and introduces an intrinsic reward that quantifies the contribution of each agent to promote cooperation among the multiple agents, so that the performance of optimal-state exploration is significantly improved. Compared with traditional ensemble learning, the self-contrast ensemble learning algorithm is designed to achieve a sufficient trade-off between classifier diversity and performance.
Example 2
It is an object of the present embodiment to provide a computing device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the program:
acquiring original motion signals of different limbs;
constructing agent individuals corresponding to different limbs, and learning the original movement signals of the limbs within the sliding window based on a reinforcement learning algorithm by utilizing the agent individuals, to obtain strategy actions;
respectively learning original motion signals of different limbs outside the sliding window by using a teacher auxiliary network to obtain auxiliary actions;
adjusting the sliding window based on the obtained strategy actions and auxiliary actions of different limbs to obtain optimal motion signals of different limbs;
respectively encoding according to the original motion signals and the optimal motion signals of different limbs to obtain multi-domain characteristics;
and inputting the obtained multi-domain features into a trained classifier to obtain a recognition result.
Example 3
An object of the present embodiment is to provide a computer-readable storage medium.
A computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring original motion signals of different limbs;
constructing agent individuals corresponding to different limbs, and learning the original movement signals of the limbs within the sliding window based on a reinforcement learning algorithm by utilizing the agent individuals, to obtain strategy actions;
respectively learning original motion signals of different limbs outside the sliding window by using a teacher auxiliary network to obtain auxiliary actions;
adjusting the sliding window based on the obtained strategy actions and auxiliary actions of different limbs to obtain optimal motion signals of different limbs;
respectively encoding according to the original motion signals and the optimal motion signals of different limbs to obtain multi-domain characteristics;
and inputting the obtained multi-domain features into a trained classifier to obtain a recognition result.
The steps involved in the devices of the second and third embodiments correspond to those of the first embodiment of the method, and the detailed description of the embodiments can be found in the related description section of the first embodiment. The term "computer-readable storage medium" should be taken to include a single medium or multiple media including one or more sets of instructions; it should also be understood to include any medium capable of storing, encoding or carrying a set of instructions for execution by a processor and that cause the processor to perform any one of the methods of the present invention.
It will be appreciated by those skilled in the art that the modules or steps of the invention described above may be implemented by general-purpose computer means, alternatively they may be implemented by program code executable by computing means, whereby they may be stored in storage means for execution by computing means, or they may be made into individual integrated circuit modules separately, or a plurality of modules or steps in them may be made into a single integrated circuit module. The present invention is not limited to any specific combination of hardware and software.
While the foregoing description of the embodiments of the present invention has been presented in conjunction with the drawings, it should be understood that it is not intended to limit the scope of the invention, but rather, it is intended to cover all modifications or variations within the scope of the invention as defined by the claims of the present invention.

Claims (8)

1. A limb movement recognition system based on multi-agent reinforcement learning, comprising:
an acquisition module configured to: acquiring original motion signals of different limbs;
an agent module configured to: constructing agent individuals corresponding to different limbs, and learning the original movement signals of the limbs within the sliding window based on a reinforcement learning algorithm by utilizing the agent individuals, to obtain strategy actions;
the agent is a neural network consisting of fully connected layers;
constructing a total reward function for each agent individual in the agent module, the agent individual learning and exploring the original motion signal by utilizing the constructed total reward function;
wherein the total reward function includes a team reward function and an intrinsic reward function; the team reward function is designed based on the motion category prediction probabilities obtained from the motion signals currently explored by all agent individuals and the original motion signals of all limbs; the intrinsic reward function is designed based on the correctness of the strategy action obtained by an agent individual from the motion signal within its current sliding window;
a teacher assist module configured to: respectively learning original motion signals of different limbs outside the sliding window by using a teacher auxiliary network to obtain auxiliary actions;
the teacher auxiliary network is a neural network constructed from fully connected layers;
an exploration module configured to: adjusting the sliding window based on the obtained strategy actions and auxiliary actions of different limbs to obtain optimal motion signals of different limbs;
in the exploration module, constructing a constraint function according to the obtained strategy action and the corresponding auxiliary action, guiding and adjusting the sliding window, and carrying out the optimal segment exploration;
constructing a Gaussian distribution of the auxiliary action, constructing a constraint function through comparison of the variance of the auxiliary action's Gaussian distribution with an auxiliary-action confidence value, and guiding through the constraint function whether the agent individual takes the auxiliary action;
specifically, the constraint function $f_{cons}$ compares the auxiliary action with the strategy action from the strategy network, i.e., the agent, so as to judge the reliability of the auxiliary action, wherein the auxiliary action is selected from the Gaussian distributions of the action categories, $j$ being the action index, $\mu_j$ the expected value of action $j$, which reflects its overall probability, and $\sigma_j$ the standard deviation of the Gaussian distribution, which reflects the uncertainty of action $j$;
the constraint function $f_{cons}$ is defined as:
$a_{guided} = f_{cons}(a_{auxiliary}, a_{policy})$
wherein $a_{policy}$ is the action from the strategy network, i.e., the agent, $\sigma^2_{auxiliary}$ is the variance of the auxiliary action's Gaussian distribution, and $\zeta$ is the confidence value of the auxiliary action;
a multi-domain feature module configured to: respectively encoding according to the original motion signals and the optimal motion signals of different limbs to obtain multi-domain characteristics;
an identification module configured to: and inputting the obtained multi-domain features into a trained classifier to obtain a recognition result.
2. The limb movement recognition system based on multi-agent reinforcement learning of claim 1, wherein in the teacher assisting module, a state loss function and a state transfer loss function are constructed to guide a teacher assisting network to explore movement signals outside the sliding window; constructing a state loss function according to the fluctuation degree of potential motion signals outside the sliding window currently explored by the teacher auxiliary network; and constructing a state transition loss function according to the difference between the potential motion signals outside the sliding window currently explored by the teacher-assisted network and the potential motion signals outside the sliding window explored in the previous step.
3. The limb movement recognition system based on multi-agent reinforcement learning according to claim 1, wherein in the multi-domain feature module, the optimal movement signals and the original movement signals of the different limbs are input into a multi-channel encoder to obtain global-local features; the global-local features are encoded to obtain discrete wavelets, a recurrence plot, and a Gramian angular field.
4. A multi-agent reinforcement learning based limb movement recognition system as recited in claim 3, further comprising: a classifier screening module configured to: respectively training different basic classifiers by using the obtained sample data of the global-local features, discrete wavelets, recurrence plots and Gramian angular fields;
according to the prediction results of the basic classifiers, the weight of each sample datum is adjusted, and different new classifiers are respectively trained on the weight-updated global-local feature, discrete-wavelet, recurrence-plot and Gramian-angular-field sample data, to obtain trained amplified classifiers;
screening the basic classifiers based on performance to obtain candidate basic classifier combinations, calculating the integration performance of the candidate basic classifier combinations, and carrying out iterative screening on the candidate basic classifier combinations until the accuracy of each basic classifier in the screened basic classifier combinations is higher than a corresponding screening threshold value to obtain optimal basic classifier combinations;
screening the amplified classifiers based on performance to obtain candidate amplified classifier combinations, calculating the integration performance of the candidate amplified classifier combinations, and iteratively screening the candidate amplified classifier combinations until the accuracy of each amplified classifier in the screened amplified classifier combination is higher than the corresponding screening threshold, to obtain an optimal amplified classifier combination;
and identifying according to the obtained optimal basic classifier combination and the optimal amplified classifier combination.
5. The limb movement recognition system based on multi-agent reinforcement learning according to claim 2, wherein in the teacher assisting module, in each iteration of the teacher auxiliary network, a Gaussian distribution of each movement direction is constructed according to the prediction probabilities of the movement directions obtained from the currently explored movement signal, and the optimal action is selected as the auxiliary action according to the mean and variance of the constructed Gaussian distributions.
6. The limb movement recognition system based on multi-agent reinforcement learning according to claim 1, wherein in the agent module, a limb movement signal and an original movement signal explored by a current agent are input into a constructed group network, and a group rewarding function is constructed according to a movement category probability correspondingly output by the group network.
7. A computer device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory in communication over the bus when the computer device is running, the machine-readable instructions when executed by the processor performing the steps of:
acquiring original motion signals of different limbs;
constructing agent individuals corresponding to different limbs, and learning the original movement signals of the limbs within the sliding window based on a reinforcement learning algorithm by utilizing the agent individuals, to obtain strategy actions;
the agent is a neural network consisting of fully connected layers;
constructing a total reward function for each agent individual in the agent module, the agent individual learning and exploring the original motion signal by utilizing the constructed total reward function;
wherein the total reward function includes a team reward function and an intrinsic reward function; the team reward function is designed based on the motion category prediction probabilities obtained from the motion signals currently explored by all agent individuals and the original motion signals of all limbs; the intrinsic reward function is designed based on the correctness of the strategy action obtained by an agent individual from the motion signal within its current sliding window;
respectively learning original motion signals of different limbs outside the sliding window by using a teacher auxiliary network to obtain auxiliary actions;
the teacher auxiliary network is a neural network constructed from fully connected layers;
adjusting the sliding window based on the obtained strategy actions and auxiliary actions of different limbs to obtain optimal motion signals of different limbs;
based on the strategy action and the auxiliary action, constructing an action constraint strategy, obtaining a guiding action to adjust the sliding window, and carrying out the optimal segment exploration;
constructing a Gaussian distribution of the auxiliary action, constructing a constraint function through comparison of the variance of the auxiliary action's Gaussian distribution with an auxiliary-action confidence value, and guiding through the constraint function whether the agent individual takes the auxiliary action;
specifically, the constraint function $f_{cons}$ compares the auxiliary action with the strategy action from the strategy network, i.e., the agent, so as to judge the reliability of the auxiliary action, wherein the auxiliary action is selected from the Gaussian distributions of the action categories, $j$ being the action index, $\mu_j$ the expected value of action $j$, which reflects its overall probability, and $\sigma_j$ the standard deviation of the Gaussian distribution, which reflects the uncertainty of action $j$;
the constraint function $f_{cons}$ is defined as:
$a_{guided} = f_{cons}(a_{auxiliary}, a_{policy})$
wherein $a_{policy}$ is the action from the strategy network, i.e., the agent, $\sigma^2_{auxiliary}$ is the variance of the auxiliary action's Gaussian distribution, and $\zeta$ is the confidence value of the auxiliary action;
respectively encoding according to the original motion signals and the optimal motion signals of different limbs to obtain multi-domain characteristics;
and inputting the obtained multi-domain features into a trained classifier to obtain a recognition result.
8. A computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, the computer program when executed by a processor performs the operations of:
acquiring original motion signals of different limbs;
constructing agent individuals corresponding to different limbs, and learning the original movement signals of the limbs within the sliding window based on a reinforcement learning algorithm by utilizing the agent individuals, to obtain strategy actions;
the agent is a neural network consisting of fully connected layers;
constructing a total reward function for each agent individual in the agent module, the agent individual learning and exploring the original motion signal by utilizing the constructed total reward function;
wherein the total reward function includes a team reward function and an intrinsic reward function; the team reward function is designed based on the motion category prediction probabilities obtained from the motion signals currently explored by all agent individuals and the original motion signals of all limbs; the intrinsic reward function is designed based on the correctness of the strategy action obtained by an agent individual from the motion signal within its current sliding window;
respectively learning original motion signals of different limbs outside the sliding window by using a teacher auxiliary network to obtain auxiliary actions;
the teacher auxiliary network is a neural network constructed from fully-connected layers;
adjusting the sliding window based on the obtained strategy actions and auxiliary actions of different limbs to obtain optimal motion signals of different limbs;
based on the strategy action and the auxiliary action, constructing an action constraint strategy, and obtaining a guiding action to adjust the sliding window and perform optimal segment exploration;
constructing a Gaussian distribution of the auxiliary action, constructing a constraint function by comparing the variance of the Gaussian distribution of the auxiliary action with an auxiliary action confidence value, and guiding the agent individual, through the constraint function, on whether to take the auxiliary action;
specifically, the constraint function f_cons compares the auxiliary action with the strategy action from the policy network, i.e. the agent individual, so as to judge the reliability of the auxiliary action; the Gaussian distribution of the auxiliary action is defined as:
a_auxiliary,j ~ N(μ_j, σ_j²)
where j is the action index, μ_j is the expected value of the action, which reflects the overall probability of action j, and σ_j is the standard deviation of the Gaussian distribution, which reflects the uncertainty of action j;
the constraint function f_cons is defined as:
a_guided = f_cons(a_auxiliary, a_policy)
where a_policy is the action from the policy network, i.e. the agent individual, σ_j² is the variance of the Gaussian distribution of the auxiliary action, and ζ is the confidence value of the auxiliary action;
encoding the original motion signals and the optimal motion signals of the different limbs respectively to obtain multi-domain features;
and inputting the obtained multi-domain features into a trained classifier to obtain a recognition result.
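As a hedged illustration of the total reward function recited in the claims above, the sketch below combines a team reward (based on the motion category prediction probabilities of the currently explored signals versus the original signals of all limbs) and an intrinsic reward (based on the correctness of an agent individual's strategy action). The difference form of the team reward, the ±1 intrinsic reward, and the weights are assumptions made for illustration only.

```python
import numpy as np

def team_reward(probs_explored, probs_full, true_label):
    """Assumed team reward: gain in the predicted probability of the true motion
    category when classifying the currently explored segments of all limbs
    compared with classifying the full original motion signals."""
    return probs_explored[true_label] - probs_full[true_label]

def intrinsic_reward(strategy_action_correct):
    """Assumed intrinsic reward: +1 if the agent individual's strategy action on
    the current sliding window is judged correct, otherwise -1."""
    return 1.0 if strategy_action_correct else -1.0

def total_reward(probs_explored, probs_full, true_label,
                 strategy_action_correct, w_team=1.0, w_intrinsic=0.5):
    """Total reward as a weighted sum of the team and intrinsic rewards;
    the weights are illustrative."""
    return (w_team * team_reward(probs_explored, probs_full, true_label)
            + w_intrinsic * intrinsic_reward(strategy_action_correct))

# Hypothetical three-class recognition example; all values are illustrative.
probs_explored = np.array([0.2, 0.7, 0.1])  # classifier output on explored segments
probs_full = np.array([0.3, 0.5, 0.2])      # classifier output on the raw signals
r = total_reward(probs_explored, probs_full, true_label=1, strategy_action_correct=True)
print(r)
```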
CN202311104223.8A 2023-08-30 2023-08-30 Limb movement recognition system based on multi-agent reinforcement learning Active CN116975695B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311104223.8A CN116975695B (en) 2023-08-30 2023-08-30 Limb movement recognition system based on multi-agent reinforcement learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311104223.8A CN116975695B (en) 2023-08-30 2023-08-30 Limb movement recognition system based on multi-agent reinforcement learning

Publications (2)

Publication Number Publication Date
CN116975695A CN116975695A (en) 2023-10-31
CN116975695B true CN116975695B (en) 2024-03-19

Family

ID=88471472

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311104223.8A Active CN116975695B (en) 2023-08-30 2023-08-30 Limb movement recognition system based on multi-agent reinforcement learning

Country Status (1)

Country Link
CN (1) CN116975695B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070016265A1 (en) * 2005-02-09 2007-01-18 Alfred E. Mann Institute For Biomedical Engineering At The University Of S. California Method and system for training adaptive control of limb movement
US11227248B2 (en) * 2018-08-21 2022-01-18 International Business Machines Corporation Facilitation of cognitive conflict resolution between parties
US20220188623A1 (en) * 2020-12-10 2022-06-16 Palo Alto Research Center Incorporated Explainable deep reinforcement learning using a factorized function

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110852273A (en) * 2019-11-12 2020-02-28 重庆大学 Behavior identification method based on reinforcement learning attention mechanism
CN114663798A (en) * 2022-01-12 2022-06-24 上海人工智能创新中心 Single-step video content identification method based on reinforcement learning
CN115018017A (en) * 2022-08-03 2022-09-06 中国科学院自动化研究所 Multi-agent credit allocation method, system and equipment based on ensemble learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CARL: Controllable agent with reinforcement learning for quadruped locomotion; YingSheng Luo et al.; ACM Transactions on Graphics; Vol. 39, No. 4; pp. 1-10 *
Research on multi-agent cooperation strategies based on reinforcement learning; Liang Chen; China Master's Theses Full-text Database, Information Science and Technology Series (Monthly), Automation Technology; pp. I140-80 *

Also Published As

Publication number Publication date
CN116975695A (en) 2023-10-31

Similar Documents

Publication Publication Date Title
CN110674323B (en) Unsupervised cross-modal Hash retrieval method and system based on virtual label regression
Huang et al. Multivariate time series early classification using multi-domain deep neural network
CN113344206A (en) Knowledge distillation method, device and equipment integrating channel and relation feature learning
Zeng et al. Knowledge based activity recognition with dynamic bayesian network
CN111160462A (en) Unsupervised personalized human activity recognition method based on multi-sensor data alignment
CN114756687A (en) Self-learning entity relationship combined extraction-based steel production line equipment diagnosis method
CN112734037A (en) Memory-guidance-based weakly supervised learning method, computer device and storage medium
CN113378178B (en) Deep learning-based graph self-confidence learning software vulnerability detection method
Stracuzzi et al. Quantifying Uncertainty to Improve Decision Making in Machine Learning.
CN116975695B (en) Limb movement recognition system based on multi-agent reinforcement learning
CN115392474B (en) Local perception graph representation learning method based on iterative optimization
CN111553152A (en) Question generation method and device and question-text pair generation method and device
CN117012370A (en) Multi-mode disease auxiliary reasoning system, method, terminal and storage medium
CN116208399A (en) Network malicious behavior detection method and device based on metagraph
CN116680578A (en) Cross-modal model-based deep semantic understanding method
CN115906846A (en) Document-level named entity identification method based on double-graph hierarchical feature fusion
CN114998731A (en) Intelligent terminal navigation scene perception identification method
CN112989088B (en) Visual relation example learning method based on reinforcement learning
CN111796173B (en) Partial discharge pattern recognition method, computer device, and storage medium
CN112347826B (en) Video continuous sign language recognition method and system based on reinforcement learning
Swamy et al. MultiModN—Multimodal, Multi-Task, Interpretable Modular Networks
CN114118146A (en) Rolling bearing fault diagnosis method and system based on belief rule base
CN114171206A (en) Model training method, sensing disease prediction method, device, equipment and storage medium
CN113569867A (en) Image processing method and device, computer equipment and storage medium
CN116070120B (en) Automatic identification method and system for multi-tag time sequence electrophysiological signals

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant