CN113392798B - Multi-model selection and fusion method for optimizing motion recognition precision under resource limitation

Multi-model selection and fusion method for optimizing motion recognition precision under resource limitation

Info

Publication number
CN113392798B
CN113392798B (application CN202110729963.5A)
Authority
CN
China
Prior art keywords
model
resource
rectangular
models
recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110729963.5A
Other languages
Chinese (zh)
Other versions
CN113392798A (en)
Inventor
张兰
李向阳
刘梦境
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202110729963.5A
Publication of CN113392798A
Application granted
Publication of CN113392798B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a multi-model selection and fusion method for optimizing action recognition accuracy under resource constraints, belonging to the fields of intelligent perception and multi-modal fusion. The method comprises the following steps: step 1, model the resource constraint parameters and the resource parameters of each individual action recognition model; step 2, train an actor-critic reinforcement learning model to obtain an actor network serving as the online selection model and a critic network serving as the value-scoring model; and step 3, run the corresponding models according to the selected model combination and fuse their recognition results into the final recognition result. The method can handle strictly orthogonal resource constraints and can fuse and exploit data from multiple modalities and multiple models; compared with direct end-to-end fusion, it achieves higher accuracy at lower resource cost. It can be applied to action recognition in resource-constrained multi-modal environments, such as smart homes, patient care, and autonomous driving.

Description

Multi-model selection and fusion method for optimizing motion recognition precision under resource limitation
Technical Field
The invention relates to the field of intelligent behavior perception, in particular to a multi-model selection and fusion method for optimizing motion recognition accuracy under resource limitation.
Background
With the development of intelligent sensing devices and artificial intelligence recognition technology, intelligent behavior perception is receiving more and more attention. For multi-modal perception scenarios such as smart homes, patient care, and autonomous driving, fusing the recognition results of multiple models to improve recognition accuracy on the multi-modal perception data collected in these scenarios brings both opportunities and new challenges.
Existing intelligent behavior perception methods fall mainly into two categories: 1) methods aimed purely at improving accuracy; 2) methods that balance resource consumption against accuracy. The former focus on final recognition accuracy without considering resource overhead. The latter take resource constraints into account, such as device occupancy and energy consumption.
However, existing methods that balance resource consumption and accuracy treat energy consumption only qualitatively and do not consider stricter, quantitative resource limits such as memory footprint and the latency introduced by computation time. Moreover, these methods fuse only two models or two modalities; fusion of more than two models or modalities has not been realized.
Disclosure of Invention
In view of these problems, the invention aims to provide a multi-model selection and fusion method for optimizing action recognition accuracy under resource constraints, addressing two shortcomings of existing accuracy-resource trade-off methods: the absence of strict, quantitative resource limits and the restriction to fusing only two models or two modalities.
The purpose of the invention is realized by the following technical scheme:
the embodiment of the invention provides a multi-model selection method for optimizing action recognition precision under resource limitation, which comprises the following steps:
step 1, resource limiting parameter modeling and resource parameter modeling of each action recognition model:
modeling a resource limiting parameter determined according to processing resources into a total rectangular model, wherein the total memory limiting parameter of the resource limiting parameter is used as the length of the total rectangular model, and the total delay limiting parameter of the resource limiting parameter is used as the width of the total rectangular model;
modeling a resource parameter of each action recognition model in an action recognition model library into a sub-rectangular model, wherein a memory parameter of the resource parameter is used as the length of the sub-rectangular model, and a time delay parameter of the resource parameter is used as the width of the sub-rectangular model;
the length and the width of the sub-rectangular model are respectively smaller than those of the total rectangular model;
step 2, using an actor-critic reinforcement learning model as the online selection model: time-aligned multi-modal perception data is used as the input for training the actor network of the online selection model; each action recognition model in the model combination selected by the actor network from the action recognition model library is run, the recognition results of the action recognition models are fused to obtain a final recognition result, and whether the final recognition result is correct is judged by comparing it with the actual data label;
taking the multi-modal perception data together with the model combination output by the actor network as the input for training the critic network of the actor-critic reinforcement learning model, which outputs the value of the current model combination;
utilizing the resource constraint parameter model of step 1 and the resource parameter models of the action recognition models to judge whether the model combination exceeds the resource constraints;
calculating a reward function from whether the final recognition result is correct and whether the model combination exceeds the resource constraints, and updating the parameters of the actor network and the critic network by gradient descent according to the reward function;
step 3, online action recognition:
inputting the multi-modal perception data to be recognized into the trained online selection model, which outputs a model combination; each action recognition model in the model combination is run and the recognition results are fused to obtain the final recognition result.
The multi-model selection method for optimizing action recognition accuracy under resource constraints provided by the embodiments of the invention has the following beneficial effects:
by modeling the resource limitation parameters into a total rectangular model, whether the model combination selected by the online selection model meets the resource limitation or not can be conveniently judged by using a rectangular packing mode, and then the optimal model combination can be obtained under the resource limitation, and the recognition precision after multi-model fusion is optimized. The coincidence can dynamically fuse various models and modes under the limitation of memory and time, and the action recognition precision is optimized.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed for describing the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present invention; other drawings can be obtained from them by those skilled in the art without creative effort.
FIG. 1 is a flowchart of a multi-model selection and fusion method for optimizing motion recognition accuracy under resource constraints according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of resource constraint parameter modeling provided by an embodiment of the present invention;
FIG. 3 is a schematic diagram of a system for collecting multimodal data according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of the online action recognition method according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the specific contents of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention. Details which are not described in detail in the embodiments of the invention belong to the prior art which is known to a person skilled in the art.
Referring to fig. 1, an embodiment of the present invention provides a multi-model selection and fusion method for optimizing motion recognition accuracy under resource constraint, including:
step 1, resource limiting parameter modeling and resource parameter modeling of each action recognition model:
modeling a resource limiting parameter determined according to processing resources into a total rectangular model, wherein the total memory limiting parameter of the resource limiting parameter is used as the length of the total rectangular model, and the total delay limiting parameter of the resource limiting parameter is used as the width of the total rectangular model;
modeling a resource parameter of each action recognition model in an action recognition model library into a sub-rectangular model, wherein a memory parameter of the resource parameter is used as the length of the sub-rectangular model, and a time delay parameter of the resource parameter is used as the width of the sub-rectangular model;
the length and the width of the sub-rectangular model are respectively smaller than the length and the width of the total rectangular model;
step 2, using an actor-critic reinforcement learning model as the online selection model: time-aligned multi-modal perception data is used as the input for training the actor network of the online selection model; each action recognition model in the model combination selected by the actor network from the action recognition model library is run, the recognition results of the action recognition models are fused to obtain a final recognition result, and whether the final recognition result is correct is judged by comparing it with the actual data label;
taking the multi-modal perception data together with the model combination output by the actor network as the input for training the critic network of the actor-critic reinforcement learning model, which outputs the value of the current model combination;
utilizing the resource constraint parameter model of step 1 and the resource parameter models of the action recognition models to judge whether the model combination exceeds the resource constraints;
calculating a reward function from whether the final recognition result is correct and whether the model combination exceeds the resource constraints, and updating the parameters of the actor network and the critic network by gradient descent according to the reward function;
step 3, online action recognition:
inputting the multi-modal perception data to be recognized into the trained online selection model, which outputs a model combination; each action recognition model in the model combination is run and the recognition results are fused to obtain the final recognition result.
In the method, the loss function of the actor network is the negative of the mean of the critic network's output; the loss function of the critic network is the mean squared error between the value it outputs and the subsequently computed reward value of the reward function.
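By way of illustration, a minimal PyTorch sketch of these two loss functions follows. The network shapes, layer sizes, and the DDPG-style coupling of the two updates are assumptions made for the sake of a runnable example; the text above fixes only the two loss definitions themselves.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class Actor(nn.Module):
        # Maps time-aligned multi-modal features to per-model weights in [0, 1].
        def __init__(self, state_dim, n_models):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim, 128), nn.ReLU(),
                nn.Linear(128, n_models), nn.Sigmoid())

        def forward(self, state):
            return self.net(state)

    class Critic(nn.Module):
        # Scores a (perception data, model combination) pair with a scalar value.
        def __init__(self, state_dim, n_models):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(state_dim + n_models, 128), nn.ReLU(),
                nn.Linear(128, 1))

        def forward(self, state, combo):
            return self.net(torch.cat([state, combo], dim=-1))

    def update(actor, critic, actor_opt, critic_opt, state, combo, reward):
        # Critic loss: mean squared error between its value estimate and the
        # observed reward (a tensor shaped like the critic output).
        critic_loss = F.mse_loss(critic(state, combo), reward)
        critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
        # Actor loss: negative of the mean critic value of the actor's own output.
        actor_loss = -critic(state, actor(state)).mean()
        actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()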
In step 2 of the above method, the reward function is computed from whether the final recognition result is correct and whether the model combination exceeds the resource constraint parameters, as follows:
[Reward function formula; rendered only as an image in the original document. It combines rs and re as defined below.]
the reward function r ∈ [0,1] includes: whether the model combination exceeds the resource limit rs is belonged to {0,1 }; and whether the final identification result of the model combination is correct re ∈ {0,1 }.
In the above method, whether the model combination exceeds the resource limits is determined as follows:
for the sub-rectangle models corresponding to the resource parameters of the action recognition models in the combination, judge whether they can all be placed into the total rectangle model built from the resource constraint parameters in step 1; if they can, the model combination does not exceed the resource limits and rs = 1; if they cannot, the combination exceeds the limits and rs = 0;
if the final recognition result matches the actual data label, the result is correct and re = 1; otherwise re = 0.
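A minimal sketch of this reward computation follows. Note that the exact formula combining rs and re appears only as an image in the original document, so the arithmetic mean used below is an assumption, chosen only because it is consistent with the stated range r ∈ [0,1].

    def compute_reward(fits_resources, result_correct):
        # rs = 1 if the model combination stays within the resource limits;
        # re = 1 if the fused recognition result matches the data label.
        # The combining formula is an assumption (the original shows an image).
        rs = 1.0 if fits_resources else 0.0
        re = 1.0 if result_correct else 0.0
        return (rs + re) / 2.0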
In the method, a rectangle-packing algorithm is used to judge whether the sub-rectangle models corresponding to the resource parameters of the action recognition models in the combination can be placed into the total rectangle model built from the resource constraint parameters established in step 1.
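The text does not name a particular rectangle-packing algorithm, so the sketch below uses a greedy shelf heuristic as one plausible stand-in; it is conservative (it may reject a combination that an exact packing solver would accept), and an exact solver could be substituted.

    def fits_resources(total_mem, total_delay, model_rects):
        # model_rects: (memory, delay) rectangles of the selected models.
        # Rectangles may neither rotate nor overlap inside the
        # (total_mem x total_delay) box modeling the resource limits.
        rects = sorted(model_rects, key=lambda r: r[1], reverse=True)  # tallest first
        shelves = []        # each shelf: [used memory width, shelf delay height]
        used_height = 0.0
        for mem, delay in rects:
            if mem > total_mem or delay > total_delay:
                return False         # a single model already exceeds a limit
            for shelf in shelves:
                if shelf[0] + mem <= total_mem and delay <= shelf[1]:
                    shelf[0] += mem  # place it on an existing shelf
                    break
            else:
                if used_height + delay > total_delay:
                    return False     # no room to open a new shelf
                shelves.append([mem, delay])
                used_height += delay
        return True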
The model combination obtained in steps 2 and 3 of the above method comprises the selected models and a weight for each action recognition model; it corresponds to a subset of the action recognition model library. The weight of each action recognition model is assigned automatically by the actor network, guided by the critic network's feedback during training.
In the method, the multi-modal perception data is the to-be-recognized data sensed by multiple sensors.
In step 2 of the above method, the recognition results of the action recognition models are fused by weighting, according to each model's weight in the model combination.
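A short sketch of this weighted fusion, assuming each action recognition model outputs a class-probability vector; averaging by weight is one natural reading of "weighted manner", since the exact fusion rule is not spelled out further in the text.

    import numpy as np

    def fuse(probabilities, weights):
        # probabilities: (n_models, n_classes) outputs of the selected models;
        # weights: each model's weight from the model combination.
        # Returns the index of the final recognized action class.
        w = np.asarray(weights, dtype=float)[:, None]
        fused = (np.asarray(probabilities) * w).sum(axis=0) / max(w.sum(), 1e-9)
        return int(fused.argmax())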
Under memory and time limits, the method can select online the optimal model combination satisfying the resource constraints, and it optimizes action recognition accuracy by dynamically fusing multiple models and modalities.
The embodiments of the present invention are described in further detail below.
Referring to fig. 1, an embodiment of the present invention provides a multi-model selection method for optimizing motion recognition accuracy under resource constraint, including the following steps:
step 1, modeling a resource limitation parameter based on a rectangular packing algorithm: modeling the resource limiting parameter into a total rectangular model, taking the memory limiting parameter of the resource limiting parameter as the length of the total rectangular model, and taking the time delay limiting parameter of the resource limiting parameter as the width of the total rectangular model;
modeling the resource parameter of each action recognition model in the model library into a sub-rectangular model, wherein the memory parameter of the resource parameter is used as the length of the sub-rectangular model, and the time delay parameter of the resource parameter is used as the width of the sub-rectangular model;
the length and the width of each sub-rectangle model are respectively smaller than those of the total rectangle model; that is, the total rectangle model is a large rectangle and the sub-rectangle models are small rectangles, so the resource constraint is converted into whether the selected small rectangles can all be placed inside the large rectangle (see FIG. 2), where the small rectangles may neither rotate nor overlap;
step 2, taking an actor-critic reinforcement learning model as the online selection model, and training it to select a model combination from the model library online, specifically:
step 21) training the actor network and the critic network of the actor-critic reinforcement learning model: the input of the actor network is the multi-modal perception data, and its output is a model combination (comprising the selected models and each model's weight); the input of the critic network is the multi-modal perception data together with the model combination output by the actor network, and its output is the value of the current model combination; the loss function of the actor network is the negative of the mean of the values output by the critic network; the loss function of the critic network is the mean squared error between the value it outputs and the subsequently computed reward value; both networks update their parameters by gradient descent;
step 22) updating the actor network and the critic network with the following reward-function feedback. The reward r ∈ [0,1] combines two aspects: whether the model combination stays within the resources, rs ∈ {0,1}; and whether the fused recognition result of the models in the combination is correct, re ∈ {0,1}. The reward function is:
[Reward function formula; rendered only as an image in the original document. It combines rs and re as described above.]
step 3, operating corresponding action recognition models according to the model combination output by the online selection model, and fusing the recognition result of each action recognition model to obtain a final recognition result;
step 4, calculating the reward function of step 22 according to the action recognition models in the model combination and the final fused result, recording tuples consisting of the input data, the model combination, and the reward value, and updating the parameters of the actor network and the critic network by gradient descent using this history.
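A hedged sketch of this history-based update, reusing update() from the loss-function sketch earlier in this description; the buffer size and batch size are illustrative assumptions.

    import random
    from collections import deque

    # Tuples of (input data, model combination, reward value), as in step 4.
    history = deque(maxlen=10_000)

    def record(state, combo, reward_value):
        history.append((state, combo, reward_value))

    def replay_update(actor, critic, actor_opt, critic_opt, batch_size=32):
        # Replay a random batch of recorded tuples through gradient descent.
        if len(history) < batch_size:
            return
        for state, combo, reward_value in random.sample(history, batch_size):
            update(actor, critic, actor_opt, critic_opt, state, combo, reward_value)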
Through the above steps, the method models the two-dimensional resource constraints, dynamically selects and fuses multiple models by choosing model combinations online, and improves the recognition accuracy of multi-model fusion.
Examples
(1) Early preparation:
1a) collect multi-modal perception data (see FIG. 3), including smartphone accelerometer data, Wi-Fi data, and sound data;
1b) train the action recognition models in the action recognition model library on the perception data of the different modalities, including SVM, XGBoost, and LSTM models, and measure each model's recognition accuracy, memory footprint, and recognition latency;
(2) training an online selection model:
21) time-align the collected multi-modal perception data and use it for training the online selection model;
22) train the actor-critic reinforcement learning model serving as the online selection model on the time-aligned multi-modal perception data, obtaining a model combination from the model library. A model combination comprises several models, each with an attached weight. The action recognition model library holds many action recognition models, while the combination contains only some of them, so the combination can be regarded as a subset of the library; each model in the combination carries a weight that is later used to weight its recognition result during fusion. For example, with 3 models in the library, a model combination may be a vector (a, b, c), where a, b, c are the weights of models one, two, and three, each taking a value between 0 and 1; a = 0 means model one is not selected, while a = 0.5 means model one is selected with weight 0.5, as illustrated below.
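Concretely (values other than a = 0 and a = 0.5 are arbitrary illustrative choices):

    import numpy as np

    combo = np.array([0.0, 0.5, 0.8])    # weights for models one, two, three
    selected = np.nonzero(combo > 0)[0]  # model one (weight 0) is not selected
    print(selected)                      # -> [1 2], i.e., models two and three
    print(combo[selected])               # -> [0.5 0.8]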
23) run each model in the model combination and fuse their recognition results to obtain the final recognition result;
24) judge whether the final recognition result is correct by comparing it with the actual data label; judge whether the model combination exceeds the resource limits via the modeling of step 1; compute the reward function from these two conditions, and update the parameters of the actor network and the critic network by gradient descent;
(3) online action recognition (see fig. 4):
inputting the multi-modal perception data into the trained online selection model, which outputs a model combination; each action recognition model in the combination is run and the recognition results are fused to obtain the final recognition result.
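Tying the earlier sketches together, a hedged sketch of this online step; actor and fuse() are the functions from the sketches above, predict_proba is assumed to be an sklearn-style interface, and routing each model to its own modality's input is likewise an assumption.

    import torch

    def recognize(actor, state, model_library, modal_inputs):
        # The trained actor proposes a model combination; only the selected
        # models run, and their class-probability outputs are fused by weight.
        with torch.no_grad():
            combo = actor(state).squeeze(0).numpy()      # per-model weights
        chosen = [i for i, w in enumerate(combo) if w > 0]
        probs = [model_library[i].predict_proba(modal_inputs[i])[0] for i in chosen]
        return fuse(probs, [combo[i] for i in chosen])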
The method of the invention models the constraint conditions with rectangle packing, solving the problem of strict, quantitative resource limits; it selects models online based on the actor-critic framework, sidestepping the problem that labels for model combinations are non-unique and hard to obtain; and it optimizes final recognition accuracy by dynamically selecting model combinations online. It can handle strictly orthogonal resource constraints and can fuse and exploit perception data from multiple modalities and multiple models; compared with direct end-to-end fusion, it achieves higher accuracy at lower resource cost. The method can be applied to action recognition in resource-constrained multi-modal environments, such as smart homes, patient care, and autonomous driving.
Those of ordinary skill in the art will understand that: all or part of the processes of the methods for implementing the embodiments may be implemented by a program, which may be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods as described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (7)

1. A multi-model selection and fusion method for optimizing motion recognition accuracy under resource limitation is characterized by comprising the following steps:
step 1, resource limiting parameter modeling and resource parameter modeling of each action recognition model:
modeling a resource limiting parameter determined according to processing resources into a total rectangular model, wherein the total memory limiting parameter of the resource limiting parameter is used as the length of the total rectangular model, and the total delay limiting parameter of the resource limiting parameter is used as the width of the total rectangular model;
modeling a resource parameter of each action recognition model in an action recognition model library into a sub-rectangular model, wherein a memory parameter of the resource parameter is used as the length of the sub-rectangular model, and a time delay parameter of the resource parameter is used as the width of the sub-rectangular model;
the length and the width of the sub-rectangular model are respectively smaller than those of the total rectangular model;
step 2, using an actor-critic reinforcement learning model as the online selection model: time-aligned multi-modal perception data is used as the input for training the actor network of the online selection model; each action recognition model in the model combination selected by the actor network from the action recognition model library is run, the recognition results of the action recognition models are fused to obtain a final recognition result, and whether the final recognition result is correct is judged by comparing it with the actual data label;
taking the multi-modal perception data together with the model combination output by the actor network as the input for training the critic network of the actor-critic reinforcement learning model, which outputs the value of the current model combination;
utilizing the resource constraint parameter model of step 1 and the resource parameter models of the action recognition models to judge whether the model combination exceeds the resource constraints;
calculating a reward function from whether the final recognition result is correct and whether the model combination exceeds the resource constraints, and updating the parameters of the actor network and the critic network by gradient descent according to the reward function;
step 3, online action recognition:
inputting the multi-modal perception data to be recognized into the trained online selection model, which outputs a model combination; each action recognition model in the model combination is run and the recognition results are fused to obtain the final recognition result.
2. The multi-model selection and fusion method for optimizing motion recognition accuracy under resource constraints of claim 1, wherein the loss function of the actor network is the negative of the mean of the values output by the critic network, and the loss function of the critic network is the mean squared error between the value it outputs and the subsequently computed reward value.
3. The multi-model selection and fusion method for optimizing motion recognition accuracy under resource constraints of claim 1, wherein determining whether the model combination exceeds the resource constraints comprises:
judging whether the sub-rectangle models corresponding to the resource parameters of the action recognition models in the model combination can be placed into the total rectangle model built from the resource constraint parameters established in step 1; if they can all be placed, determining that the model combination does not exceed the resource limits, and if they cannot, determining that the model combination exceeds the resource limits;
and if the final recognition result matches the actual data label, determining that the final recognition result is correct.
4. The multi-model selection and fusion method for optimizing motion recognition accuracy under resource constraints of claim 3, wherein a rectangle-packing algorithm is used to determine whether the sub-rectangle models corresponding to the resource parameters of the action recognition models in the model combination can be placed into the total rectangle model built from the resource constraint parameters established in step 1.
5. The multi-model selection and fusion method for optimizing motion recognition accuracy under resource constraints according to claim 1 or 2, wherein the model combination obtained in steps 2 and 3 comprises the selected action recognition models and a weight for each action recognition model.
6. The multi-model selection and fusion method for optimizing motion recognition accuracy under resource constraints according to claim 1 or 2, wherein the multi-modal perception data is multi-sensor perception data to be recognized.
7. The multi-model selection and fusion method for optimizing motion recognition accuracy under resource constraints according to claim 1 or 2, wherein in the step 2, the recognition results of the motion recognition models are fused in a weighted manner according to the weights of the motion recognition models in the model combination.
CN202110729963.5A 2021-06-29 2021-06-29 Multi-model selection and fusion method for optimizing motion recognition precision under resource limitation Active CN113392798B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110729963.5A CN113392798B (en) 2021-06-29 2021-06-29 Multi-model selection and fusion method for optimizing motion recognition precision under resource limitation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110729963.5A CN113392798B (en) 2021-06-29 2021-06-29 Multi-model selection and fusion method for optimizing motion recognition precision under resource limitation

Publications (2)

Publication Number Publication Date
CN113392798A CN113392798A (en) 2021-09-14
CN113392798B (en) 2022-09-02

Family

ID=77624468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110729963.5A Active CN113392798B (en) 2021-06-29 2021-06-29 Multi-model selection and fusion method for optimizing motion recognition precision under resource limitation

Country Status (1)

Country Link
CN (1) CN113392798B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019081778A1 (en) * 2017-10-27 2019-05-02 Deepmind Technologies Limited Distributional reinforcement learning for continuous control tasks
CN108600379A (en) * 2018-04-28 2018-09-28 中国科学院软件研究所 A kind of isomery multiple agent Collaborative Decision Making Method based on depth deterministic policy gradient
CN112700664B (en) * 2020-12-19 2022-10-28 北京工业大学 Traffic signal timing optimization method based on deep reinforcement learning

Also Published As

Publication number Publication date
CN113392798A (en) 2021-09-14

Similar Documents

Publication Publication Date Title
CN112418451A (en) Transformer fault diagnosis positioning system based on digital twinning
CN111226193A (en) Electronic equipment and method for changing chat robot
JP6432859B2 (en) Service providing system and program
US20140129499A1 (en) Value oriented action recommendation using spatial and temporal memory system
US20210165377A1 (en) Method for controlling commercial washing machine by using artificial intelligence and system for the same
CN112084793B (en) Semantic recognition method, device and readable storage medium based on dependency syntax
CN110930705B (en) Intersection traffic decision system, method and equipment
CN113688957A (en) Target detection method, device, equipment and medium based on multi-model fusion
KR102535185B1 (en) Method and apparatus for providing waste plastic recycling service
CN112418302A (en) Task prediction method and device
CN111800289A (en) Communication network fault analysis method and device
CN113392798B (en) Multi-model selection and fusion method for optimizing motion recognition precision under resource limitation
CN117407817B (en) Abnormality monitoring system of power distribution automation machine room
CN117851909A (en) Multi-cycle decision intention recognition system and method based on jump connection
US20230394554A1 (en) Method and internet of things system of charging pile recommendation for new energy vehicle in smart city
KR102406375B1 (en) An electronic device including evaluation operation of originated technology
CN114581652A (en) Target object detection method and device, electronic equipment and storage medium
WO2023143570A1 (en) Connection relationship prediction method and related device
CN116520074A (en) Active power distribution network fault positioning method and system based on cloud edge cooperation
CN113825165A (en) 5G slice network congestion early warning method and device based on time chart network
CN115730248A (en) Machine account detection method, system, equipment and storage medium
CN109684471B (en) Application method of AI intelligent text processing system in new retail field
US20210168195A1 (en) Server and method for controlling server
CN114611696A (en) Model distillation method, device, electronic equipment and readable storage medium
KR102659929B1 (en) System for online sale

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant