CN113392798A

CN113392798A - Multi-model selection and fusion method for optimizing motion recognition precision under resource limitation

Info

Publication number: CN113392798A
Application number: CN202110729963.5A
Authority: CN
Inventors: 张兰; 李向阳; 刘梦境
Original assignee: University of Science and Technology of China USTC
Current assignee: University of Science and Technology of China USTC
Priority date: 2021-06-29
Filing date: 2021-06-29
Publication date: 2021-09-14
Anticipated expiration: 2041-06-29
Also published as: CN113392798B

Abstract

The invention discloses a multi-model selection and fusion method for optimizing action recognition accuracy under resource constraints, belonging to the field of intelligent perception and multi-modal fusion. The method comprises: step 1, modeling the resource limitation parameters and the resource parameters of each single action recognition model; step 2, training an actor-critic reinforcement learning model to obtain the actor network as the online selection model and the critic network as the value scoring model; step 3, running the corresponding models according to the model combination and fusing their recognition results into the final recognition result. The method can handle strict orthogonal resource constraints and can fuse data and models from multiple modalities. Compared with direct end-to-end fusion, it can achieve higher accuracy. The method can be applied to action recognition in multi-modal environments when resources are limited, in scenarios such as smart homes, patient care and unmanned driving.

Description

Multi-model selection and fusion method for optimizing motion recognition precision under resource limitation

Technical Field

The invention relates to the field of intelligent behavior perception, in particular to a multi-model selection and fusion method for optimizing motion recognition accuracy under resource limitation.

Background

With the development of intelligent sensing equipment and artificial intelligence recognition technology, intelligent behavior sensing is receiving more and more attention. In multi-modal perception scenarios, such as smart homes, patient care and unmanned driving, the recognition results of multiple models are fused to improve recognition accuracy on the multi-modal perception data collected in those scenarios. This brings opportunities and, at the same time, new challenges.

Existing intelligent behavior perception methods fall mainly into two categories: 1) methods that aim to improve accuracy; 2) methods that balance resource consumption against accuracy. The former focus on the final recognition accuracy without regard to resource overhead. The latter take resource constraints into account, such as device occupancy and energy consumption.

However, the existing methods for balancing resource consumption and accuracy consider energy consumption only qualitatively and do not consider stricter, quantitative resource limitations, such as memory occupation and the time delay caused by computation. In addition, these methods involve the fusion of only two models or two modalities; the fusion of more than two models or modalities has not been realized.

Disclosure of Invention

Aiming at the problems in the prior art, the invention provides a multi-model selection and fusion method for optimizing action recognition accuracy under resource limitation. It addresses the problems that existing intelligent perception recognition methods for balancing resource consumption and accuracy do not consider stricter quantitative resource limitations and fuse only two models or two modalities.

The purpose of the invention is realized by the following technical scheme:

the embodiment of the invention provides a multi-model selection method for optimizing action recognition precision under resource limitation, which comprises the following steps:

step 1, resource limiting parameter modeling and resource parameter modeling of each action recognition model:

modeling a resource limiting parameter determined according to processing resources into a total rectangular model, wherein the total memory limiting parameter of the resource limiting parameter is used as the length of the total rectangular model, and the total delay limiting parameter of the resource limiting parameter is used as the width of the total rectangular model;

modeling a resource parameter of each action recognition model in an action recognition model library into a sub-rectangular model, wherein a memory parameter of the resource parameter is used as the length of the sub-rectangular model, and a time delay parameter of the resource parameter is used as the width of the sub-rectangular model;

the length and the width of the sub-rectangular model are respectively smaller than those of the total rectangular model;

step 2, using an actor-critic reinforcement learning model as an online selection model, using time-aligned multi-modal perception data as input to train the actor network of the online selection model, running each action recognition model in the model combination that the actor network selects from the action recognition model library, fusing the recognition results of each action recognition model to obtain a final recognition result, and judging whether the final recognition result is correct by comparing it with the actual data label;

taking the multi-modal perception data and the model combination output by the actor network as input to train the critic network of the actor-critic reinforcement learning model, and obtaining the value of the current model combination;

utilizing the resource limiting parameter modeling of step 1 and the resource parameter modeling of each action recognition model to judge whether the model combination exceeds the resource limiting parameter;

calculating a reward function by combining whether the final recognition result is correct and whether the model combination exceeds the resource limit, and updating the parameters of the actor network and the critic network based on a gradient descent method according to the reward function;

step 3, online action recognition:

inputting multi-modal perception data to be recognized into a trained online selection model, outputting a model combination by the online selection model, operating each action recognition model in the model combination and fusing the recognition result of each action recognition model to obtain a final recognition result.

According to the technical scheme provided by the invention, the multi-model selection method for optimizing the action recognition accuracy under the resource limitation provided by the embodiment of the invention has the beneficial effects that:

by modeling the resource limitation parameters as a total rectangular model, rectangle packing can conveniently be used to judge whether the model combination selected by the online selection model meets the resource limitation, so that the optimal model combination can be obtained under the resource limitation and the recognition accuracy after multi-model fusion is optimized. The method can dynamically fuse multiple models and modalities under memory and time limitations, and optimizes the action recognition accuracy.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of a multi-model selection and fusion method for optimizing motion recognition accuracy under resource constraints according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of resource constraint parameter modeling provided by an embodiment of the present invention;

FIG. 3 is a schematic diagram of a system for collecting multimodal data according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of an online identification method according to an embodiment of the present invention;

Detailed Description

The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the specific contents of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention. Details which are not described in detail in the embodiments of the invention belong to the prior art which is known to the person skilled in the art.

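Referring to fig. 1, an embodiment of the present invention provides a multi-model selection and fusion method for optimizing motion recognition accuracy under resource constraint, including step 1 (resource limitation parameter modeling and resource parameter modeling of each action recognition model), step 2 (training of the online selection model) and step 3 (online action recognition), as set out in the Disclosure of Invention above.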

In the method, the loss function of the actor network is the negative of the mean of the values output by the critic network; the loss function of the critic network is the mean square error between the value it outputs and the reward value of the reward function calculated subsequently.
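The two loss definitions can be written out concretely. The following is a minimal sketch in plain Python, assuming the critic produces one scalar value per sample in a training batch; the function names `actor_loss` and `critic_loss` are illustrative and do not come from the patent.

```python
def actor_loss(critic_values):
    """Negative of the mean critic value over a batch.

    Minimizing this by gradient descent pushes the actor toward model
    combinations that the critic scores highly.
    """
    values = list(critic_values)
    return -sum(values) / len(values)


def critic_loss(critic_values, rewards):
    """Mean squared error between critic values and the observed rewards."""
    pairs = list(zip(critic_values, rewards))
    return sum((value - reward) ** 2 for value, reward in pairs) / len(pairs)


# Example batch of two samples (made-up numbers for illustration only).
batch_values = [0.5, 1.0]
batch_rewards = [1.0, 1.0]
loss_actor = actor_loss(batch_values)
loss_critic = critic_loss(batch_values, batch_rewards)
```

In a real training loop these quantities would be computed by an automatic differentiation framework so that gradients can flow back into the two networks; the sketch only shows the arithmetic.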

In step 2 of the method, the reward function is calculated by combining whether the final recognition result is correct and whether the model combination exceeds the resource limit parameter, wherein the reward function is as follows:

the reward function r ∈ [0,1] includes: whether the model combination exceeds the resource limit, rs ∈ {0,1}; and whether the final recognition result of the model combination is correct, re ∈ {0,1}.

In the above method, determining whether the model combination exceeds the resource limit in the following manner includes:

judging whether the sub-rectangular models corresponding to the resource parameters of each action recognition model in the model combination can be put into the total rectangular model corresponding to the resource limitation parameters established in step 1; if the sub-rectangular models can be put in, determining that the model combination does not exceed the resource limitation, namely rs is 1, and if they cannot be put in, determining that the model combination exceeds the resource limitation, namely rs is 0;

and if the final recognition result is consistent with the actual data label, determining that the final recognition result is correct, namely re is 1; otherwise, re is 0.
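The two binary terms can be sketched in code. The text gives the range of r and its two components but not how they are combined, so the combination rule used below (the product of rs and re, which is 1 only when the combination fits and the result is correct) is an assumption made purely for illustration; the function names are likewise hypothetical.

```python
def resource_term(fits_budget):
    """rs: 1 if the model combination stays within the resource limit, else 0."""
    return 1 if fits_budget else 0


def correctness_term(predicted_label, true_label):
    """re: 1 if the fused recognition result matches the data label, else 0."""
    return 1 if predicted_label == true_label else 0


def reward(fits_budget, predicted_label, true_label):
    """Reward in [0, 1] built from rs and re.

    ASSUMPTION: the product rs * re is used here only as one possible
    combination rule; it is not stated in the source text.
    """
    rs = resource_term(fits_budget)
    re = correctness_term(predicted_label, true_label)
    return float(rs * re)
```

Under this assumed rule, a correct result obtained with an over-budget combination earns no reward, which matches the intent that resource limits are strict.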

In the method, a rectangular packing algorithm is used to judge whether the sub-rectangular models corresponding to the resource parameters of each action recognition model in the model combination can be put into the total rectangular model corresponding to the resource limitation parameters established in step 1.
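The feasibility check can be sketched as a small backtracking search. This is a minimal illustration, not the patent's algorithm, which is not specified beyond "rectangular packing". It assumes each rectangle is a `(memory, latency)` pair, that rectangles may not rotate or overlap, and that the model library is small enough for exhaustive search. The function name `fits` and the choice of candidate positions (edges of already placed rectangles) are assumptions.

```python
from itertools import product


def _overlaps(a, b):
    """True if two placed rectangles (x, y, w, h) share interior area."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def fits(total, subs):
    """Return True if all sub-rectangles can be placed inside `total`.

    total: (memory_limit, latency_limit)
    subs:  list of (memory, latency) pairs, one per selected model
    """
    total_w, total_h = total

    def place(remaining, placed):
        if not remaining:
            return True
        # Candidate coordinates: the container origin plus the right and
        # top edges of every rectangle placed so far.
        xs = {0} | {x + w for x, y, w, h in placed}
        ys = {0} | {y + h for x, y, w, h in placed}
        for i, (w, h) in enumerate(remaining):
            rest = remaining[:i] + remaining[i + 1:]
            for x, y in product(xs, ys):
                if x + w > total_w or y + h > total_h:
                    continue  # sticks out of the total rectangle
                candidate = (x, y, w, h)
                if any(_overlaps(candidate, p) for p in placed):
                    continue
                if place(rest, placed + [candidate]):
                    return True
        return False

    return place(list(subs), [])


# rs is 1 when the selected models fit, 0 otherwise (made-up budget and sizes).
rs = 1 if fits((4, 4), [(4, 2), (2, 2), (2, 2)]) else 0
```

The search is exponential in the number of selected models, so it is only practical for small model libraries; a production system would likely substitute a dedicated packing heuristic.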

The model combination resulting from steps 2 and 3 of the above method includes the selected plurality of models and the weight of each motion recognition model. The model combination corresponds to a subset of the models of the motion recognition model library. The weight of each action recognition model is automatically assigned by the actor network according to the reward of the critic network during training and learning.

In the method, the multi-modal sensing data is data sensed by multiple sensors to be identified.

In step 2 of the above method, the recognition results of the motion recognition models are fused in a weighted manner according to the weights of the motion recognition models in the model combination.
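Weighted fusion can be illustrated as follows. The sketch assumes that every action recognition model outputs a list of class probabilities over the same set of actions and that the fused label is the class with the highest weighted sum; the patent only states that results are fused "in a weighted manner", so both the output format and the argmax rule are assumptions, as is the function name `fuse`.

```python
def fuse(weights, outputs):
    """Fuse per-model class probabilities with the combination weights.

    weights: one weight per model in the library; 0 means "not selected"
    outputs: one list of class probabilities per model (same class order)
    Returns (fused_label_index, fused_scores).
    """
    n_classes = len(outputs[0])
    scores = [0.0] * n_classes
    any_selected = False
    for weight, probabilities in zip(weights, outputs):
        if weight <= 0:
            continue  # unselected models do not contribute
        any_selected = True
        for c in range(n_classes):
            scores[c] += weight * probabilities[c]
    if not any_selected:
        raise ValueError("the model combination selects no model")
    label = max(range(n_classes), key=scores.__getitem__)
    return label, scores


# Three models, two action classes; model two has weight 0 and is ignored.
label, scores = fuse([0.5, 0.0, 0.3], [[0.7, 0.3], [0.1, 0.9], [0.2, 0.8]])
```

Here the fused scores are 0.5 × [0.7, 0.3] + 0.3 × [0.2, 0.8] = [0.41, 0.39], so class 0 is chosen even though model three alone would have picked class 1.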

The method can select online the optimal model combination that meets the resource limitation under memory and time limits, and optimizes the action recognition accuracy by dynamically fusing multiple models and modalities.

The embodiments of the present invention are described in further detail below.

Referring to fig. 1, an embodiment of the present invention provides a multi-model selection method for optimizing motion recognition accuracy under resource constraint, including the following steps:

step 1, modeling a resource limitation parameter based on a rectangular packing algorithm: modeling the resource limiting parameter into a total rectangular model, taking the memory limiting parameter of the resource limiting parameter as the length of the total rectangular model, and taking the time delay limiting parameter of the resource limiting parameter as the width of the total rectangular model;

modeling the resource parameter of each action recognition model in the model library into a sub-rectangular model, wherein the memory parameter of the resource parameter is used as the length of the sub-rectangular model, and the time delay parameter of the resource parameter is used as the width of the sub-rectangular model;

the length and the width of the established sub-rectangle model are respectively smaller than those of the total rectangle model, namely the total rectangle model is a large rectangle, and the sub-rectangle models are small rectangles, so that the resource constraint is converted into whether a plurality of selected small rectangles can be placed in the large rectangle (see figure 2), and the small rectangles cannot rotate and cannot be overlapped;

step 2, taking an actor-critic reinforcement learning model as an online selection model, and training the online selection model to select a model combination from the model library online, specifically:

step 21) training an actor network and a critic network of the actor-critic reinforcement learning model, wherein the input of the actor network is multi-modal perception data, and the output of the actor network is a model combination (comprising the selected models and the weight of each model); the input of the critic network is the multi-modal perception data and the model combination output by the actor network, and the output is the value of the current model combination; the loss function of the actor network is the negative of the mean of the values output by the critic network; the loss function of the critic network is the mean square error between the value output by the critic network and the reward value of the reward function obtained by subsequent calculation; the two networks update their network parameters based on a gradient descent method;

step 22) updating the actor network and the critic network with feedback from the following reward function, wherein the reward function r ∈ [0,1] comprises the following two aspects: whether the model combination exceeds the resource limit, rs ∈ {0,1}; and whether the recognition result fused from the models in the model combination is correct, re ∈ {0,1}; the reward function is specifically as follows:

step 3, operating corresponding action recognition models according to the model combination output by the online selection model, and fusing the recognition result of each action recognition model to obtain a final recognition result;

step 4, calculating the reward function of step 22 according to each action recognition model in the model combination and the final fusion result, recording tuples consisting of the input data, the model combination and the reward value, and updating the parameters of the actor network and the critic network based on a gradient descent method by using the historical records.

The method of the invention models the two-dimensional resource constraint, dynamically selects and fuses a plurality of models by selecting model combinations online through the above steps, and improves the recognition accuracy of multi-model fusion.
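The record keeping of step 4 can be sketched as a simple history buffer. The text only says that tuples of input data, model combination and reward value are recorded and that the history is used for updates; the fixed capacity, the uniform random sampling of mini-batches and the names `Record` and `History` below are assumptions for illustration.

```python
import random
from collections import deque, namedtuple

# One recorded interaction: the input data, the chosen combination, its reward.
Record = namedtuple("Record", ["data", "combination", "reward"])


class History:
    """Bounded store of (data, combination, reward) tuples."""

    def __init__(self, capacity=10000, seed=None):
        # ASSUMPTION: old records are dropped once capacity is reached.
        self._records = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, data, combination, reward):
        self._records.append(Record(data, combination, reward))

    def sample(self, batch_size):
        """Draw up to batch_size records uniformly without replacement."""
        k = min(batch_size, len(self._records))
        return self._rng.sample(list(self._records), k)

    def __len__(self):
        return len(self._records)


# Usage with made-up values: record three steps, then draw a mini-batch.
history = History(capacity=2, seed=0)
history.add({"accel": [0.1]}, (0.5, 0.0, 0.5), 1.0)
history.add({"accel": [0.2]}, (1.0, 0.0, 0.0), 0.0)
history.add({"accel": [0.3]}, (0.0, 0.7, 0.3), 1.0)
batch = history.sample(5)
```

Each sampled batch would then feed the critic loss (value versus recorded reward) and the actor loss for one gradient descent update.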

Examples

(1) Early preparation:

1a) collecting multi-modal perception data (see fig. 3), including smartphone acceleration sensor data, Wi-Fi data and sound data;

1b) respectively training a plurality of action recognition models in the action recognition model library, including SVM, XGBoost and LSTM models, by using perception data of different modalities, and measuring the recognition accuracy, memory occupation and recognition time delay of each model;

(2) training an online selection model:

21) time alignment is carried out on the collected multi-modal perception data, and the multi-modal perception data are used for training an online selection model;

22) training the actor-critic reinforcement learning model serving as the online selection model by using the time-aligned multi-modal perception data, and obtaining a model combination from the model library. The action recognition model library holds a plurality of action recognition models; a model combination contains only some of them and can be regarded as a subset of the model library. In addition, each action recognition model in the model combination carries a weight, which is used when the recognition results are subsequently fused. For example, if the model library has 3 models, the model combination may be a vector (a, b, c), where a, b and c correspond to action recognition models one, two and three respectively, take values between 0 and 1, and represent the weights of those models: a = 0 means that model one is not selected, and a = 0.5 means that the weight of model one is 0.5.

23) Operating each model in the model combination, and fusing the recognition results of each model to obtain a final recognition result;

24) judging whether the final recognition result is correct by comparing it with the actual data label; judging whether the model combination exceeds the resource limit through the modeling in step 1; calculating the reward function by combining the two results; and updating the parameters of the actor network and the critic network based on a gradient descent method;

(3) online action recognition (see fig. 4):

inputting multi-modal perception data into a trained online selection model, outputting a model combination by the online selection model, operating each action recognition model in the model combination and fusing the recognition result of each action recognition model to obtain a final recognition result.
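The online recognition flow can be sketched end to end. The sketch assumes the trained online selection model is available as a callable that maps a data sample to a weight vector such as (a, b, c), that each action recognition model is a callable returning class probabilities, and that a weight of 0 means the model is skipped and never executed. The name `recognize` and the stub models in the usage example are hypothetical.

```python
def recognize(sample, selector, models):
    """Run only the selected models on `sample` and fuse their outputs.

    sample:   the multi-modal perception data for one time window
    selector: trained online selection model, returns one weight per model
    models:   action recognition models, in the same order as the weights
    Returns the index of the recognized action class.
    """
    weights = selector(sample)
    scores = None
    for weight, model in zip(weights, models):
        if weight <= 0:
            continue  # not selected: the model is never run, saving resources
        probabilities = model(sample)
        if scores is None:
            scores = [0.0] * len(probabilities)
        for c, p in enumerate(probabilities):
            scores[c] += weight * p
    if scores is None:
        raise ValueError("the selector chose no model")
    return max(range(len(scores)), key=scores.__getitem__)


# Usage with stub callables standing in for trained networks.
calls = []

def model_one(sample):
    calls.append("one")
    return [0.9, 0.1]

def model_two(sample):
    calls.append("two")
    return [0.0, 1.0]

def model_three(sample):
    calls.append("three")
    return [0.4, 0.6]

result = recognize(
    {"accel": [0.1], "sound": [0.2]},
    lambda sample: (0.5, 0.0, 0.5),
    [model_one, model_two, model_three],
)
```

Because model two receives weight 0, it is never called, which is where the memory and latency savings of selecting a subset come from.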

The method of the invention models the constraints as a rectangle packing problem, which handles strict quantitative resource limitations; the online selection model is designed on the actor-critic framework, which addresses the problem that the label of a model combination is not unique and is difficult to obtain; and the final recognition accuracy is optimized by dynamically selecting model combinations online. The method can handle strict orthogonal resource constraints and can fuse perception data of multiple modalities and multiple models; compared with direct end-to-end fusion, it can achieve higher accuracy with lower resource occupation. The method can be applied to action recognition in multi-modal environments when resources are limited, in scenarios such as smart homes, patient care and unmanned driving.

Those of ordinary skill in the art will understand that: all or part of the processes of the methods for implementing the embodiments may be implemented by a program, which may be stored in a computer-readable storage medium, and when executed, may include the processes of the embodiments of the methods as described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A multi-model selection and fusion method for optimizing motion recognition accuracy under resource constraints, characterized by comprising:

Step 1, resource limitation parameter modeling and resource parameter modeling for each action recognition model:

The resource limitation parameter determined according to the processing resource is modeled as a total rectangle model, the total memory limitation parameter of the resource limitation parameter is used as the length of the total rectangle model, and the total delay limitation parameter of the resource limitation parameter is used as the width of the total rectangle model;

The resource parameters of each action recognition model in the action recognition model library are modeled as a sub-rectangular model, the memory parameter of the resource parameter is used as the length of the sub-rectangular model, and the delay parameter of the resource parameter is used as the width of the sub-rectangular model;

The length and width of the sub-rectangular model are respectively smaller than the length and width of the total rectangular model;

Step 2, take the actor-critic reinforcement learning model as the online selection model, use the time-aligned multimodal perception data as the input to train the actor network of the online selection model, run each action recognition model in the model combination selected by the actor network from the action recognition model library, and fuse the recognition results of each action recognition model to obtain the final recognition result; whether the final recognition result is correct is judged by comparing the final recognition result with the actual data label;

Using the multimodal perception data and the model combination output by the actor network as the input to train the critic network of the actor-critic reinforcement learning model, the value of the current model combination is obtained;

Use the resource limitation parameter modeling of step 1 and the resource parameter modeling of each action recognition model to judge whether the model combination exceeds the resource limitation parameter;

Calculate the reward function based on whether the final recognition result is correct and whether the model combination exceeds the resource limit, and update the parameters of the actor network and critic network based on the gradient descent method according to the reward function;

Step 3, online action recognition:

Input the multimodal perception data to be identified into the trained online selection model, output the model combination from the online selection model, run each action recognition model in the model combination and fuse the recognition results of each action recognition model, to obtain the final recognition result.

2. The multi-model selection and fusion method for optimizing action recognition accuracy under resource limitation according to claim 1, characterized in that the loss function of the actor network is the negative of the mean of the values output by the critic network; the loss function of the critic network is the mean squared error between the value it outputs and the reward value of the reward function that is subsequently computed.

3. The multi-model selection and fusion method for optimizing motion recognition accuracy under resource constraints according to claim 1 or 2, wherein in step 2 of the method, the reward function is calculated by combining whether the final recognition result is correct and whether the model combination exceeds the resource limitation parameter in the following manner, the reward function being:

The reward function r∈[0,1] includes: whether the model combination exceeds the resource limit rs∈{0,1}; whether the final recognition result of the model combination is correct re∈{0,1}.

4. The multi-model selection method for optimizing motion recognition accuracy under resource constraints according to claim 3, wherein whether the model combination exceeds the resource constraints is judged in the following manner, comprising:

Determine whether the sub-rectangular models corresponding to the resource parameters of each action recognition model of the model combination can be put into the total rectangular model corresponding to the resource limitation parameters established in step 1. If they can be put in, it is determined that the model combination does not exceed the resource limit, that is, rs is 1. If they cannot be put in, it is determined that the model combination exceeds the resource limit, that is, rs is 0;

If the final recognition result is consistent with the actual data label, it is determined that the final recognition result is correct and re is 1; otherwise re is 0.

5. The multi-model selection method for optimizing motion recognition accuracy under resource limitation according to claim 4, characterized in that a rectangular packing algorithm is used to judge whether the sub-rectangular models corresponding to the resource parameters of each motion recognition model of the model combination can be put into the total rectangular model corresponding to the resource limitation parameters established in step 1.

6. The multi-model selection method for optimizing motion recognition accuracy under resource constraints according to claim 1 or 2, wherein the model combination obtained in steps 2 and 3 comprises a plurality of selected motion recognition models and the weight of each motion recognition model.

7. The multi-model selection and fusion method for optimizing motion recognition accuracy under resource constraints according to claim 1 or 2, wherein the multi-modal perception data is multi-sensor perception data to be identified.

8. The multi-model selection and fusion method for optimizing motion recognition accuracy under resource constraints according to claim 1 or 2, wherein in step 2 of the method, the recognition results of each motion recognition model are fused in a weighted manner according to the weight of each motion recognition model in the model combination.