CN113033430B - Artificial intelligence method, system and medium for bilinear-based multi-modal information processing - Google Patents

Artificial intelligence method, system and medium for bilinear-based multi-modal information processing

Info

Publication number
CN113033430B
Authority
CN
China
Prior art keywords
bilinear
action
time sequence
sequence
gist
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110340725.5A
Other languages
Chinese (zh)
Other versions
CN113033430A (en)
Inventor
胡建芳 (Hu Jianfang)
侯智聪 (Hou Zhicong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat-sen University
Original Assignee
Sun Yat-sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat-sen University
Priority to CN202110340725.5A
Publication of CN113033430A
Application granted
Publication of CN113033430B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Social Psychology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a bilinear-based artificial intelligence method, system and medium for multi-modal information processing. The method comprises the following steps: converting a video stream into image frames; dividing the frames into action sequences; constructing skeleton temporal features and RGB and depth temporal features; assembling them into a three-dimensional feature cube and inputting the cube into a bilinear feature learning module; and outputting the classification recognition result. By fusing the multi-modal information of an RGBD video in a deep network constructed in a bilinear manner, the application overcomes the shortcoming of existing multi-modal models, which merely splice or weight the features or activation vectors of different modalities without deeply mining the information shared among them, and thereby achieves accurate action recognition. The bilinear operation is a plane-wise computation with low computational cost, making it suitable for industrial applications with high real-time requirements.

Description

Artificial intelligence method, system and medium for bilinear-based multi-modal information processing
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a bilinear-based artificial intelligence method, system and medium for multi-modal information processing.
Background
With the development of technology, RGBD imaging is becoming popular. Unlike a conventional RGB image, RGBD image information additionally includes the depth of each individual in the scene; similarly, the information carried by RGBD video differs from that of conventional RGB video.
In some emerging monitoring scenarios, RGBD cameras are widely used, for example in unmanned aerial vehicle shooting; recognizing the actions of individuals in these scenarios, e.g. to monitor for dangerous accidents, is often one of the main purposes of deploying such monitoring equipment. RGBD-based behavior recognition is therefore of considerable practical importance.
RGBD-based behavior recognition is an important branch of multi-modal information research in the field of artificial intelligence. RGBD image or video information comprises both the RGB information and the depth information of the image or video; besides shape and color, hierarchical information such as the distance from the camera can be obtained from RGBD data, and combining multiple modalities helps a model judge scene information better. In particular, video data of human behavior usually also contains human skeleton information, and adding skeleton information can further improve a model's ability to recognize human actions.
Existing multi-modal work processes multi-modal information in a relatively simple way, mainly falling into two types of methods: one extracts the features of different modalities with different convolutional neural networks (CNNs), splices them, and passes the result through a fully-connected classifier; the other obtains different activation vectors through complete networks and then weights the activation vectors of the different modalities (usually by averaging) to output the classification result.
This simplistic fusion of multi-modal information lacks deep mining of the information correlations among the modalities, so the comprehensive utilization of multi-modal information remains deficient. Deeply mining the information of each modality is a reasonable approach to overcoming this shortcoming.
Existing techniques for processing RGBD information based on multi-modal data fuse the various modalities only by simply splicing features or weighting the activation vectors of different modalities. For an RGBD information stream, however, a certain amount of common information exists among the modalities, and fusing information by such simple operations cannot deeply mine this commonality to better improve model performance.
As for the bilinear method presented herein, the traditional bilinear method is merely a conversion operation between certain features, and most bilinear operations are element-wise operations that are computationally expensive; this application further reduces the cost by converting them into plane-wise computations.
In addition, many techniques for human behavior recognition operate on the original input image or video, which introduces much extraneous background information and hampers the model in extracting effective action information.
Disclosure of Invention
The application aims to overcome the defect in the prior art that simply splicing or weighting different modality information fails to deeply mine the correlations of multi-modal information, and provides a bilinear-based artificial intelligence method, system and medium for multi-modal information processing.
In order to achieve the above purpose, the present application adopts the following technical scheme:
the application provides an artificial intelligence method for multi-mode information processing based on bilinear, which comprises the following steps:
converting the video stream into image frames and dividing the image frames into action sequences;
constructing skeleton time sequence characteristics, RGB time sequence characteristics and depth time sequence characteristics according to the action sequences, and constructing a three-dimensional characteristic cube;
inputting the three-dimensional feature cube into a bilinear feature learning module to obtain an activation vector; the bilinear feature learning module is a stack of a plurality of modal pooling layers and a time sequence pooling layer;
and taking the category corresponding to the maximum value in the activation vector as a classification result of the action recognition.
As a preferred technical solution, the action sequences are divided as follows:
the input image sequence is divided into D segments at equal intervals, and the sequence formed by the first d segments is denoted an action sequence of length d; this finally yields D action sequences, with lengths 1 through D.
As a preferred technical solution, the skeleton temporal feature is constructed as follows:
the action sequence is encoded with a dynamic skeleton encoder and input into a recurrent neural network to obtain the skeleton temporal feature.
As a preferred technical solution, the RGB and depth temporal features are constructed as follows:
constructing the action principal component (GIST) images of RGB and depth: for each RGBD image frame, the local image patches near the skeleton joints are collected and spliced into an image representing the motion information, yielding a GIST image sequence;
constructing the RGB and depth temporal features: K ordered action GIST images are selected from the GIST image sequence and input into a K-channel convolutional neural network to extract temporal features. The K images are selected as follows: the ⌊u·ls/K + δ⌋-th frame of the GIST image sequence is taken as the u-th of the K ordered action GIST images, where ls is the length of the input GIST sequence and δ is a perturbation factor, a random number drawn from the uniform distribution U(−ls/2K, ls/2K).
As a preferred technical solution, the three-dimensional feature cube is constructed as follows:
the skeleton temporal features and the RGB and depth temporal features are spliced to form a three-dimensional feature cube, denoted A, where A ∈ R^(M_A×T×C), M_A is the modality dimension, T is the temporal dimension, and C is the feature channel dimension.
As a preferred technical solution, the modality pooling layer is used for mining the information of different modalities, and its computation is:
L(:,:,c) = X^T A(:,:,c), c = 1, 2, …, C; equivalently, L(m′,:,:) = Σ_{m=1}^{M_A} X(m, m′) A(m,:,:), m′ = 1, …, M_L;
where the matrix X ∈ R^(M_A×M_L) is the weight matrix of the pooling layer, and M_A and M_L are the input and output modality dimensions.
As a preferred technical solution, the temporal pooling layer is used for mining temporal information, and its computation is:
Z(:,:,c) = L(:,:,c) Y, c = 1, 2, …, C;
where the matrix Y ∈ R^(T×T_Z) is the weight matrix of the pooling layer.
As a preferred technical solution, the bilinear feature learning module is specifically defined by the following formula:
B(A) = (f_T ∘ f_M)^N(A);
where f_T is the temporal pooling layer, f_M is the modality pooling layer, and the composed pooling pair is stacked N times.
The application also provides an artificial intelligence system for bilinear-based multi-modal information processing, comprising a preprocessing module, a feature construction module, a bilinear feature learning module and a recognition module;
the preprocessing module is used for converting a video stream into image frames and dividing the image frames into action sequences;
the feature construction module constructs skeleton temporal features and RGB and depth temporal features from the action sequences, and constructs a three-dimensional feature cube;
the input of the bilinear feature learning module is the three-dimensional feature cube, and the output is an activation vector; the bilinear feature learning module is a stack of several modality pooling layers and temporal pooling layers;
the recognition module is used for taking the category corresponding to the maximum value in the activation vector as the classification result of the action recognition.
The application also provides a storage medium storing a program which, when executed by a processor, implements the above artificial intelligence method for bilinear-based multi-modal information processing.
Compared with the prior art, the application has the following advantages and beneficial effects:
(1) The application fuses the multi-modal information of an RGBD video in a deep network constructed in a bilinear manner, overcoming the shortcoming of existing multi-modal models that merely splice or weight the features or activation vectors of different modalities without deeply mining the information shared among the modalities, and thereby achieves accurate action recognition;
(2) The bilinear operation is a plane-wise computation with low computational cost, making it suitable for industrial applications with high real-time requirements.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an artificial intelligence method for bi-linear based multi-modal information processing in accordance with an embodiment of the present application;
FIG. 2 is a schematic diagram of constructing the action GIST image according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a bilinear feature learning module according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating the difference between element level operations and plane level operations according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an artificial intelligence system based on bilinear multi-modal information processing according to an embodiment of the application;
FIG. 6 is a schematic diagram of an artificial intelligence system for bi-linear based multi-modal information processing applied to the field of motion/behavior monitoring in accordance with an embodiment of the present application;
fig. 7 is a schematic diagram of a storage medium according to an embodiment of the present application.
Detailed Description
In order to enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
Example 1
As shown in FIG. 1, this embodiment provides an artificial intelligence method for bilinear-based multi-modal information processing, including the following steps:
step 1: converting the video stream into image frames;
more specifically, in this embodiment, a ffmpeg or other tool is used to convert the video stream into image frames.
Step 2: input the image frames obtained in step 1 into convolutional neural networks (CNNs) and the skeleton sequences corresponding to the video into a recurrent neural network (RNN), and construct a three-dimensional feature cube;
more specifically, the individual sub-steps are as follows:
step 2.1: constructing an action sequence (Action History Sequence, abbreviated as AHS hereinafter);
in order to facilitate the subsequent extraction of the time sequence features, the input image sequence is firstly divided into D segments at equal intervals, the sequence consisting of the first D image sequences is called as an AHS with the length D, and is marked as |ahs|=d, so that the AHSs with the lengths from 1 to D are obtained, and the total number of the AHSs is D.
Step 2.2: constructing a skeleton time sequence characteristic;
for the skeleton sequence, the present embodiment uses dynamic skeleton descriptor (DS encoder) to encode the AHS and inputs the AHS to the RNN, the output of the RNN as the skeleton timing feature in the present scheme.
Step 2.3: constructing an action principal component diagram (Action GIST image) of RGB and depth;
the process of constructing GIST image is shown in fig. 2, and for each RGBD image frame, collecting the local image blocks near the skeleton joint points, and splicing into images representing action information, namely GIST image sequences; the GIST image sequence is used for removing influence of irrelevant information such as background and the like on model modeling action information. In this embodiment, GIST images are taken at 64×64, while the change in the node of interest also better reflects the action timing information.
Step 2.4: constructing RGB and depth time sequence characteristics;
the GIST image sequence (RGB image sequence and depth image sequence) obtained in step 2.3 is input to the CNN of the K-channel to extract the timing characteristics, and for the CNN of the K-channel, K sequential action GIST images are selected from the GIST sequence as input.
The method for selecting K ordered action GIST images from the GIST image sequence specifically comprises the following steps: selecting the first in the GIST image sequenceThe frame is used as the (u) th frame in K ordered motion GIST images, where ls is the length of the input GIST sequence, delta is the disturbance factor, and is subject to uniform distribution +.>Is a random number of (a) in the memory.
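A sketch of this sampling rule; the clamping of the index to the valid range is an added safeguard, not something stated in the patent.

```python
# Pick K ordered frames: index = floor(u * ls / K + delta),
# with delta ~ U(-ls/(2K), ls/(2K)), for u = 1..K.
import math
import random

def sample_k_frames(gist_sequence: list, K: int) -> list:
    ls = len(gist_sequence)
    picked = []
    for u in range(1, K + 1):
        delta = random.uniform(-ls / (2 * K), ls / (2 * K))
        idx = int(math.floor(u * ls / K + delta))
        idx = min(max(idx, 1), ls)        # safeguard: keep index inside [1, ls]
        picked.append(gist_sequence[idx - 1])
    return picked
```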
More specifically, the present embodiment sets two sets of K for each of RGB and depth video sequences, respectively, with a network of k=1 for extracting appearance information and a network of k=16 for extracting dynamic information.
Step 2.5: construct the feature cube;
the skeleton temporal features from step 2.2 and the RGB and depth temporal features from step 2.4 are spliced to form a three-dimensional feature cube.
Step 3: input the three-dimensional feature cube obtained in step 2 into the bilinear feature learning module to obtain an activation vector; the bilinear feature learning module is formed by stacking several modality pooling layers and temporal pooling layers, as shown in FIG. 3. This step details the construction of the two pooling layers and how they are stacked;
step 3.1: the three-dimensional feature cube obtained in the step 2 is recorded as A, whereinM A Represents the modal dimension, T represents the time dimension, and C represents the characteristic channel dimension.
Step 3.2: the modality pooling layer is used for mining the information of the different modalities, and its computation is:
L(:,:,c) = X^T A(:,:,c), c = 1, 2, …, C, (1)
where the matrix X ∈ R^(M_A×M_L) is the weight matrix of the pooling layer, and M_A and M_L are the input and output modality dimensions. Clearly, through this operation, the modality dimension of the input A is pooled from M_A to M_L. For simplicity, this embodiment denotes this layer f_M.
In particular, it is easy to show that equation (1) can be equivalently written as:
L(m′,:,:) = Σ_{m=1}^{M_A} X(m, m′) A(m,:,:), m′ = 1, 2, …, M_L, (2)
From equation (2), elements belonging to the same modality are weighted by the same parameter, so the pooling operation is plane-wise; compared with the element-wise operation of a conventional linear layer, this saves a significant amount of computation, as shown in FIG. 4.
Step 3.3: the temporal pooling layer is used for mining temporal information, and its computation is:
Z(:,:,c) = L(:,:,c) Y, c = 1, 2, …, C, (3)
where the matrix Y ∈ R^(T×T_Z) is the weight matrix of the pooling layer.
In particular, the temporal pooling layer can be converted into the same form as the modality pooling layer (cf. the conversion from equation (1) to equation (2)) simply by rearranging the time and modality dimensions (the permute operation common on multi-dimensional data), so the computation of this layer can likewise be optimized to the plane level. For simplicity, this embodiment denotes this layer f_T.
Step 3.4: the bilinear feature learning module is specifically defined as:
B(A) = (f_T ∘ f_M)^N(A), (4)
As can be seen from equation (4), the bilinear feature learning module of this embodiment is formed by stacking N bilinear layers, as shown in FIG. 3. To improve model robustness, L1 and L2 regularization are added to the bilinear feature learning module.
Step 4: analyze the recognition result, specifically:
the category corresponding to the maximum value in the activation vector obtained in step 3 is taken as the classification result of the action recognition.
Example two
As shown in FIG. 5, this embodiment provides an artificial intelligence system for bilinear-based multi-modal information processing, comprising a preprocessing module, a feature construction module, a bilinear feature learning module and a recognition module;
the preprocessing module is used for converting a video stream into image frames and dividing the image frames into action sequences;
the feature construction module constructs skeleton temporal features and RGB and depth temporal features from the action sequences, and constructs a three-dimensional feature cube;
the input of the bilinear feature learning module is the three-dimensional feature cube, and the output is an activation vector; the bilinear feature learning module is a stack of several modality pooling layers and temporal pooling layers;
the recognition module is used for taking the category corresponding to the maximum value in the activation vector as the classification result of the action recognition.
In particular, this embodiment further provides an implementation in which the artificial intelligence system for bilinear-based multi-modal information processing described in the second embodiment is applied to the field of motion/behavior monitoring, as shown in FIG. 6;
FIG. 6 shows a motion monitoring system based on a depth camera, in which a game produces different results according to the different motions made by the player. Because such a monitoring system has a high real-time requirement, the plane-wise operation of the bilinear layers of this embodiment reduces the computational cost and yields a better user experience. The specific steps are as follows (a code sketch of the loop follows these steps):
S1, a sensor or camera captures RGBD video of the action made by the user;
S2, the collected RGBD video is input into the artificial intelligence system for bilinear-based multi-modal information processing, which outputs an action classification vector;
S3, the action category is obtained from the magnitudes of the probability vector, and whether the action is one of the monitored actions is judged;
S4, the action monitoring system gives feedback according to the action category.
It should be noted that the system provided in the second embodiment is merely illustrated by the above division of functional modules; in practical applications, the above functions may be allocated to different functional modules as required, that is, the internal structure may be divided into different functional modules to perform all or part of the functions described above. The system applies the artificial intelligence method for bilinear-based multi-modal information processing described in the first embodiment.
As shown in FIG. 7, this embodiment further provides a storage medium storing a program; when the program is executed by a processor, the artificial intelligence method for bilinear-based multi-modal information processing is implemented, specifically:
S1, converting a video stream into image frames and dividing the image frames into action sequences;
S2, constructing skeleton temporal features, RGB temporal features and depth temporal features from the action sequences, and constructing a three-dimensional feature cube;
S3, inputting the three-dimensional feature cube into a bilinear feature learning module to obtain an activation vector; the bilinear feature learning module is a stack of several modality pooling layers and temporal pooling layers;
S4, taking the category corresponding to the maximum value in the activation vector as the classification result of the action recognition.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits with logic gates for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
The above examples are preferred embodiments of the present application, but the embodiments of the present application are not limited to them; any change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present application shall be an equivalent replacement and is included in the protection scope of the present application.

Claims (7)

1. A bilinear-based artificial intelligence method for multi-modal information processing, characterized by comprising the following steps:
converting a video stream into image frames and dividing the image frames into action sequences;
constructing skeleton temporal features, RGB temporal features and depth temporal features from the action sequences, and constructing a three-dimensional feature cube; the construction of the three-dimensional feature cube comprises the following steps:
splicing the skeleton temporal features and the RGB and depth temporal features to form a three-dimensional feature cube, denoted A, where A ∈ R^(M_A×T×C), M_A is the modality dimension, T is the temporal dimension, and C is the feature channel dimension;
inputting the three-dimensional feature cube into a bilinear feature learning module to obtain an activation vector; the bilinear feature learning module is a stack of several modality pooling layers and temporal pooling layers; the modality pooling layer is used for mining the information of different modalities, and its computation is:
L(:,:,c) = X^T A(:,:,c), c = 1, 2, …, C, i.e.:
L(m′,:,:) = Σ_{m=1}^{M_A} X(m, m′) A(m,:,:), m′ = 1, 2, …, M_L;
where the matrix X ∈ R^(M_A×M_L) is the weight matrix of the pooling layer, and M_A and M_L are modality dimensions;
the temporal pooling layer is used for mining temporal information, and its computation is:
Z(:,:,c) = L(:,:,c) Y, c = 1, 2, …, C;
where the matrix Y ∈ R^(T×T_Z) is the weight matrix of the pooling layer;
and taking the category corresponding to the maximum value in the activation vector as the classification result of the action recognition.
2. The bilinear-based multi-modal information processing method according to claim 1, wherein the action sequences are divided as follows:
the input image sequence is divided into D segments at equal intervals, and the sequence formed by the first d segments is denoted an action sequence of length d; this finally yields D action sequences, with lengths 1 through D.
3. The bilinear-based multi-modal information processing method according to claim 1, wherein the skeleton temporal feature is constructed as follows:
the action sequence is encoded with a dynamic skeleton encoder and input into a recurrent neural network to obtain the skeleton temporal feature.
4. The bilinear-based multi-modal information processing method according to claim 1, wherein the RGB and depth temporal features are constructed as follows:
constructing the action principal component (GIST) images of RGB and depth: for each RGBD image frame, collecting the local image patches near the skeleton joints and splicing them into an image representing the motion information, yielding a GIST image sequence;
constructing the RGB and depth temporal features: selecting K ordered action GIST images from the GIST image sequence and inputting them into a K-channel convolutional neural network to extract temporal features; the K images are selected as follows: the ⌊u·ls/K + δ⌋-th frame of the GIST image sequence is taken as the u-th of the K ordered action GIST images, where ls is the length of the input GIST sequence and δ is a perturbation factor, a random number drawn from the uniform distribution U(−ls/2K, ls/2K).
5. The bilinear-based multi-modal information processing method according to claim 1, wherein the bilinear feature learning module is specifically defined by the following formula:
B(A) = (f_T ∘ f_M)^N(A);
where f_T is the temporal pooling layer and f_M is the modality pooling layer.
6. An artificial intelligence system for bilinear-based multi-modal information processing, characterized in that it applies the bilinear-based artificial intelligence method for multi-modal information processing of any one of claims 1-5, and comprises a preprocessing module, a feature construction module, a bilinear feature learning module and a recognition module;
the preprocessing module is used for converting a video stream into image frames and dividing the image frames into action sequences;
the feature construction module constructs skeleton temporal features and RGB and depth temporal features from the action sequences, and constructs a three-dimensional feature cube;
the input of the bilinear feature learning module is the three-dimensional feature cube, and the output is an activation vector; the bilinear feature learning module is a stack of several modality pooling layers and temporal pooling layers;
the recognition module is used for taking the category corresponding to the maximum value in the activation vector as the classification result of the action recognition.
7. A storage medium storing a program which, when executed by a processor, implements the artificial intelligence method for bilinear-based multi-modal information processing of any one of claims 1-5.
CN202110340725.5A 2021-03-30 2021-03-30 Artificial intelligence method, system and medium for bilinear-based multi-modal information processing Active CN113033430B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110340725.5A CN113033430B (en) 2021-03-30 2021-03-30 Artificial intelligence method, system and medium for bilinear-based multi-modal information processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110340725.5A CN113033430B (en) 2021-03-30 2021-03-30 Artificial intelligence method, system and medium for bilinear-based multi-modal information processing

Publications (2)

Publication Number Publication Date
CN113033430A (en) 2021-06-25
CN113033430B (en) 2023-10-03

Family

ID=76453013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110340725.5A Active CN113033430B (en) 2021-03-30 2021-03-30 Artificial intelligence method, system and medium for multi-mode information processing based on bilinear

Country Status (1)

Country Link
CN (1) CN113033430B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108960337A (en) * 2018-07-18 2018-12-07 浙江大学 A kind of multi-modal complicated activity recognition method based on deep learning model
CN109460707A (en) * 2018-10-08 2019-03-12 华南理工大学 A kind of multi-modal action identification method based on deep neural network
CN110532861A (en) * 2019-07-18 2019-12-03 西安电子科技大学 Activity recognition method based on skeleton guidance multi-modal fusion neural network
CN111259804A (en) * 2020-01-16 2020-06-09 合肥工业大学 Multi-mode fusion sign language recognition system and method based on graph convolution


Also Published As

Publication number Publication date
CN113033430A (en) 2021-06-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant