CN113033430B - Artificial intelligence method, system and medium for bilinear-based multi-modal information processing - Google Patents

Artificial intelligence method, system and medium for bilinear-based multi-modal information processing

Info

Publication number
CN113033430B
Authority
CN
China
Prior art keywords
bilinear
action
time sequence
sequence
gist
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110340725.5A
Other languages
Chinese (zh)
Other versions
CN113033430A (en)
Inventor
胡建芳 (Hu Jianfang)
侯智聪 (Hou Zhicong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat-sen University
Original Assignee
Sun Yat-sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat-sen University
Priority to CN202110340725.5A
Publication of CN113033430A
Application granted
Publication of CN113033430B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Psychiatry (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Social Psychology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a bilinear-based artificial intelligence method, system and medium for multi-modal information processing. The method comprises the following steps: converting a video stream into image frames; dividing the frames into action sequences; constructing skeleton temporal features and RGB and depth temporal features; assembling them into a three-dimensional feature cube and inputting the cube into a bilinear feature learning module; and outputting the classification recognition result. By fusing the multi-modal information of an RGBD video in a deep network constructed in a bilinear manner, the application overcomes the shortcoming of existing multi-modal models, which merely splice or weight the features or activation vectors of different modalities without deeply mining the information shared among them, and thereby achieves accurate action recognition. The bilinear operation is a plane-wise computation with low computational cost, making it suitable for industrial applications with high real-time requirements.

Description

Artificial intelligence method, system and medium for bilinear-based multi-modal information processing
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to a bilinear-based artificial intelligence method, system and medium for multi-modal information processing.
Background
With the development of technology, RGBD imaging is becoming popular. Unlike a conventional RGB image, RGBD image information additionally includes the depth of each individual in the scene; similarly, the information carried by RGBD video differs from that of conventional RGB video.
In some emerging monitoring scenarios, RGBD cameras are widely used, for example in unmanned aerial vehicle shooting; recognizing the actions of individuals in these scenarios, e.g. to monitor for dangerous accidents, is often one of the main purposes of deploying such monitoring equipment. RGBD-based behavior recognition is therefore of considerable practical importance.
RGBD-based behavior recognition is an important branch of multi-modal information research in the field of artificial intelligence. RGBD image or video information comprises both the RGB information and the depth information of the image or video; besides shape and color, hierarchical information such as the distance from the camera can be obtained from RGBD data, and combining multiple modalities helps a model judge scene information better. In particular, video data of human behavior usually also contains human skeleton information, and adding skeleton information can further improve a model's ability to recognize human actions.
Existing multi-modal work processes multi-modal information in a relatively simple way, mainly falling into two types of methods: one extracts the features of different modalities with different convolutional neural networks (CNNs), splices them, and passes the result through a fully-connected classifier; the other obtains different activation vectors through complete networks and then weights the activation vectors of the different modalities (usually by averaging) to output the classification result.
This simplistic fusion of multi-modal information lacks deep mining of the information correlations among the modalities, so the comprehensive utilization of multi-modal information remains deficient. Deeply mining the information of each modality is a reasonable approach to overcoming this shortcoming.
Existing techniques for processing RGBD information based on multi-modal data fuse the various modalities only by simply splicing features or weighting the activation vectors of different modalities. For an RGBD information stream, however, a certain amount of common information exists among the modalities, and fusing information by such simple operations cannot deeply mine this commonality to better improve model performance.
As for the bilinear method presented herein, the traditional bilinear method is merely a conversion operation between certain features, and most bilinear operations are element-wise operations that are computationally expensive; this application further reduces the cost by converting them into plane-wise computations.
In addition, many techniques for human behavior recognition operate on the original input image or video, which introduces much extraneous background information and hampers the model in extracting effective action information.
Disclosure of Invention
The application aims to overcome the defect in the prior art that simply splicing or weighting different modality information fails to deeply mine the correlations of multi-modal information, and provides a bilinear-based artificial intelligence method, system and medium for multi-modal information processing.
In order to achieve the above purpose, the present application adopts the following technical scheme:
the application provides an artificial intelligence method for multi-mode information processing based on bilinear, which comprises the following steps:
converting the video stream into image frames and dividing the image frames into action sequences;
constructing skeleton time sequence characteristics, RGB time sequence characteristics and depth time sequence characteristics according to the action sequences, and constructing a three-dimensional characteristic cube;
inputting the three-dimensional feature cube into a bilinear feature learning module to obtain an activation vector; the bilinear feature learning module is a stack of a plurality of modal pooling layers and a time sequence pooling layer;
and taking the category corresponding to the maximum value in the activation vector as a classification result of the action recognition.
As a preferred technical solution, the action sequences are divided as follows:
the input image sequence is divided into D segments at equal intervals, and the sequence formed by the first d segments is denoted an action sequence of length d; this finally yields D action sequences, with lengths 1 through D.
As a preferred technical solution, the skeleton temporal feature is constructed as follows:
the action sequence is encoded with a dynamic skeleton encoder and input into a recurrent neural network to obtain the skeleton temporal feature.
As a preferred technical solution, the RGB and depth temporal features are constructed as follows:
constructing the action principal component (GIST) images of RGB and depth: for each RGBD image frame, the local image patches near the skeleton joints are collected and spliced into an image representing the motion information, yielding a GIST image sequence;
constructing the RGB and depth temporal features: K ordered action GIST images are selected from the GIST image sequence and input into a K-channel convolutional neural network to extract temporal features. The K images are selected as follows: the ⌊u·ls/K + δ⌋-th frame of the GIST image sequence is taken as the u-th of the K ordered action GIST images, where ls is the length of the input GIST sequence and δ is a perturbation factor, a random number drawn from the uniform distribution U(−ls/2K, ls/2K).
As a preferred technical solution, the three-dimensional feature cube is constructed as follows:
the skeleton temporal features and the RGB and depth temporal features are spliced to form a three-dimensional feature cube, denoted A, where A ∈ R^(M_A×T×C), M_A is the modality dimension, T is the temporal dimension, and C is the feature channel dimension.
As a preferred technical solution, the modality pooling layer is used for mining the information of different modalities, and its computation is:
L(:,:,c) = X^T A(:,:,c), c = 1, 2, …, C; equivalently, L(m′,:,:) = Σ_{m=1}^{M_A} X(m, m′) A(m,:,:), m′ = 1, …, M_L;
where the matrix X ∈ R^(M_A×M_L) is the weight matrix of the pooling layer, and M_A and M_L are the input and output modality dimensions.
As a preferred technical solution, the temporal pooling layer is used for mining temporal information, and its computation is:
Z(:,:,c) = L(:,:,c) Y, c = 1, 2, …, C;
where the matrix Y ∈ R^(T×T_Z) is the weight matrix of the pooling layer.
As a preferred technical solution, the bilinear feature learning module is specifically defined by the following formula:
B(A) = (f_T ∘ f_M)^N(A);
where f_T is the temporal pooling layer, f_M is the modality pooling layer, and the composed pooling pair is stacked N times.
The application also provides an artificial intelligence system for bilinear-based multi-modal information processing, comprising a preprocessing module, a feature construction module, a bilinear feature learning module and a recognition module;
the preprocessing module is used for converting a video stream into image frames and dividing the image frames into action sequences;
the feature construction module constructs skeleton temporal features and RGB and depth temporal features from the action sequences, and constructs a three-dimensional feature cube;
the input of the bilinear feature learning module is the three-dimensional feature cube, and the output is an activation vector; the bilinear feature learning module is a stack of several modality pooling layers and temporal pooling layers;
the recognition module is used for taking the category corresponding to the maximum value in the activation vector as the classification result of the action recognition.
The application also provides a storage medium storing a program which, when executed by a processor, implements the above artificial intelligence method for bilinear-based multi-modal information processing.
Compared with the prior art, the application has the following advantages and beneficial effects:
(1) The application fuses the multi-modal information of an RGBD video in a deep network constructed in a bilinear manner, overcoming the shortcoming of existing multi-modal models that merely splice or weight the features or activation vectors of different modalities without deeply mining the information shared among the modalities, and thereby achieves accurate action recognition;
(2) The bilinear operation is a plane-wise computation with low computational cost, making it suitable for industrial applications with high real-time requirements.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an artificial intelligence method for bi-linear based multi-modal information processing in accordance with an embodiment of the present application;
FIG. 2 is a schematic diagram of constructing the action GIST image according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a bilinear feature learning module according to an embodiment of the present application;
FIG. 4 is a schematic diagram illustrating the difference between element level operations and plane level operations according to an embodiment of the present application;
FIG. 5 is a schematic diagram of an artificial intelligence system based on bilinear multi-modal information processing according to an embodiment of the application;
FIG. 6 is a schematic diagram of an artificial intelligence system for bi-linear based multi-modal information processing applied to the field of motion/behavior monitoring in accordance with an embodiment of the present application;
fig. 7 is a schematic diagram of a storage medium according to an embodiment of the present application.
Detailed Description
In order to enable those skilled in the art to better understand the present application, the technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, embodiments of the application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
Example 1
As shown in FIG. 1, this embodiment provides an artificial intelligence method for bilinear-based multi-modal information processing, including the following steps:
step 1: converting the video stream into image frames;
more specifically, in this embodiment, a ffmpeg or other tool is used to convert the video stream into image frames.
Step 2: input the image frames obtained in step 1 into convolutional neural networks (CNNs) and the skeleton sequences corresponding to the video into a recurrent neural network (RNN), and construct a three-dimensional feature cube;
more specifically, the individual sub-steps are as follows:
step 2.1: constructing an action sequence (Action History Sequence, abbreviated as AHS hereinafter);
in order to facilitate the subsequent extraction of the time sequence features, the input image sequence is firstly divided into D segments at equal intervals, the sequence consisting of the first D image sequences is called as an AHS with the length D, and is marked as |ahs|=d, so that the AHSs with the lengths from 1 to D are obtained, and the total number of the AHSs is D.
Step 2.2: constructing a skeleton time sequence characteristic;
for the skeleton sequence, the present embodiment uses dynamic skeleton descriptor (DS encoder) to encode the AHS and inputs the AHS to the RNN, the output of the RNN as the skeleton timing feature in the present scheme.
Step 2.3: constructing an action principal component diagram (Action GIST image) of RGB and depth;
the process of constructing GIST image is shown in fig. 2, and for each RGBD image frame, collecting the local image blocks near the skeleton joint points, and splicing into images representing action information, namely GIST image sequences; the GIST image sequence is used for removing influence of irrelevant information such as background and the like on model modeling action information. In this embodiment, GIST images are taken at 64×64, while the change in the node of interest also better reflects the action timing information.
Step 2.4: constructing RGB and depth time sequence characteristics;
the GIST image sequence (RGB image sequence and depth image sequence) obtained in step 2.3 is input to the CNN of the K-channel to extract the timing characteristics, and for the CNN of the K-channel, K sequential action GIST images are selected from the GIST sequence as input.
The method for selecting K ordered action GIST images from the GIST image sequence specifically comprises the following steps: selecting the first in the GIST image sequenceThe frame is used as the (u) th frame in K ordered motion GIST images, where ls is the length of the input GIST sequence, delta is the disturbance factor, and is subject to uniform distribution +.>Is a random number of (a) in the memory.
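A sketch of this sampling rule; the clamping of the index to the valid range is an added safeguard, not something stated in the patent.

```python
# Pick K ordered frames: index = floor(u * ls / K + delta),
# with delta ~ U(-ls/(2K), ls/(2K)), for u = 1..K.
import math
import random

def sample_k_frames(gist_sequence: list, K: int) -> list:
    ls = len(gist_sequence)
    picked = []
    for u in range(1, K + 1):
        delta = random.uniform(-ls / (2 * K), ls / (2 * K))
        idx = int(math.floor(u * ls / K + delta))
        idx = min(max(idx, 1), ls)        # safeguard: keep index inside [1, ls]
        picked.append(gist_sequence[idx - 1])
    return picked
```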
More specifically, the present embodiment sets two sets of K for each of RGB and depth video sequences, respectively, with a network of k=1 for extracting appearance information and a network of k=16 for extracting dynamic information.
Step 2.5: construct the feature cube;
the skeleton temporal features from step 2.2 and the RGB and depth temporal features from step 2.4 are spliced to form a three-dimensional feature cube.
Step 3: input the three-dimensional feature cube obtained in step 2 into the bilinear feature learning module to obtain an activation vector; the bilinear feature learning module is formed by stacking several modality pooling layers and temporal pooling layers, as shown in FIG. 3. This step details the construction of the two pooling layers and how they are stacked;
step 3.1: the three-dimensional feature cube obtained in the step 2 is recorded as A, whereinM A Represents the modal dimension, T represents the time dimension, and C represents the characteristic channel dimension.
Step 3.2: the modality pooling layer is used for mining the information of the different modalities, and its computation is:
L(:,:,c) = X^T A(:,:,c), c = 1, 2, …, C, (1)
where the matrix X ∈ R^(M_A×M_L) is the weight matrix of the pooling layer, and M_A and M_L are the input and output modality dimensions. Clearly, through this operation, the modality dimension of the input A is pooled from M_A to M_L. For simplicity, this embodiment denotes this layer f_M.
In particular, it is easy to show that equation (1) can be equivalently written as:
L(m′,:,:) = Σ_{m=1}^{M_A} X(m, m′) A(m,:,:), m′ = 1, 2, …, M_L, (2)
From equation (2), elements belonging to the same modality are weighted by the same parameter, so the pooling operation is plane-wise; compared with the element-wise operation of a conventional linear layer, this saves a significant amount of computation, as shown in FIG. 4.
Step 3.3: the temporal pooling layer is used for mining temporal information, and its computation is:
Z(:,:,c) = L(:,:,c) Y, c = 1, 2, …, C, (3)
where the matrix Y ∈ R^(T×T_Z) is the weight matrix of the pooling layer.
In particular, the temporal pooling layer can be converted into the same form as the modality pooling layer (cf. the conversion from equation (1) to equation (2)) simply by rearranging the time and modality dimensions (the permute operation common on multi-dimensional data), so the computation of this layer can likewise be optimized to the plane level. For simplicity, this embodiment denotes this layer f_T.
Step 3.4: the bilinear feature learning module is specifically defined as:
B(A) = (f_T ∘ f_M)^N(A), (4)
As can be seen from equation (4), the bilinear feature learning module of this embodiment is formed by stacking N bilinear layers, as shown in FIG. 3. To improve model robustness, L1 and L2 regularization are added to the bilinear feature learning module.
Step 4: analyze the recognition result, specifically:
the category corresponding to the maximum value in the activation vector obtained in step 3 is taken as the classification result of the action recognition.
Example two
As shown in FIG. 5, this embodiment provides an artificial intelligence system for bilinear-based multi-modal information processing, comprising a preprocessing module, a feature construction module, a bilinear feature learning module and a recognition module;
the preprocessing module is used for converting a video stream into image frames and dividing the image frames into action sequences;
the feature construction module constructs skeleton temporal features and RGB and depth temporal features from the action sequences, and constructs a three-dimensional feature cube;
the input of the bilinear feature learning module is the three-dimensional feature cube, and the output is an activation vector; the bilinear feature learning module is a stack of several modality pooling layers and temporal pooling layers;
the recognition module is used for taking the category corresponding to the maximum value in the activation vector as the classification result of the action recognition.
In particular, this embodiment further provides an implementation in which the artificial intelligence system for bilinear-based multi-modal information processing described in the second embodiment is applied to the field of motion/behavior monitoring, as shown in FIG. 6;
FIG. 6 shows a motion monitoring system based on a depth camera, in which a game produces different results according to the different motions made by the player. Because such a monitoring system has a high real-time requirement, the plane-wise operation of the bilinear layers of this embodiment reduces the computational cost and yields a better user experience. The specific steps are as follows (a code sketch of the loop follows these steps):
S1, a sensor or camera captures RGBD video of the action made by the user;
S2, the collected RGBD video is input into the artificial intelligence system for bilinear-based multi-modal information processing, which outputs an action classification vector;
S3, the action category is obtained from the magnitudes of the probability vector, and whether the action is one of the monitored actions is judged;
S4, the action monitoring system gives feedback according to the action category.
It should be noted that the system provided in the second embodiment is merely illustrated by the above division of functional modules; in practical applications, the above functions may be allocated to different functional modules as required, that is, the internal structure may be divided into different functional modules to perform all or part of the functions described above. The system applies the artificial intelligence method for bilinear-based multi-modal information processing described in the first embodiment.
As shown in FIG. 7, this embodiment further provides a storage medium storing a program; when the program is executed by a processor, the artificial intelligence method for bilinear-based multi-modal information processing is implemented, specifically:
S1, converting a video stream into image frames and dividing the image frames into action sequences;
S2, constructing skeleton temporal features, RGB temporal features and depth temporal features from the action sequences, and constructing a three-dimensional feature cube;
S3, inputting the three-dimensional feature cube into a bilinear feature learning module to obtain an activation vector; the bilinear feature learning module is a stack of several modality pooling layers and temporal pooling layers;
S4, taking the category corresponding to the maximum value in the activation vector as the classification result of the action recognition.
It is to be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented by software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one or a combination of the following techniques well known in the art: discrete logic circuits with logic gates for implementing logic functions on data signals, application-specific integrated circuits with suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
The above examples are preferred embodiments of the present application, but the embodiments of the present application are not limited to them; any change, modification, substitution, combination or simplification made without departing from the spirit and principle of the present application shall be an equivalent replacement and is included in the protection scope of the present application.

Claims (7)

1. A bilinear-based artificial intelligence method for multi-modal information processing, characterized by comprising the following steps:
converting a video stream into image frames and dividing the image frames into action sequences;
constructing skeleton temporal features, RGB temporal features and depth temporal features from the action sequences, and constructing a three-dimensional feature cube; the construction of the three-dimensional feature cube comprises the following steps:
splicing the skeleton temporal features and the RGB and depth temporal features to form a three-dimensional feature cube, denoted A, where A ∈ R^(M_A×T×C), M_A is the modality dimension, T is the temporal dimension, and C is the feature channel dimension;
inputting the three-dimensional feature cube into a bilinear feature learning module to obtain an activation vector; the bilinear feature learning module is a stack of several modality pooling layers and temporal pooling layers; the modality pooling layer is used for mining the information of different modalities, and its computation is:
L(:,:,c) = X^T A(:,:,c), c = 1, 2, …, C, i.e.:
L(m′,:,:) = Σ_{m=1}^{M_A} X(m, m′) A(m,:,:), m′ = 1, 2, …, M_L;
where the matrix X ∈ R^(M_A×M_L) is the weight matrix of the pooling layer, and M_A and M_L are modality dimensions;
the temporal pooling layer is used for mining temporal information, and its computation is:
Z(:,:,c) = L(:,:,c) Y, c = 1, 2, …, C;
where the matrix Y ∈ R^(T×T_Z) is the weight matrix of the pooling layer;
and taking the category corresponding to the maximum value in the activation vector as the classification result of the action recognition.
2. The bilinear-based multi-modal information processing method according to claim 1, wherein the action sequences are divided as follows:
the input image sequence is divided into D segments at equal intervals, and the sequence formed by the first d segments is denoted an action sequence of length d; this finally yields D action sequences, with lengths 1 through D.
3. The bilinear-based multi-modal information processing method according to claim 1, wherein the skeleton temporal feature is constructed as follows:
the action sequence is encoded with a dynamic skeleton encoder and input into a recurrent neural network to obtain the skeleton temporal feature.
4. The bilinear-based multi-modal information processing method according to claim 1, wherein the RGB and depth temporal features are constructed as follows:
constructing the action principal component (GIST) images of RGB and depth: for each RGBD image frame, collecting the local image patches near the skeleton joints and splicing them into an image representing the motion information, yielding a GIST image sequence;
constructing the RGB and depth temporal features: selecting K ordered action GIST images from the GIST image sequence and inputting them into a K-channel convolutional neural network to extract temporal features; the K images are selected as follows: the ⌊u·ls/K + δ⌋-th frame of the GIST image sequence is taken as the u-th of the K ordered action GIST images, where ls is the length of the input GIST sequence and δ is a perturbation factor, a random number drawn from the uniform distribution U(−ls/2K, ls/2K).
5. The bilinear-based multi-modal information processing method according to claim 1, wherein the bilinear feature learning module is specifically defined by the following formula:
B(A) = (f_T ∘ f_M)^N(A);
where f_T is the temporal pooling layer and f_M is the modality pooling layer.
6. An artificial intelligence system for bilinear-based multi-modal information processing, characterized in that it applies the bilinear-based artificial intelligence method for multi-modal information processing of any one of claims 1-5, and comprises a preprocessing module, a feature construction module, a bilinear feature learning module and a recognition module;
the preprocessing module is used for converting a video stream into image frames and dividing the image frames into action sequences;
the feature construction module constructs skeleton temporal features and RGB and depth temporal features from the action sequences, and constructs a three-dimensional feature cube;
the input of the bilinear feature learning module is the three-dimensional feature cube, and the output is an activation vector; the bilinear feature learning module is a stack of several modality pooling layers and temporal pooling layers;
the recognition module is used for taking the category corresponding to the maximum value in the activation vector as the classification result of the action recognition.
7. A storage medium storing a program which, when executed by a processor, implements the artificial intelligence method for bilinear-based multi-modal information processing of any one of claims 1-5.
CN202110340725.5A 2021-03-30 2021-03-30 Artificial intelligence method, system and medium for bilinear-based multi-modal information processing Active CN113033430B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110340725.5A CN113033430B (en) 2021-03-30 2021-03-30 Artificial intelligence method, system and medium for bilinear-based multi-modal information processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110340725.5A CN113033430B (en) 2021-03-30 2021-03-30 Artificial intelligence method, system and medium for bilinear-based multi-modal information processing

Publications (2)

Publication Number Publication Date
CN113033430A (en) 2021-06-25
CN113033430B (en) 2023-10-03

Family

ID=76453013

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110340725.5A Active CN113033430B (en) 2021-03-30 2021-03-30 Artificial intelligence method, system and medium for multi-mode information processing based on bilinear

Country Status (1)

Country Link
CN (1) CN113033430B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108960337A (en) * 2018-07-18 2018-12-07 浙江大学 A kind of multi-modal complicated activity recognition method based on deep learning model
CN109460707A (en) * 2018-10-08 2019-03-12 华南理工大学 A kind of multi-modal action identification method based on deep neural network
CN110532861A (en) * 2019-07-18 2019-12-03 西安电子科技大学 Activity recognition method based on skeleton guidance multi-modal fusion neural network
CN111259804A (en) * 2020-01-16 2020-06-09 合肥工业大学 Multi-mode fusion sign language recognition system and method based on graph convolution


Also Published As

Publication number Publication date
CN113033430A (en) 2021-06-25


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant