CN115937975A - Action recognition method and system based on multi-modal sequence fusion - Google Patents

Action recognition method and system based on multi-modal sequence fusion

Info

Publication number
CN115937975A
Authority
CN
China
Prior art keywords
action
video
features
feature
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211568552.3A
Other languages
Chinese (zh)
Inventor
曾国坤
刘予川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology Shenzhen
Original Assignee
Harbin Institute of Technology Shenzhen
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology Shenzhen
Priority to CN202211568552.3A
Publication of CN115937975A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an action recognition method based on multi-modal sequence fusion. The method obtains a human action video to be recognized and annotates the actions in the video to obtain video frames; it acquires the spatial positions corresponding to the human actions, detects the spatial candidate-box coordinates and action category of each frame, uses the correlation between consecutive frames to perform temporal action detection and locate the time period in which an action occurs, and preprocesses the video frames to obtain a data set. A convolutional neural network and a long short-term memory network extract features from the data set to obtain action features and the corresponding temporal features, a network model is constructed from the action features and the temporal features, and multiple modalities of information are input into the network model for feature fusion and classification to complete human action recognition. Recognizing human actions with multiple modalities and fusing them enhances the accuracy and robustness of the model in real scenes and improves the stability and accuracy of recognition.

Description

An action recognition method and system based on multimodal sequence fusion

Technical Field

The present invention belongs to the technical field of action recognition, and in particular relates to an action recognition method and system based on multimodal sequence fusion.

Background Art

Human behavior recognition has long been a hot topic in the field of human-computer interaction. It has broad application prospects and can bring good economic benefits, and many practical scenarios depend on it, such as recognizing dangerous activities in video surveillance systems or sensing human behavior in automatic navigation systems to enable safe operation. Because daily human activities are complex and diverse, small changes in movement may produce completely different behaviors, and behaviors change with the environment. Although action recognition has been widely applied in many areas of society, many problems in this field remain to be solved in the real world, such as viewpoint changes and differences in action scale. At the same time, how to quickly and effectively capture the intrinsic relationships within multimodal action information and model them efficiently is also a challenging problem.

Summary of the Invention

In view of this, the present invention provides an action recognition method and system based on multimodal sequence fusion that can improve the accuracy of action recognition, realize multimodal data fusion and effectively control the number of network parameters, so as to solve the above technical problems. The following technical solutions are specifically adopted.

In a first aspect, the present invention provides an action recognition method based on multimodal sequence fusion, comprising the following steps:

obtaining a human action video to be recognized, and annotating the actions in the action video to obtain video frames, wherein the action annotation includes semantic segmentation and timeline segmentation labels of the actions;

obtaining the spatial positions corresponding to the human actions and detecting the spatial candidate-box coordinates and action category of each frame, using the correlation between consecutive frames to perform temporal action detection and locate the time period in which an action occurs, connecting the detection results of each frame to form the spatiotemporal tube of the action, and preprocessing the video frames to obtain a data set corresponding to the human actions;

using a convolutional neural network and a long short-term memory network to extract features from the data set to obtain behavioral features and the temporal features corresponding to the behavioral features;

constructing a network model according to the behavioral features and the temporal features, and inputting multiple modalities of information into the network model for feature fusion and classification to complete human action recognition.

As a further improvement of the above technical solution, inputting multiple modalities of information into the network model for feature fusion and classification to complete human action recognition includes:

treating the interaction information between the multimodal data as the common features contained in the multiple modalities, and using a tensor fusion algorithm to take the outer product of the feature vectors corresponding to the different modalities, obtaining an information-association tensor over the multimodal feature elements;

When computing the outer product, a constant 1 is appended to each feature vector so that the unimodal input features are preserved in the network model. With the three modal feature vectors denoted a, g and v and ⊗ denoting the outer product, the fused tensor is [a; 1] ⊗ [g; 1] ⊗ [v; 1];
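As an illustration of the outer-product fusion just described, the following minimal Python sketch (not taken from the patent) builds the information-association tensor from three modality vectors with a constant 1 appended to each; the vector names a, g and v follow the text, while the dimensions are arbitrary.

```python
import numpy as np

def tensor_fusion(a, g, v):
    """Outer-product fusion of three modality feature vectors.

    A constant 1 is appended to each vector so that unimodal and bimodal
    interaction terms survive in the fused tensor (dimensions are illustrative).
    """
    a1 = np.concatenate([a, [1.0]])  # (da + 1,)
    g1 = np.concatenate([g, [1.0]])  # (dg + 1,)
    v1 = np.concatenate([v, [1.0]])  # (dv + 1,)
    # Three-way outer product -> information-association tensor of shape (da+1, dg+1, dv+1)
    return np.einsum('i,j,k->ijk', a1, g1, v1)

z = tensor_fusion(np.random.randn(8), np.random.randn(6), np.random.randn(10))
print(z.shape)  # (9, 7, 11)
```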

Tucker decomposition is used to compress the original tensor, expressing the weight tensor as the product of four orthogonal matrices and a core tensor: the fourth-order weight tensor τ is decomposed as τ = (((τ_c ×_1 W_a) ×_2 W_g) ×_3 W_v) ×_4 W_0. The decomposed core tensor τ_c models the interaction of the three modalities; its dimensions constrain the number of parameters and control the complexity of the whole fusion while preserving the mapping from all feature vectors to the fused feature. The resulting trilinear model computes the fused feature as z = ((τ_c ×_1 (W_a a)) ×_2 (W_g g)) ×_3 (W_v v), where W_a, W_g and W_v project the respective feature vectors into their own low-dimensional spaces of sizes t_a, t_g and t_v; the larger t_a, t_g and t_v are, the more parameters must be trained and the higher the model complexity, and finally the dimensionality of the fused feature vector is controlled through W_0.

As a further improvement of the above technical solution, using a tensor fusion algorithm to take the outer product of the feature vectors corresponding to multiple different modal data includes:

after the n-modal fused feature tensor is linearly mapped, an (n+1)-dimensional weight tensor is obtained; applying Tucker decomposition to it yields n+1 orthogonal mapping matrices and a core tensor. A rank constraint is then introduced for further decomposition, in which the core tensor is expressed as the Hamiltonian product of a series of tensors, with one rank-decomposition factor for each of the n modalities.
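The following Python sketch illustrates the general idea of a rank-constrained factorized fusion; the exact factorization in the patent is given only as image formulas, so the per-modality factors, the rank value and the element-wise combination used here are assumptions rather than the patented construction.

```python
import numpy as np

def low_rank_fusion(feats, factors):
    """Rank-constrained multimodal fusion sketch.

    `feats` are modality vectors already padded with a trailing 1, and
    `factors[i]` has shape (rank, d_i + 1, d_out).  Instead of materializing the
    full (n+1)-dimensional weight tensor, each rank slice is applied per modality,
    the per-modality results are combined element-wise, and the slices are summed.
    """
    rank, _, d_out = factors[0].shape
    fused = np.zeros(d_out)
    for r in range(rank):
        term = np.ones(d_out)
        for x, w in zip(feats, factors):
            term *= x @ w[r]  # element-wise combination across modalities
        fused += term
    return fused

rng = np.random.default_rng(0)
dims, rank, d_out = [8, 6, 10], 4, 16
feats = [np.concatenate([rng.standard_normal(d), [1.0]]) for d in dims]
factors = [rng.standard_normal((rank, d + 1, d_out)) for d in dims]
print(low_rank_fusion(feats, factors).shape)  # (16,)
```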

As a further improvement of the above technical solution, constructing a network model according to the behavioral features and the temporal features includes:

After multimodal feature fusion, a multimodal semantic feature representation is obtained and fed into a neural network layer to produce the final cross-modal semantic feature representation h_i. At each time point i, a series of multi-scale temporal candidate segments is preset, where the j-th candidate segment at time point i has preset start and end time boundaries, W_j denotes the preset temporal width of the j-th segment, and the total number of candidate segments is fixed in advance;

The confidence score of each candidate segment is evaluated with a sigmoid activation function σ as cs_i = σ(Conv1d(h_i)), where cs_i contains the scores of the candidate segments at time point i and each score represents the similarity between the video segment and the text description. At the same time, a predicted temporal boundary offset is computed for each candidate segment; the predicted start and end offsets at time point i, applied to the preset boundaries, give the final predicted segment j at time point i.
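A hedged PyTorch sketch of the candidate-segment scoring described above is given below; the widths, channel sizes and convolution kernel size are illustrative assumptions, and only the sigmoid confidence head cs_i = σ(Conv1d(h_i)) and an offset head follow the text.

```python
import torch
import torch.nn as nn

class SegmentProposer(nn.Module):
    """Scores multi-scale candidate segments at every time point.

    `widths` plays the role of the preset segment widths W_j; a 1-D convolution
    over the cross-modal features h yields a sigmoid confidence and two boundary
    offsets per candidate width at each time point (all sizes are illustrative).
    """
    def __init__(self, feat_dim=256, widths=(4, 8, 16, 32)):
        super().__init__()
        self.widths = widths
        k = len(widths)
        self.score_conv = nn.Conv1d(feat_dim, k, kernel_size=3, padding=1)
        self.offset_conv = nn.Conv1d(feat_dim, 2 * k, kernel_size=3, padding=1)

    def forward(self, h):                       # h: (batch, feat_dim, T)
        cs = torch.sigmoid(self.score_conv(h))  # (batch, k, T) confidence scores
        offsets = self.offset_conv(h)           # (batch, 2k, T) start/end offsets
        return cs, offsets

h = torch.randn(1, 256, 64)                     # 64 time points of fused features
cs, offsets = SegmentProposer()(h)
print(cs.shape, offsets.shape)                  # (1, 4, 64) and (1, 8, 64)
```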

As a further improvement of the above technical solution, the loss function consists of a loss term for the matching score between video segments and the text description and a loss term for the temporal boundary offsets. For each candidate temporal segment, the temporal intersection over union (IoU) with the target segment (s, e) is computed; if the IoU is smaller than a preset threshold λ it is set to 0, and if it is larger than λ the candidate segment is taken as a positive sample, otherwise as a negative sample. The matching loss is computed over the positive and negative candidate segments, where N_pos denotes the number of positive samples among the candidate temporal segments and N_neg denotes the number of negative samples;

A boundary regression strategy is used to adjust the temporal localization offsets: the IoU between each candidate segment and the target segment is computed, the set C_h of candidate temporal segments whose IoU exceeds a set threshold γ is selected, and the temporal boundary offsets of these candidate segments are computed, where (s, e) denotes the start and end time points of the given text description and the start and end time points of the candidate temporal video segments in C_h are used as references;

With δ = [δ_s, δ_e] denoting the ground-truth temporal localization offset and a corresponding predicted offset, the temporal boundaries of the current candidate segments are adaptively adjusted based on the ground-truth offset through a regression loss, where SL_1 denotes the L1 norm and N denotes the size of the set C_h.
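The following PyTorch sketch shows one plausible reading of the two loss terms; binary cross-entropy for the matching loss and the definition of the offsets as target-minus-candidate boundaries are assumptions, while the IoU thresholds λ and γ and the smooth-L1 regression over the set C_h follow the text.

```python
import torch
import torch.nn.functional as F

def temporal_iou(cands, target):
    """Temporal IoU between candidate segments `cands` (N, 2) and one target (s, e)."""
    inter = (torch.min(cands[:, 1], target[1]) - torch.max(cands[:, 0], target[0])).clamp(min=0)
    union = (cands[:, 1] - cands[:, 0]) + (target[1] - target[0]) - inter
    return inter / union.clamp(min=1e-6)

def matching_and_regression_loss(scores, pred_offsets, cands, target, lam=0.3, gamma=0.5):
    """Matching loss over positive/negative candidates plus boundary regression.

    Candidates with IoU > lam are positives (binary cross-entropy assumed for the
    matching term); candidates with IoU > gamma form the set C_h and receive a
    smooth-L1 loss on their start/end offsets (offsets defined as target - candidate).
    """
    iou = temporal_iou(cands, target)
    labels = (iou > lam).float()
    match_loss = F.binary_cross_entropy(scores, labels)

    keep = iou > gamma                                     # candidate set C_h
    if keep.any():
        true_offsets = target.unsqueeze(0) - cands[keep]   # [delta_s, delta_e]
        reg_loss = F.smooth_l1_loss(pred_offsets[keep], true_offsets)
    else:
        reg_loss = scores.new_zeros(())
    return match_loss + reg_loss

cands = torch.tensor([[1.0, 4.0], [2.0, 6.0], [0.0, 8.0]])
scores = torch.tensor([0.2, 0.7, 0.4])
print(matching_and_regression_loss(scores, torch.zeros(3, 2), cands, torch.tensor([2.0, 5.0])))
```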

As a further improvement of the above technical solution, using a convolutional neural network and a long short-term memory network to extract features from the data set to obtain behavioral features and the temporal features corresponding to the behavioral features includes:

The raw inertial sensor data are treated as an image of size time × channels. A long short-term memory network (LSTM) and a convolutional neural network (CNN) are used: a one-dimensional convolution captures the temporal signal structure within the convolution kernel window, and the CNN extracts the key behavioral features of the inertial sensor signal itself. The one-dimensional convolution is computed as y_{i_0} = f(Σ_{d_0=1}^{D} Σ_{n=1}^{N} w_n^{d_0} x_{i_0+n-1}^{d_0}), where N denotes the kernel length, D denotes the depth of the sensor data and of the kernel, w_n^{d_0} denotes the n-th weight at kernel depth d_0, x_{i_0}^{d_0} denotes the i_0-th element of the sensor signal at depth d_0, y_{i_0} denotes the i_0-th feature obtained from the sensor signal by the convolution, and f(·) denotes the activation function;

The size of the features obtained after pooling is (L_{i_0} + 2P − N)/S + 1, where L_{i_0} denotes the feature length of the current layer i_0, P denotes the padding size and S denotes the stride. Three convolution and pooling operations generate temporally low-dimensional high-level features, and the CNN-processed behavioral features are then fed, in chronological order, into the long short-term memory network. The long short-term memory network contains two LSTM layers; each LSTM layer uses unidirectional connections with 128 hidden units, converting the behavioral features obtained in the preceding part into 128-dimensional temporal features and dynamically modeling the temporal information.
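A minimal PyTorch sketch of the Conv1D-plus-LSTM extractor described above is shown below; the two unidirectional LSTM layers with 128 hidden units and the three convolution-pooling stages follow the text, while kernel sizes, channel counts and the six-channel inertial input are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CNNLSTMExtractor(nn.Module):
    """Conv1D + two-layer LSTM feature extractor for inertial data.

    The input is treated as (channels x time); three Conv1d + MaxPool1d stages
    produce low-dimensional high-level features, and a unidirectional two-layer
    LSTM with 128 hidden units turns them into 128-d temporal features.
    Kernel sizes, channel counts and the 6-channel input are illustrative.
    """
    def __init__(self, in_channels=6):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(input_size=128, hidden_size=128, num_layers=2, batch_first=True)

    def forward(self, x):               # x: (batch, channels, time)
        feats = self.cnn(x)             # (batch, 128, time / 8)
        feats = feats.transpose(1, 2)   # LSTM expects (batch, time, features)
        out, _ = self.lstm(feats)       # (batch, time / 8, 128) temporal features
        return out

print(CNNLSTMExtractor()(torch.randn(2, 6, 128)).shape)  # torch.Size([2, 16, 128])
```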

As a further improvement of the above technical solution, connecting the detection results of each frame to form the spatiotemporal tube of the action and preprocessing the video frames to obtain the data set corresponding to the human actions includes:

The unsegmented original video is represented as X = {x_n}_{n=1}^{w}, where x_n denotes the n-th frame of video X and w denotes the number of frames in X. All actions contained in the video can be represented by a set of instances Ψ_g = {(t_{s,i}, t_{e,i})}_{i=1}^{N_g}, where N_g denotes the number of ground-truth action instances in video X, t_{s,i} and t_{e,i} denote the start and end points of the i-th action instance, and ψ_g denotes an action instance;

Serialized features are constructed from the input video and text; multi-scale candidate temporal video segments are generated at each time point of the video sequence features according to preset temporal lengths, and a temporal co-attention interaction network performs feature interaction and fusion between the candidate temporal segments and the text sequence features, yielding multimodal fused data embedded in the same feature space to obtain the corresponding data set.

As a further improvement of the above technical solution, obtaining the spatial positions corresponding to the human actions, detecting the spatial candidate-box coordinates and action category of each frame, and using the correlation between consecutive frames to perform temporal action detection includes:

Given an unsegmented video sequence V, the video signal is first sampled at equal intervals at a fixed frame rate to obtain an image frame sequence, which is divided into M non-overlapping video unit segments {v_1, ..., v_j, ..., v_M} of equal length. A positional encoding function adds extra temporal position information to the video images to form the video unit features, where d and d_v denote the dimension of the encoded video features and of the extracted video unit features, respectively;
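The sketch below illustrates splitting a frame sequence into M equal units and adding temporal position information; the patent does not specify the positional encoding function or how per-unit features are pooled, so the sinusoidal encoding and mean pooling used here are assumptions.

```python
import numpy as np

def video_units_with_position(frame_feats, M):
    """Split per-frame features into M units and add temporal position information.

    `frame_feats` has shape (num_frames, d).  Mean pooling per unit and a standard
    sinusoidal positional encoding are assumptions; the patent only states that a
    positional encoding function adds temporal position information.
    """
    d = frame_feats.shape[1]
    units = np.array_split(frame_feats, M)                   # M non-overlapping units
    unit_feats = np.stack([u.mean(axis=0) for u in units])   # (M, d) pooled unit features
    pos = np.arange(M)[:, None] / np.power(10000.0, np.arange(d)[None, :] / d)
    pe = np.where(np.arange(d) % 2 == 0, np.sin(pos), np.cos(pos))
    return unit_feats + pe                                    # (M, d) position-aware features

print(video_units_with_position(np.random.randn(120, 64), M=10).shape)  # (10, 64)
```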

A unit-level co-attention interaction layer is used to build the interaction information between video and text. With the feature representations of the input video and text description denoted V_in and S_in, linear mappings transform the d-dimensional feature vectors into query vectors (Q_f, Q_s), key vectors (K_f, K_s) and value vectors (V_f, V_s), where W_sk, W_fk, W_sq, W_fq, W_sv and W_fv denote learnable weight matrices. The Q_f features from the video modality are used as queries, the K_s and V_s features from the text modality are used as keys and values, and the similarity weight matrix between them is computed to obtain the weighted video features; the contextual information of the text and video features is incorporated into the feature vector at the current position to obtain the corresponding temporal features.
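A hedged PyTorch sketch of this unit-level co-attention is given below; scaled dot-product attention and the residual addition of the attended text values back onto the video units are assumptions, while the roles of Q_f, K_s and V_s follow the text.

```python
import torch
import torch.nn as nn

class UnitCoAttention(nn.Module):
    """Unit-level co-attention between video units and text tokens.

    Video features supply the queries Q_f, text features supply the keys K_s and
    values V_s; scaled dot-product similarity weights the text values and the
    result is added back to the video stream (dimensions are illustrative).
    """
    def __init__(self, d=256):
        super().__init__()
        self.w_fq = nn.Linear(d, d)  # W_fq
        self.w_sk = nn.Linear(d, d)  # W_sk
        self.w_sv = nn.Linear(d, d)  # W_sv
        self.d = d

    def forward(self, v_in, s_in):   # v_in: (B, M, d) video units, s_in: (B, L, d) text
        q_f, k_s, v_s = self.w_fq(v_in), self.w_sk(s_in), self.w_sv(s_in)
        attn = torch.softmax(q_f @ k_s.transpose(1, 2) / self.d ** 0.5, dim=-1)  # (B, M, L)
        return v_in + attn @ v_s     # text context fused into each video unit

print(UnitCoAttention()(torch.randn(1, 20, 256), torch.randn(1, 12, 256)).shape)  # (1, 20, 256)
```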

As a further improvement of the above technical solution, obtaining the human action video to be recognized and annotating the actions in the action video to obtain video frames includes:

Inertial sensors are used to capture the motion characteristics of body parts. Each time an unannotated video is read in, the camera view is switched for annotation; after an action is annotated, screenshots at the start and end of the action are used to check whether the annotation is correct, and if any part is labeled incorrectly, the camera is fine-tuned or the segment is re-annotated.

In a second aspect, the present invention provides an action recognition system based on multimodal sequence fusion, comprising:

a first acquisition module, configured to obtain a human action video to be recognized and annotate the actions in the action video to obtain video frames, wherein the action annotation includes semantic segmentation and timeline segmentation labels of the actions;

a second acquisition module, configured to obtain the spatial positions corresponding to the human actions, detect the spatial candidate-box coordinates and action category of each frame, use the correlation between consecutive frames to perform temporal action detection and locate the time period in which an action occurs, connect the detection results of each frame to form the spatiotemporal tube of the action, and preprocess the video frames to obtain a data set corresponding to the human actions;

a feature extraction module, configured to use a convolutional neural network and a long short-term memory network to extract features from the data set to obtain behavioral features and the temporal features corresponding to the behavioral features;

a recognition module, configured to construct a network model according to the behavioral features and the temporal features, and to input multiple modalities of information into the network model for feature fusion and classification to complete human action recognition.

The present invention provides an action recognition method and system based on multimodal sequence fusion. A human action video to be recognized is obtained and the actions in the video are annotated to obtain video frames; the spatial positions corresponding to the human actions are obtained and the spatial candidate-box coordinates and action category of each frame are detected; the correlation between consecutive frames is used to perform temporal action detection and locate the time period in which an action occurs; the detection results of each frame are connected to form the spatiotemporal tube of the action, and the video frames are preprocessed to obtain a data set corresponding to the human actions; a convolutional neural network and a long short-term memory network extract features from the data set to obtain behavioral features and the corresponding temporal features; a network model is constructed from the behavioral features and temporal features, and multiple modalities of information are input into the network model for feature fusion and classification to complete human action recognition. Recognizing human behavior with multiple modalities and fusing them enhances the accuracy and robustness of the model in real scenes and effectively reduces the semantic loss that occurs when the feature vectors are manipulated in later stages; the products formed inside the fused feature tensor fully exploit the complementary information of the different modalities, thereby improving the stability and accuracy of recognition.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required by the embodiments are briefly introduced below. It should be understood that the following drawings show only certain embodiments of the present invention and should therefore not be regarded as limiting its scope. For a person of ordinary skill in the art, other related drawings can be obtained from these drawings without creative effort.

FIG. 1 is a flow chart of the action recognition method based on multimodal sequence fusion provided by the present invention;

FIG. 2 is a structural block diagram of the action recognition system based on multimodal sequence fusion provided by the present invention.

DETAILED DESCRIPTION

Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are only intended to explain the present invention and cannot be understood as limiting it.

Referring to FIG. 1, the present invention provides an action recognition method based on multimodal sequence fusion, comprising the following steps:

S1: obtaining a human action video to be recognized, and annotating the actions in the action video to obtain video frames, wherein the action annotation includes semantic segmentation and timeline segmentation labels of the actions;

S2: obtaining the spatial positions corresponding to the human actions and detecting the spatial candidate-box coordinates and action category of each frame, using the correlation between consecutive frames to perform temporal action detection and locate the time period in which an action occurs, connecting the detection results of each frame to form the spatiotemporal tube of the action, and preprocessing the video frames to obtain a data set corresponding to the human actions;

S3: using a convolutional neural network and a long short-term memory network to extract features from the data set to obtain behavioral features and the temporal features corresponding to the behavioral features;

S4: constructing a network model according to the behavioral features and the temporal features, and inputting multiple modalities of information into the network model for feature fusion and classification to complete human action recognition.

In this embodiment, inputting multiple modalities of information into the network model for feature fusion and classification to complete human action recognition includes: treating the interaction information between the multimodal data as the common features contained in the multiple modalities, and using a tensor fusion algorithm to take the outer product of the feature vectors corresponding to the different modalities to obtain an information-association tensor over the multimodal feature elements. When computing the outer product, a constant 1 is appended to each feature vector so that the unimodal input features are preserved in the network model; with the three modal feature vectors a, g and v and ⊗ denoting the outer product, the fused tensor is [a; 1] ⊗ [g; 1] ⊗ [v; 1]. Tucker decomposition is then used to compress the original tensor, expressing the weight tensor as the product of four orthogonal matrices and a core tensor: the fourth-order weight tensor τ is decomposed as τ = (((τ_c ×_1 W_a) ×_2 W_g) ×_3 W_v) ×_4 W_0. The decomposed core tensor τ_c models the interaction of the three modalities; its dimensions constrain the number of parameters and control the complexity of the whole fusion while preserving the mapping from all feature vectors to the fused feature. The resulting trilinear model projects each feature vector into its own low-dimensional space through W_a, W_g and W_v of sizes t_a, t_g and t_v; the larger t_a, t_g and t_v are, the more parameters must be trained and the higher the model complexity, and finally the dimensionality of the fused feature vector is controlled through W_0.

It should be noted that obtaining the human action video to be recognized and annotating the actions in the video to obtain video frames includes: using inertial sensors to capture the motion characteristics of body parts; each time an unannotated video is read in, the camera view is switched for annotation, and after an action is annotated, screenshots at the start and end of the action are used to check whether the annotation is correct; if any part is labeled incorrectly, the camera is fine-tuned or the segment is re-annotated. Action recognition includes offline action recognition and online action recognition. Offline action recognition determines the category of the human actions occurring in a video after the whole video sequence has been observed; this task assigns an action category label to each video according to a preset action list. Online action recognition is oriented toward practical scenarios and requires real-time processing of online video streams. Action detection includes temporal action detection and spatiotemporal action detection. For an untrimmed video sequence, the temporal action detection task is to locate the start and end time points of the target action and the corresponding action category; spatiotemporal action detection additionally needs to predict the spatial position where the action occurs. Temporal action detection includes generating candidate action segments with precise temporal boundaries and assigning action categories to the candidate temporal segments. The general pipeline of the spatiotemporal action detection task is: first capture the spatial position of the human action, that is, detect the spatial candidate-box coordinates and action category score of each frame; then use the correlation between consecutive frames to perform temporal action detection and locate the time period in which the action occurs; and finally connect the detection results of each frame to form the spatiotemporal tube of the action.

It should be understood that short-term action prediction uses the partially observed video segment to predict, early in the execution of an action, the category of the action taking place, while long-term action prediction infers from the human actions observed at the current moment which actions are likely to occur in the future. For the temporal action detection task, the mean average precision (mAP) is usually used to measure algorithm performance: the average precision (AP) of the detection results is first computed for each action category, and the mean of these values gives the mAP. A predicted temporal action segment is considered a correct detection only when its temporal intersection over union (tIoU) with the corresponding ground-truth segment exceeds a set threshold and the action category is predicted correctly. The temporal IoU of a predicted segment and its ground-truth segment is computed as tIoU = ζ(T_p ∩ T_g) / ζ(T_p ∪ T_g), where T_p and T_g denote the predicted temporal interval and the ground-truth interval, and the function ζ(·) computes the length of an interval. Through human action acquisition, action detection and action prediction, the accuracy of action recognition is improved and the convenience of human-computer interaction is enhanced.
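As a small illustration of the evaluation criterion above, the following Python function (not from the patent) checks whether a detection counts as correct; the threshold value is illustrative.

```python
def is_correct_detection(pred_seg, pred_cls, gt_seg, gt_cls, thresh=0.5):
    """A detection is correct when its temporal IoU with the ground-truth segment
    exceeds `thresh` and the predicted class matches (threshold is illustrative)."""
    inter = max(0.0, min(pred_seg[1], gt_seg[1]) - max(pred_seg[0], gt_seg[0]))
    union = (pred_seg[1] - pred_seg[0]) + (gt_seg[1] - gt_seg[0]) - inter
    tiou = inter / union if union > 0 else 0.0
    return tiou > thresh and pred_cls == gt_cls

print(is_correct_detection((2.0, 7.5), "wave", (3.0, 8.0), "wave"))  # True (tIoU = 0.75)
```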

Optionally, using a tensor fusion algorithm to take the outer product of the feature vectors corresponding to multiple different modal data includes: after the n-modal fused feature tensor is linearly mapped, an (n+1)-dimensional weight tensor is obtained; applying Tucker decomposition to it yields n+1 orthogonal mapping matrices and a core tensor. A rank constraint is then introduced for further decomposition, in which the core tensor is expressed as the Hamiltonian product of a series of tensors, with one rank-decomposition factor for each of the n modalities.

In this embodiment, constructing a network model according to the behavioral features and the temporal features includes: after multimodal feature fusion, a multimodal semantic feature representation is obtained and fed into a neural network layer to produce the final cross-modal semantic feature representation h_i; at each time point i, a series of multi-scale temporal candidate segments is preset, where the j-th candidate segment at time point i has preset start and end time boundaries, W_j denotes the preset temporal width of the j-th segment, and the total number of candidate segments is fixed in advance. The confidence score of each candidate segment is evaluated with a sigmoid activation function σ as cs_i = σ(Conv1d(h_i)), where cs_i contains the scores of the candidate segments at time point i and each score represents the similarity between the video segment and the text description; at the same time, a predicted temporal boundary offset is computed for each candidate segment, and the predicted start and end offsets at time point i, applied to the preset boundaries, give the final predicted segment j at time point i.

It should be noted that the loss function consists of a loss term for the matching score between video segments and the text description and a loss term for the temporal boundary offsets. For each candidate temporal segment, the temporal intersection over union (IoU) with the target segment (s, e) is computed; if the IoU is smaller than a preset threshold λ it is set to 0, and if it is larger than λ the candidate segment is taken as a positive sample, otherwise as a negative sample. The matching loss is computed over the positive and negative candidate segments, where N_pos denotes the number of positive samples among the candidate temporal segments and N_neg denotes the number of negative samples. A boundary regression strategy is used to adjust the temporal localization offsets: the IoU between each candidate segment and the target segment is computed, the set C_h of candidate temporal segments whose IoU exceeds a set threshold γ is selected, and the temporal boundary offsets of these candidate segments are computed, where (s, e) denotes the start and end time points of the given text description and the start and end time points of the candidate temporal video segments in C_h are used as references. With δ = [δ_s, δ_e] denoting the ground-truth temporal localization offset and a corresponding predicted offset, the temporal boundaries of the current candidate segments are adaptively adjusted based on the ground-truth offset through a regression loss, where SL_1 denotes the L1 norm and N denotes the size of the set C_h.

It should be understood that Tucker decomposition is the multilinear form of principal component analysis: every tensor can be (non-uniquely) represented as a core tensor, i.e., the principal-component factor, multiplied by factor matrices along all modes. Compared with CP decomposition, which requires estimating the rank and approximating the initial tensor, Tucker decomposition yields more accurate tensor decomposition results, and feature selection for each modal feature vector can be achieved by adjusting the dimensions of the core tensor. To further reduce the computational complexity of the fusion model and balance the complexity and expressiveness of the interactive fusion modeling, a structured sparsity constraint is introduced according to the sparsity of the core tensor, decomposing the weight core tensor into multiple factors; the rank constraint acts as regularization during training to prevent overfitting and allows the mapping from input to output to be adjusted flexibly.

Optionally, using a convolutional neural network and a long short-term memory network to extract features from the data set to obtain behavioral features and the corresponding temporal features includes:

The raw inertial sensor data are treated as an image of size time × channels. A long short-term memory network (LSTM) and a convolutional neural network (CNN) are used: a one-dimensional convolution captures the temporal signal structure within the convolution kernel window, and the CNN extracts the key behavioral features of the inertial sensor signal itself, where N denotes the kernel length, D denotes the depth of the sensor data and of the kernel, w_n^{d_0} denotes the n-th weight at kernel depth d_0, x_{i_0}^{d_0} denotes the i_0-th element of the sensor signal at depth d_0, the convolution output is the i_0-th feature obtained from the sensor signal, and f(·) denotes the activation function;

The size of the features obtained after pooling depends on the feature length of the current layer i_0, the padding size P and the stride S. Three convolution and pooling operations generate temporally low-dimensional high-level features, and the CNN-processed behavioral features are then fed, in chronological order, into the long short-term memory network, which contains two LSTM layers; each layer uses unidirectional connections with 128 hidden units, converting the behavioral features obtained in the preceding part into 128-dimensional temporal features and dynamically modeling the temporal information.

In this embodiment, a convolutional neural network that treats the raw inertial sensor data as an image exploits the local patterns inherent in the sensor image and convolves the image with shared kernels, but it extracts only part of the features and does not further process the temporal information hidden in the signal, ignoring the continuity of human behavior. A long short-term memory network that feeds the raw inertial sensor signal directly into the LSTM lacks integration of the sensor data, which makes the algorithm run relatively slowly; although the gating mechanism of the LSTM can, to a certain extent, solve the vanishing-gradient problem of recurrent neural networks, it cannot handle longer time series. The present scheme therefore uses a two-layer LSTM to obtain the contextual temporal information between different signal frames and uses the gating mechanism to selectively retain the behavioral information contained in the CNN-extracted features, so as to better excite the inertial sensor signal features over time, obtain the spatiotemporal features relevant to behavior recognition, and realize spatial-temporal behavior feature learning.

Optionally, connecting the detection results of each frame to form the spatiotemporal tube of the action and preprocessing the video frames to obtain the data set corresponding to the human actions includes:

The unsegmented original video is represented as X = {x_n}_{n=1}^{w}, where x_n denotes the n-th frame of video X and w denotes the number of frames in X. All actions contained in the video can be represented by a set of instances Ψ_g = {(t_{s,i}, t_{e,i})}_{i=1}^{N_g}, where N_g denotes the number of ground-truth action instances in video X, t_{s,i} and t_{e,i} denote the start and end points of the i-th action instance, and ψ_g denotes an action instance;

Serialized features are constructed from the input video and text; multi-scale candidate temporal video segments are generated at each time point of the video sequence features according to preset temporal lengths, and a temporal co-attention interaction network performs feature interaction and fusion between the candidate temporal segments and the text sequence features, yielding multimodal fused data embedded in the same feature space to obtain the corresponding data set.

In this embodiment, obtaining the spatial positions corresponding to the human actions, detecting the spatial candidate-box coordinates and action category of each frame, and using the correlation between consecutive frames to perform temporal action detection includes: given an unsegmented video sequence V, the video signal is first sampled at equal intervals at a fixed frame rate to obtain an image frame sequence, which is divided into M non-overlapping video unit segments {v_1, ..., v_j, ..., v_M} of equal length; a positional encoding function adds extra temporal position information to the video images to form the video unit features, where d and d_v denote the dimension of the encoded video features and of the extracted video unit features, respectively. A unit-level co-attention interaction layer is used to build the interaction information between video and text: with the feature representations of the input video and text description denoted V_in and S_in, linear mappings transform the d-dimensional feature vectors into query vectors (Q_f, Q_s), key vectors (K_f, K_s) and value vectors (V_f, V_s), where W_sk, W_fk, W_sq, W_fq, W_sv and W_fv denote learnable weight matrices. The Q_f features from the video modality are used as queries, the K_s and V_s features from the text modality are used as keys and values, and the similarity weight matrix between them is computed to obtain the weighted video features; the contextual information of the text and video features is incorporated into the feature vector at the current position to obtain the corresponding temporal features.

In this embodiment, a modality refers to a particular way of representing information, including the various sensory channels through which things are perceived, and multimodality refers to the combination of two or more modalities. The reason for performing multimodal fusion is that different modalities view a problem from different representations and different angles; multimodal data contain various kinds of overlapping and complementary information, so multiple modalities perform better than a single modality. In the field of human behavior recognition, acceleration, angular velocity and RGB video image data are heterogeneous types of data, each with its own characteristics: inertial sensors can only capture the motion characteristics of body parts and cannot accurately recognize fine movements, such as details of hand actions, whereas RGB video is affected by occlusion and illumination, and when the human body is occluded, recognition can only rely on the inertial sensors. Deep-learning feature-level multimodal fusion methods include concatenation fusion and additive fusion; concatenation fusion splices multiple modal feature vectors together and thus increases the dimensionality of the overall feature vector.

Referring to FIG. 2, the present invention provides an action recognition system based on multimodal sequence fusion, comprising:

a first acquisition module, configured to obtain a human action video to be recognized and annotate the actions in the action video to obtain video frames, wherein the action annotation includes semantic segmentation and timeline segmentation labels of the actions;

a second acquisition module, configured to obtain the spatial positions corresponding to the human actions, detect the spatial candidate-box coordinates and action category of each frame, use the correlation between consecutive frames to perform temporal action detection and locate the time period in which an action occurs, connect the detection results of each frame to form the spatiotemporal tube of the action, and preprocess the video frames to obtain a data set corresponding to the human actions;

a feature extraction module, configured to use a convolutional neural network and a long short-term memory network to extract features from the data set to obtain behavioral features and the temporal features corresponding to the behavioral features;

a recognition module, configured to construct a network model according to the behavioral features and the temporal features, and to input multiple modalities of information into the network model for feature fusion and classification to complete human action recognition.

In this embodiment, human behavior recognition classifies and recognizes the collected user motion information and data by certain means and methods in order to judge the user's activity state or detect the user's behavior. Time-series data carry both spatial and temporal structure, and failing to make full use of the temporal features would be a great loss for a behavior recognition model. By obtaining the human action video to be recognized, annotating the actions in the video to obtain video frames, obtaining the spatial positions corresponding to the human actions, detecting the spatial candidate-box coordinates and action category of each frame, using the correlation between consecutive frames to perform temporal action detection and locate the time period in which an action occurs, connecting the detection results of each frame to form the spatiotemporal tube of the action, preprocessing the video frames to obtain a data set corresponding to the human actions, using a convolutional neural network and a long short-term memory network to extract behavioral features and the corresponding temporal features from the data set, constructing a network model from the behavioral features and temporal features, and inputting multiple modalities of information into the network model for feature fusion and classification to complete human action recognition, the method recognizes human behavior with multiple modalities and fuses them. This enhances the accuracy and robustness of the model in real scenes and effectively reduces the semantic loss that occurs when the feature vectors are manipulated in later stages; the products formed inside the fused feature tensor fully exploit the complementary information of the different modalities, thereby improving the stability and accuracy of recognition.

In all of the examples shown and described herein, any specific value should be interpreted as merely illustrative rather than limiting; other examples of the exemplary embodiments may therefore use different values.

It should be noted that similar reference numerals and letters denote similar items in the following figures; once an item has been defined in one figure, it need not be further defined or explained in subsequent figures.

The embodiments described above express only several implementations of the present invention; their description is relatively specific and detailed, but this must not be understood as limiting the scope of the invention. It should be pointed out that a person of ordinary skill in the art may make several variations and improvements without departing from the concept of the present invention, and these all fall within the scope of protection of the present invention.

Claims (10)

1. An action recognition method based on multimodal sequence fusion, characterized in that it comprises the following steps:
acquiring a human action video to be recognized and annotating the actions in the video to obtain video frames, wherein the action annotation comprises semantic segmentation and timeline segmentation labels of the actions;
acquiring the spatial position corresponding to the human action, detecting the spatial candidate-box coordinates and action category of the action in each frame, performing temporal action detection using the correlation between consecutive frames, locating the time period in which the action occurs, and linking the per-frame detection results to form the spatio-temporal tube of the action, so that the video frames are preprocessed into a data set corresponding to the human action;
performing feature extraction on the data set with a convolutional neural network and a long short-term memory network to obtain behavioral features and the temporal features corresponding to the behavioral features;
building a network model from the behavioral features and the temporal features, and feeding multiple modalities of information into the network model for feature fusion and classification so as to complete human action recognition.
2. The action recognition method based on multimodal sequence fusion according to claim 1, characterized in that feeding multiple modalities of information into the network model for feature fusion and classification so as to complete human action recognition comprises:
treating the interaction information between the multimodal data as the common features shared by the modalities, and using a tensor fusion algorithm to perform an outer-product (volume) operation on the feature vectors of the different modalities to obtain an information-association tensor over the multimodal feature elements;
when computing the outer product, appending a constant dimension 1 to each feature vector so that the unimodal input features are preserved in the network model, expressed as (Figure FDA0003987109550000011), where the modal features are (Figure FDA0003987109550000012) and (Figure FDA0003987109550000013), and (Figure FDA0003987109550000014) denotes the outer-product operation; the outer product of the three feature vectors a, g and v is expressed as (Figure FDA0003987109550000015);
compressing the original tensor by Tucker decomposition and expressing the weight tensor as the product of four orthogonal matrices and a core tensor, so that the four-way weight tensor τ is decomposed as τ = ((t_c ×_1 W_a) ×_2 W_g) ×_3 W_v ×_4 W_0; the decomposed core tensor t_c (Figure FDA0003987109550000016) is used for the interaction of the three modalities; subject to the constraint on the number of parameters, its dimensions control the complexity of the whole fused modality while preserving the mapping from all feature vectors to the fused features; the trilinear model is expressed as (Figure FDA0003987109550000021), where (Figure FDA0003987109550000022) and (Figure FDA0003987109550000023) denote the projection of each feature vector into its own low-dimensional space; the larger t_a, t_g and t_v are, the more parameters must be trained and the higher the complexity of the model; finally, (Figure FDA00039871095500000217) controls the dimension of the fused feature vector.
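One compact way to picture the outer-product fusion of claim 2, with each modality vector padded by a constant 1 so that unimodal terms survive in the fused tensor, is the sketch below. The vector sizes and the final flattening are illustrative assumptions, not the patent's exact formulation.

```python
# Sketch of tensor fusion: append 1 to each modality vector, take the 3-way outer
# product, then flatten. All dimensions are made up for the example.
import torch

def tensor_fusion(a: torch.Tensor, g: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """a, g, v: (batch, d_a), (batch, d_g), (batch, d_v) modality feature vectors."""
    pad1 = lambda x: torch.cat([x, torch.ones(x.size(0), 1)], dim=1)   # append constant 1
    a1, g1, v1 = pad1(a), pad1(g), pad1(v)
    # Outer product over the three padded vectors:
    # z[b, i, j, k] = a1[b, i] * g1[b, j] * v1[b, k]
    z = torch.einsum("bi,bj,bk->bijk", a1, g1, v1)
    return z.flatten(1)          # (batch, (d_a+1)*(d_g+1)*(d_v+1))

z = tensor_fusion(torch.randn(4, 8), torch.randn(4, 6), torch.randn(4, 10))
print(z.shape)  # torch.Size([4, 693])  i.e. 9 * 7 * 11
```

Because this fused tensor grows multiplicatively with the modality dimensions, the Tucker/rank-constrained compression described in claims 2 and 3 is what keeps the parameter count manageable.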
3. The action recognition method based on multimodal sequence fusion according to claim 2, characterized in that using the tensor fusion algorithm to perform the outer-product (volume) operation on the feature vectors of the different modalities comprises:
linearly mapping the n-modality fused feature tensor to obtain an (n+1)-way weight tensor (Figure FDA0003987109550000024); applying tensor Tucker decomposition to it to obtain n+1 orthogonal mapping matrices and a core tensor, expressed as (Figure FDA0003987109550000025); letting (Figure FDA0003987109550000026), then y = z^T W_0 (Figure FDA0003987109550000027);
further introducing a rank constraint into the decomposition, (Figure FDA0003987109550000028), where (Figure FDA0003987109550000029) denotes the Hadamard (element-wise) product over a series of tensors and (Figure FDA00039871095500000210) is the rank-decomposition factor of the i-th modality.
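The rank-constrained decomposition of claim 3 is in the spirit of low-rank multimodal fusion, where the large weight tensor is never materialized: each (1-padded) modality vector is projected by rank-many factors and the projections are combined element-wise, then summed over the rank dimension. The sketch below is one common reading of that idea; the rank, the output dimension and the initialization are assumptions.

```python
# Sketch of rank-constrained fusion: project each padded modality vector with R
# rank factors, combine modalities by Hadamard product, sum over the rank axis.
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    def __init__(self, dims, out_dim: int = 64, rank: int = 4):
        super().__init__()
        # one (rank, d_m + 1, out_dim) factor per modality (the +1 is the constant pad)
        self.factors = nn.ParameterList(
            [nn.Parameter(torch.randn(rank, d + 1, out_dim) * 0.1) for d in dims]
        )
        self.rank_weights = nn.Parameter(torch.randn(rank))

    def forward(self, xs):
        fused = None
        for x, W in zip(xs, self.factors):
            x1 = torch.cat([x, torch.ones(x.size(0), 1)], dim=1)    # (B, d+1)
            proj = torch.einsum("bd,rdo->rbo", x1, W)               # (R, B, out)
            fused = proj if fused is None else fused * proj         # Hadamard across modalities
        return torch.einsum("r,rbo->bo", self.rank_weights, fused)  # weighted sum over rank

fusion = LowRankFusion(dims=[8, 6, 10])
y = fusion([torch.randn(4, 8), torch.randn(4, 6), torch.randn(4, 10)])
print(y.shape)  # torch.Size([4, 64])
```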
4. The action recognition method based on multimodal sequence fusion according to claim 1, characterized in that building a network model from the behavioral features and the temporal features comprises:
obtaining, after multimodal feature fusion, the multimodal semantic feature representation (Figure FDA00039871095500000211) (Figure FDA00039871095500000212), and feeding it into a neural-network layer to obtain the final cross-modal semantic feature representation (Figure FDA00039871095500000213);
presetting, at each time point i, a series of multi-scale candidate temporal segments (Figure FDA00039871095500000214), where (Figure FDA00039871095500000215) denotes the start and end time boundaries of the j-th candidate segment at time point i, W_j denotes the preset temporal width of the j-th segment, and (Figure FDA00039871095500000216) denotes the total number of candidate segments;
evaluating the confidence scores of the candidate segments through a sigmoid activation function (σ) as cs_i = σ(Conv1d(h_i)), where (Figure FDA0003987109550000031) denotes the scores of the (Figure FDA0003987109550000032) candidate segments at time point i, the score representing the similarity between the video segment and the text description; at the same time, computing for each candidate segment its corresponding predicted temporal boundary offsets as (Figure FDA0003987109550000033), where (Figure FDA0003987109550000034) denotes the predicted start and end temporal offsets at time point i; finally, the predicted segment j at time point i is expressed as (Figure FDA0003987109550000035).
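A minimal sketch of the kind of proposal head described in claim 4: at every time step, one Conv1d branch produces a sigmoid confidence score per anchor width, and a second branch predicts start/end offsets that refine each candidate. The anchor widths, feature size and the way offsets are applied here are illustrative assumptions.

```python
# Sketch: per-time-step candidate segments of preset widths, scored with a sigmoid
# Conv1d head, plus predicted start/end offsets that refine each candidate.
import torch
import torch.nn as nn

class ProposalHead(nn.Module):
    def __init__(self, feat_dim: int = 256, widths=(8, 16, 32)):
        super().__init__()
        self.widths = widths
        k = len(widths)
        self.score = nn.Conv1d(feat_dim, k, kernel_size=3, padding=1)        # cs_i per anchor
        self.offset = nn.Conv1d(feat_dim, 2 * k, kernel_size=3, padding=1)   # start/end offsets

    def forward(self, h):                        # h: (batch, feat_dim, T) fused features
        B, _, T = h.shape
        cs = torch.sigmoid(self.score(h))        # (B, K, T) confidence per anchor width
        off = self.offset(h).view(B, len(self.widths), 2, T)
        t = torch.arange(T, dtype=h.dtype)
        props = []
        for j, w in enumerate(self.widths):      # anchor (t - w/2, t + w/2) refined by offsets
            start = t - w / 2 + off[:, j, 0]
            end = t + w / 2 + off[:, j, 1]
            props.append(torch.stack([start, end, cs[:, j]], dim=-1))
        return torch.stack(props, dim=1)         # (B, K, T, 3): start, end, score

head = ProposalHead()
out = head(torch.randn(2, 256, 50))
print(out.shape)  # torch.Size([2, 3, 50, 3])
```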
5. The action recognition method based on multimodal sequence fusion according to claim 4, characterized in that the loss function consists of a loss for the matching score between the video segment and the text description and a loss for the temporal boundary offsets; the temporal intersection-over-union IoU between each candidate temporal segment (Figure FDA0003987109550000036) and the target segment (s, e) is computed; if it is smaller than a preset threshold λ the IoU is set to 0, and if it is larger than the threshold λ the candidate segment is determined to be a positive sample, otherwise a negative sample; the matching loss is expressed as (Figure FDA0003987109550000037), where N_pos denotes the number of positive candidate temporal segments and N_neg the number of negative samples;
a boundary-regression strategy is adopted to adjust the temporal localization offsets: the IoU between each candidate segment and the target segment is computed, the set C_h of candidate temporal segments whose IoU exceeds a set threshold γ is selected, and the temporal boundary offsets of these candidate segments are computed as (Figure FDA0003987109550000038), where (s, e) denotes the start and end time points of the given text description and (Figure FDA0003987109550000039) the start and end time points of the corresponding candidate temporal video segments in C_h;
δ = [δ_s, δ_e] denotes the ground-truth temporal localization offsets and (Figure FDA00039871095500000310) the predicted temporal localization offsets; the temporal boundaries of the current candidate segments are adaptively adjusted on the basis of the ground-truth offsets, (Figure FDA00039871095500000311), where SL_1 denotes the L_1 norm and N the size of the set C_h.
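The two loss terms of claim 5 can be pictured as follows: candidates are labelled positive or negative by their temporal IoU against the ground-truth segment, the matching score is trained with a class-balanced binary cross-entropy, and the boundary offsets of high-IoU candidates are regressed. The thresholds, the balancing weights and the use of a smooth-L1 regression criterion are assumptions of this sketch rather than the patent's exact definitions.

```python
# Sketch of the matching + boundary-regression losses driven by temporal IoU.
import torch
import torch.nn.functional as F

def temporal_iou(cand: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """cand: (N, 2) [start, end]; target: (2,) [s, e]. Returns (N,) IoU."""
    inter = (torch.minimum(cand[:, 1], target[1]) -
             torch.maximum(cand[:, 0], target[0])).clamp(min=0)
    union = (cand[:, 1] - cand[:, 0]) + (target[1] - target[0]) - inter
    return inter / union.clamp(min=1e-6)

def proposal_losses(cand, scores, pred_off, target, lam=0.5, gamma=0.7):
    iou = temporal_iou(cand, target)
    pos, neg = iou > lam, iou <= lam                  # positive / negative candidates
    labels = pos.float()
    # matching loss: binary cross-entropy balanced by the positive / negative counts
    w = torch.where(pos, 1.0 / pos.sum().clamp(min=1), 1.0 / neg.sum().clamp(min=1))
    l_match = F.binary_cross_entropy(scores, labels, weight=w, reduction="sum")
    # boundary regression only on candidates whose IoU exceeds gamma
    keep = iou > gamma
    if keep.any():
        true_off = target.unsqueeze(0) - cand[keep]   # ground-truth (delta_s, delta_e)
        l_reg = F.smooth_l1_loss(pred_off[keep], true_off)
    else:
        l_reg = scores.new_zeros(())
    return l_match, l_reg

cand = torch.tensor([[0.0, 10.0], [4.0, 14.0], [20.0, 30.0]])
scores = torch.tensor([0.8, 0.6, 0.1])
pred_off = torch.zeros(3, 2)
print(proposal_losses(cand, scores, pred_off, torch.tensor([5.0, 12.0])))
```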
6. The action recognition method based on multimodal sequence fusion according to claim 1, characterized in that performing feature extraction on the data set with a convolutional neural network and a long short-term memory network to obtain behavioral features and the temporal features corresponding to the behavioral features comprises:
treating the raw inertial-sensor data as an image of size time × channels, and using a long short-term memory network LSTM and a convolutional neural network CNN; a one-dimensional convolution operation captures the temporal signal structure inside the convolution-kernel window, and the convolutional neural network extracts the key behavioral features of the inertial-sensor signal itself; the one-dimensional convolution is computed as (Figure FDA0003987109550000041), where N denotes the kernel length, D the depth of the sensor data and of the kernel, (Figure FDA0003987109550000042) the n-th weight at depth d_0 of the one-dimensional kernel, (Figure FDA0003987109550000043) the i_0-th element of the sensor signal at depth d_0, (Figure FDA0003987109550000044) the i_0-th feature obtained from the sensor by the convolution operation, and f(*) the activation function;
the size of the features obtained after pooling is expressed as (Figure FDA0003987109550000045), where (Figure FDA0003987109550000046) denotes the feature length of the current layer i_0, P the padding size and S the stride; three rounds of convolution and pooling generate temporally low-dimensional high-level features, which are then fed in chronological order into the long short-term memory network; the long short-term memory network comprises two LSTM layers, each unidirectionally connected with 128 hidden units, converting the behavioral features obtained in the previous part into 128-dimensional temporal features and dynamically modeling the temporal information.
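For the feature-size bookkeeping in claim 6, the usual 1-D convolution/pooling relation is L_out = floor((L_in − N + 2P) / S) + 1, with N the kernel length, P the padding and S the stride. The short check below verifies that relation against an actual Conv1d/MaxPool1d stack and then feeds the pooled features into a two-layer, 128-unit LSTM as quoted in the claim; the concrete channel counts, kernel sizes and strides are otherwise illustrative assumptions.

```python
# Sketch: verify the 1-D feature-length formula and feed the pooled features,
# in time order, into a two-layer LSTM with 128 hidden units.
import torch
import torch.nn as nn

def out_len(l_in: int, n: int, p: int = 0, s: int = 1) -> int:
    """L_out = floor((L_in - N + 2P) / S) + 1 for a 1-D conv or pooling layer."""
    return (l_in - n + 2 * p) // s + 1

conv = nn.Conv1d(in_channels=6, out_channels=64, kernel_size=5, padding=2)
pool = nn.MaxPool1d(kernel_size=2, stride=2)
lstm = nn.LSTM(input_size=64, hidden_size=128, num_layers=2, batch_first=True)

x = torch.randn(1, 6, 120)                 # e.g. 6 inertial channels, 120 time steps
t_conv = out_len(120, n=5, p=2, s=1)       # 120 (padding preserves the length)
t_pool = out_len(t_conv, n=2, p=0, s=2)    # 60

y = pool(torch.relu(conv(x)))              # (1, 64, 60)
assert y.shape[-1] == t_pool
seq, _ = lstm(y.transpose(1, 2))           # (1, 60, 128) temporal features
print(t_conv, t_pool, seq.shape)
```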
7. The action recognition method based on multimodal sequence fusion according to claim 1, characterized in that linking the per-frame detection results to form the spatio-temporal tube of the action so that the video frames are preprocessed into a data set corresponding to the human action comprises:
representing the unsegmented raw video as (Figure FDA0003987109550000047), where x_n denotes the n-th frame of video X and w the number of frames in video X; the set of all actions contained in the video is represented by a set of instances (Figure FDA0003987109550000048), where N_g denotes the number of ground-truth action instances in video X, t_{s,i} and t_{e,i} denote the start and end points of action instance (Figure FDA0003987109550000049), and ψ_g denotes the action instance;
constructing serialized features from the input video and text, generating multi-scale candidate temporal video segments at each time point of the video sequence features according to preset temporal lengths, and using a temporal co-attention interaction network to perform feature interaction and fusion between the candidate temporal segments and the text sequence features, yielding multimodal fused data embedded in the same feature space so as to obtain the corresponding data set.
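The preprocessing of claim 7 enumerates, at every time point of the video sequence, candidate clips of several preset lengths before they are fused with the text features. A plain-Python sketch of that enumeration is shown below; the clip widths and the clamping of each clip to the video extent are assumptions of the example.

```python
# Sketch: enumerate multi-scale candidate temporal clips centred on every time step,
# clipped to the video extent; each candidate is (start, end, centre, width).
def candidate_clips(num_steps: int, widths=(4, 8, 16)):
    clips = []
    for i in range(num_steps):
        for w in widths:
            start = max(0.0, i - w / 2)
            end = min(float(num_steps - 1), i + w / 2)
            clips.append((start, end, i, w))
    return clips

clips = candidate_clips(num_steps=20)
print(len(clips))       # 60 candidates: 20 time steps x 3 widths
print(clips[:3])        # [(0.0, 2.0, 0, 4), (0.0, 4.0, 0, 8), (0.0, 8.0, 0, 16)]
```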
8. The action recognition method based on multimodal sequence fusion according to claim 1, characterized in that acquiring the spatial position corresponding to the human action, detecting the spatial candidate-box coordinates and action category of the action in each frame, and performing temporal action detection using the correlation between consecutive frames comprises:
presetting an unsegmented video sequence V, first sampling the video signal at equal intervals at a fixed frame rate to obtain an image-frame sequence and dividing it into M non-overlapping video clip units of equal length {v_1, ..., v_j, ..., v_M}, and using a positional-encoding function to add extra temporal position information to the video images; the video unit features (Figure FDA0003987109550000051) are expressed as (Figure FDA0003987109550000052) (Figure FDA0003987109550000053), where (Figure FDA0003987109550000054), and d and d_v denote the video encoding feature dimension and the extracted video-unit feature dimension;
using a unit co-attention interaction layer to build the interaction information between video and text, presetting the feature representations of the input video and of the text description as V_in and S_in, and transforming the d-dimensional feature vectors by linear mappings into query vectors (Q_f, Q_s), key vectors (K_f, K_s) and value vectors (V_f, V_s), expressed as (Figure FDA0003987109550000055), where W_sk, W_fk, W_sq, W_fq, W_sv and W_fv denote learnable weight matrices; taking the Q_f features from the video modality as the query vectors and the K_s and V_s features from the text modality as the key and value vectors respectively, computing the similarity weight matrix between them to obtain the weighted video features (Figure FDA0003987109550000056), and incorporating the contextual information of the text and video features into the feature vector at the current position to obtain the corresponding temporal features.
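The unit co-attention step of claim 8 is a cross-attention in which the video side supplies the queries and the text side the keys and values. Below is a minimal sketch of that single step; the feature dimension, the scaling factor and the single-head formulation are assumptions, and a full model would typically add the symmetric text-attends-to-video direction plus residual and normalization layers.

```python
# Sketch of one cross-modal attention step: video unit features query the text features.
import math
import torch
import torch.nn as nn

class VideoTextCoAttention(nn.Module):
    def __init__(self, d: int = 256):
        super().__init__()
        self.w_fq = nn.Linear(d, d)   # video -> queries  (Q_f)
        self.w_sk = nn.Linear(d, d)   # text  -> keys     (K_s)
        self.w_sv = nn.Linear(d, d)   # text  -> values   (V_s)
        self.d = d

    def forward(self, v_in, s_in):    # v_in: (B, M, d) video units, s_in: (B, L, d) text tokens
        q, k, v = self.w_fq(v_in), self.w_sk(s_in), self.w_sv(s_in)
        attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(self.d), dim=-1)  # (B, M, L)
        return attn @ v               # text-conditioned video features, (B, M, d)

coattn = VideoTextCoAttention()
out = coattn(torch.randn(2, 16, 256), torch.randn(2, 7, 256))
print(out.shape)  # torch.Size([2, 16, 256])
```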
9. The action recognition method based on multimodal sequence fusion according to claim 1, characterized in that acquiring the human action video to be recognized and annotating the actions in the video to obtain video frames comprises:
using inertial sensors to capture the motion characteristics of the body parts; each time an unannotated video is read in, switching cameras for annotation; after an action has been annotated, judging from the screenshots at the start and end of the action whether the annotation is correct, and if any annotation error is found, fine-tuning the camera or re-annotating.
10. An action recognition system based on multimodal sequence fusion using the action recognition method based on multimodal sequence fusion according to any one of claims 1-8, characterized in that it comprises:
a first acquisition module, configured to acquire the human action video to be recognized and annotate the actions in the video to obtain video frames, wherein the action annotation comprises semantic segmentation and timeline segmentation labels of the actions;
a second acquisition module, configured to acquire the spatial position corresponding to the human action, detect the spatial candidate-box coordinates and action category of the action in each frame, perform temporal action detection using the correlation between consecutive frames, locate the time period in which the action occurs, and link the per-frame detection results to form the spatio-temporal tube of the action, so that the video frames are preprocessed into a data set corresponding to the human action;
a feature extraction module, configured to perform feature extraction on the data set with a convolutional neural network and a long short-term memory network to obtain behavioral features and the temporal features corresponding to the behavioral features;
a recognition module, configured to build a network model from the behavioral features and the temporal features, and to feed multiple modalities of information into the network model for feature fusion and classification so as to complete human action recognition.
CN202211568552.3A 2022-12-08 2022-12-08 Action recognition method and system based on multi-modal sequence fusion Pending CN115937975A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211568552.3A CN115937975A (en) 2022-12-08 2022-12-08 Action recognition method and system based on multi-modal sequence fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211568552.3A CN115937975A (en) 2022-12-08 2022-12-08 Action recognition method and system based on multi-modal sequence fusion

Publications (1)

Publication Number Publication Date
CN115937975A true CN115937975A (en) 2023-04-07

Family

ID=86550146

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211568552.3A Pending CN115937975A (en) 2022-12-08 2022-12-08 Action recognition method and system based on multi-modal sequence fusion

Country Status (1)

Country Link
CN (1) CN115937975A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116503957A (en) * 2023-06-26 2023-07-28 成都千嘉科技股份有限公司 Gas household operation behavior identification method
CN116503957B (en) * 2023-06-26 2023-09-15 成都千嘉科技股份有限公司 Gas household operation behavior identification method
CN117409538A (en) * 2023-12-13 2024-01-16 吉林大学 Wireless anti-fall alarm system and method for nursing care
CN117953543A (en) * 2024-03-26 2024-04-30 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Character interaction detection method based on multiple texts, terminal and readable storage medium
CN118366076A (en) * 2024-04-09 2024-07-19 浙江安得仕科技有限公司 Video vector fusion analysis method and system based on deep learning
CN118570877A (en) * 2024-07-03 2024-08-30 广东工业大学 A method and system for giant panda behavior recognition based on deep learning
CN118781526A (en) * 2024-09-10 2024-10-15 深圳市博科思智能有限公司 Video analysis method and system for monitoring terminal
CN118821029A (en) * 2024-09-18 2024-10-22 山东网信安全科技有限公司 A video network asset identity credible identification method and system
CN118821062A (en) * 2024-09-19 2024-10-22 山东大学 Multi-sensor fusion human intention recognition method, system, electronic device and medium
CN118917297A (en) * 2024-09-30 2024-11-08 南昌虚拟现实研究院股份有限公司 Method and device for acquiring action annotation data set

Similar Documents

Publication Publication Date Title
CN115937975A (en) Action recognition method and system based on multi-modal sequence fusion
Özyer et al. Human action recognition approaches with video datasets—A survey
Subetha et al. A survey on human activity recognition from videos
Abbas et al. Video scene analysis: an overview and challenges on deep learning algorithms
Devanne et al. 3-d human action recognition by shape analysis of motion trajectories on riemannian manifold
Caputo et al. SHREC 2021: Skeleton-based hand gesture recognition in the wild
Zhang et al. From discriminant to complete: Reinforcement searching-agent learning for weakly supervised object detection
Ibraheem et al. Survey on various gesture recognition technologies and techniques
Xiang et al. Expression recognition using fuzzy spatio-temporal modeling
Colaco et al. Facial keypoint detection with convolutional neural networks
Tzirakis et al. Time-series clustering with jointly learning deep representations, clusters and temporal boundaries
de Araujo Zeni et al. Real-time gender detection in the wild using deep neural networks
CN101354787A (en) A method for extracting features of target motion trajectory in intelligent visual surveillance retrieval
Hazourli et al. Deep multi-facial patches aggregation network for facial expression recognition
Zerrouki et al. Deep learning for hand gesture recognition in virtual museum using wearable vision sensors
Chen et al. A multi-scale fusion convolutional neural network for face detection
Sheeba et al. Hybrid features-enabled dragon deep belief neural network for activity recognition
Doždor et al. TY-Net: Transforming YOLO for hand gesture recognition
Hossain et al. A hybrid deep learning framework for daily living human activity recognition with cluster-based video summarization
Arya et al. Enhancing Human Pose Estimation: A Data-Driven Approach with MediaPipe BlazePose and Feature Engineering Analysing
CN116561649A (en) Method and system for diver's motion state recognition based on multi-source sensor data
Zerrouki et al. Exploiting deep learning-based LSTM classification for improving hand gesture recognition to enhance visitors’ museum experiences
Ghaderi et al. Weakly supervised pairwise Frank–Wolfe algorithm to recognize a sequence of human actions in RGB-D videos
Chong et al. Modeling video-based anomaly detection using deep architectures: Challenges and possibilities
Moayedi et al. Human action recognition: Learning sparse basis units from trajectory subspace

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination