CN115937975A - Action recognition method and system based on multi-modal sequence fusion - Google Patents

Action recognition method and system based on multi-modal sequence fusion

Info

Publication number
CN115937975A
Authority
CN
China
Prior art keywords
action
video
features
feature
time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211568552.3A
Other languages
Chinese (zh)
Inventor
曾国坤
刘予川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology Shenzhen
Original Assignee
Harbin Institute of Technology Shenzhen
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology Shenzhen
Priority to CN202211568552.3A
Publication of CN115937975A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an action recognition method based on multi-modal sequence fusion. The method obtains a human action video to be recognized and annotates the actions in the video to obtain video frames; it acquires the spatial positions corresponding to the human actions, detects the spatial candidate-box coordinates and action category of each frame, uses the correlation between consecutive frames to perform temporal action detection and locate the time period in which an action occurs, and preprocesses the video frames to obtain a data set. A convolutional neural network and a long short-term memory network extract features from the data set to obtain action features and the corresponding temporal features, a network model is constructed from the action features and the temporal features, and multiple modalities of information are input into the network model for feature fusion and classification to complete human action recognition. Recognizing human actions with multiple modalities and fusing them enhances the accuracy and robustness of the model in real scenes and improves the stability and accuracy of recognition.

Description

An action recognition method and system based on multimodal sequence fusion

Technical Field

The present invention belongs to the technical field of action recognition, and in particular relates to an action recognition method and system based on multimodal sequence fusion.

Background Art

Human behavior recognition has long been a hot topic in the field of human-computer interaction. It has broad application prospects and can bring good economic benefits, and many practical scenarios depend on it, such as recognizing dangerous activities in video surveillance systems or sensing human behavior in automatic navigation systems to enable safe operation. Because daily human activities are complex and diverse, small changes in movement may produce completely different behaviors, and behaviors change with the environment. Although action recognition has been widely applied in many areas of society, many problems in this field remain to be solved in the real world, such as viewpoint changes and differences in action scale. At the same time, how to quickly and effectively capture the intrinsic relationships within multimodal action information and model them efficiently is also a challenging problem.

Summary of the Invention

In view of this, the present invention provides an action recognition method and system based on multimodal sequence fusion that can improve the accuracy of action recognition, realize multimodal data fusion and effectively control the number of network parameters, so as to solve the above technical problems. The following technical solutions are specifically adopted.

In a first aspect, the present invention provides an action recognition method based on multimodal sequence fusion, comprising the following steps:

obtaining a human action video to be recognized, and annotating the actions in the action video to obtain video frames, wherein the action annotation includes semantic segmentation and timeline segmentation labels of the actions;

obtaining the spatial positions corresponding to the human actions and detecting the spatial candidate-box coordinates and action category of each frame, using the correlation between consecutive frames to perform temporal action detection and locate the time period in which an action occurs, connecting the detection results of each frame to form the spatiotemporal tube of the action, and preprocessing the video frames to obtain a data set corresponding to the human actions;

using a convolutional neural network and a long short-term memory network to extract features from the data set to obtain behavioral features and the temporal features corresponding to the behavioral features;

constructing a network model according to the behavioral features and the temporal features, and inputting multiple modalities of information into the network model for feature fusion and classification to complete human action recognition.

As a further improvement of the above technical solution, inputting multiple modalities of information into the network model for feature fusion and classification to complete human action recognition includes:

treating the interaction information between the multimodal data as the common features contained in the multiple modalities, and using a tensor fusion algorithm to take the outer product of the feature vectors corresponding to the different modalities, obtaining an information-association tensor over the multimodal feature elements;

When computing the outer product, a constant 1 is appended to each feature vector so that the unimodal input features are preserved in the network model. With the three modal feature vectors denoted a, g and v and ⊗ denoting the outer product, the fused tensor is [a; 1] ⊗ [g; 1] ⊗ [v; 1];
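As an illustration of the outer-product fusion just described, the following minimal Python sketch (not taken from the patent) builds the information-association tensor from three modality vectors with a constant 1 appended to each; the vector names a, g and v follow the text, while the dimensions are arbitrary.

```python
import numpy as np

def tensor_fusion(a, g, v):
    """Outer-product fusion of three modality feature vectors.

    A constant 1 is appended to each vector so that unimodal and bimodal
    interaction terms survive in the fused tensor (dimensions are illustrative).
    """
    a1 = np.concatenate([a, [1.0]])  # (da + 1,)
    g1 = np.concatenate([g, [1.0]])  # (dg + 1,)
    v1 = np.concatenate([v, [1.0]])  # (dv + 1,)
    # Three-way outer product -> information-association tensor of shape (da+1, dg+1, dv+1)
    return np.einsum('i,j,k->ijk', a1, g1, v1)

z = tensor_fusion(np.random.randn(8), np.random.randn(6), np.random.randn(10))
print(z.shape)  # (9, 7, 11)
```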

Tucker decomposition is used to compress the original tensor, expressing the weight tensor as the product of four orthogonal matrices and a core tensor: the fourth-order weight tensor τ is decomposed as τ = (((τ_c ×_1 W_a) ×_2 W_g) ×_3 W_v) ×_4 W_0. The decomposed core tensor τ_c models the interaction of the three modalities; its dimensions constrain the number of parameters and control the complexity of the whole fusion while preserving the mapping from all feature vectors to the fused feature. The resulting trilinear model computes the fused feature as z = ((τ_c ×_1 (W_a a)) ×_2 (W_g g)) ×_3 (W_v v), where W_a, W_g and W_v project the respective feature vectors into their own low-dimensional spaces of sizes t_a, t_g and t_v; the larger t_a, t_g and t_v are, the more parameters must be trained and the higher the model complexity, and finally the dimensionality of the fused feature vector is controlled through W_0.

As a further improvement of the above technical solution, using a tensor fusion algorithm to take the outer product of the feature vectors corresponding to multiple different modal data includes:

after the n-modal fused feature tensor is linearly mapped, an (n+1)-dimensional weight tensor is obtained; applying Tucker decomposition to it yields n+1 orthogonal mapping matrices and a core tensor. A rank constraint is then introduced for further decomposition, in which the core tensor is expressed as the Hamiltonian product of a series of tensors, with one rank-decomposition factor for each of the n modalities.
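The following Python sketch illustrates the general idea of a rank-constrained factorized fusion; the exact factorization in the patent is given only as image formulas, so the per-modality factors, the rank value and the element-wise combination used here are assumptions rather than the patented construction.

```python
import numpy as np

def low_rank_fusion(feats, factors):
    """Rank-constrained multimodal fusion sketch.

    `feats` are modality vectors already padded with a trailing 1, and
    `factors[i]` has shape (rank, d_i + 1, d_out).  Instead of materializing the
    full (n+1)-dimensional weight tensor, each rank slice is applied per modality,
    the per-modality results are combined element-wise, and the slices are summed.
    """
    rank, _, d_out = factors[0].shape
    fused = np.zeros(d_out)
    for r in range(rank):
        term = np.ones(d_out)
        for x, w in zip(feats, factors):
            term *= x @ w[r]  # element-wise combination across modalities
        fused += term
    return fused

rng = np.random.default_rng(0)
dims, rank, d_out = [8, 6, 10], 4, 16
feats = [np.concatenate([rng.standard_normal(d), [1.0]]) for d in dims]
factors = [rng.standard_normal((rank, d + 1, d_out)) for d in dims]
print(low_rank_fusion(feats, factors).shape)  # (16,)
```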

As a further improvement of the above technical solution, constructing a network model according to the behavioral features and the temporal features includes:

After multimodal feature fusion, a multimodal semantic feature representation is obtained and fed into a neural network layer to produce the final cross-modal semantic feature representation h_i. At each time point i, a series of multi-scale temporal candidate segments is preset, where the j-th candidate segment at time point i has preset start and end time boundaries, W_j denotes the preset temporal width of the j-th segment, and the total number of candidate segments is fixed in advance;

The confidence score of each candidate segment is evaluated with a sigmoid activation function σ as cs_i = σ(Conv1d(h_i)), where cs_i contains the scores of the candidate segments at time point i and each score represents the similarity between the video segment and the text description. At the same time, a predicted temporal boundary offset is computed for each candidate segment; the predicted start and end offsets at time point i, applied to the preset boundaries, give the final predicted segment j at time point i.
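A hedged PyTorch sketch of the candidate-segment scoring described above is given below; the widths, channel sizes and convolution kernel size are illustrative assumptions, and only the sigmoid confidence head cs_i = σ(Conv1d(h_i)) and an offset head follow the text.

```python
import torch
import torch.nn as nn

class SegmentProposer(nn.Module):
    """Scores multi-scale candidate segments at every time point.

    `widths` plays the role of the preset segment widths W_j; a 1-D convolution
    over the cross-modal features h yields a sigmoid confidence and two boundary
    offsets per candidate width at each time point (all sizes are illustrative).
    """
    def __init__(self, feat_dim=256, widths=(4, 8, 16, 32)):
        super().__init__()
        self.widths = widths
        k = len(widths)
        self.score_conv = nn.Conv1d(feat_dim, k, kernel_size=3, padding=1)
        self.offset_conv = nn.Conv1d(feat_dim, 2 * k, kernel_size=3, padding=1)

    def forward(self, h):                       # h: (batch, feat_dim, T)
        cs = torch.sigmoid(self.score_conv(h))  # (batch, k, T) confidence scores
        offsets = self.offset_conv(h)           # (batch, 2k, T) start/end offsets
        return cs, offsets

h = torch.randn(1, 256, 64)                     # 64 time points of fused features
cs, offsets = SegmentProposer()(h)
print(cs.shape, offsets.shape)                  # (1, 4, 64) and (1, 8, 64)
```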

As a further improvement of the above technical solution, the loss function consists of a loss term for the matching score between video segments and the text description and a loss term for the temporal boundary offsets. For each candidate temporal segment, the temporal intersection over union (IoU) with the target segment (s, e) is computed; if the IoU is smaller than a preset threshold λ it is set to 0, and if it is larger than λ the candidate segment is taken as a positive sample, otherwise as a negative sample. The matching loss is computed over the positive and negative candidate segments, where N_pos denotes the number of positive samples among the candidate temporal segments and N_neg denotes the number of negative samples;

A boundary regression strategy is used to adjust the temporal localization offsets: the IoU between each candidate segment and the target segment is computed, the set C_h of candidate temporal segments whose IoU exceeds a set threshold γ is selected, and the temporal boundary offsets of these candidate segments are computed, where (s, e) denotes the start and end time points of the given text description and the start and end time points of the candidate temporal video segments in C_h are used as references;

With δ = [δ_s, δ_e] denoting the ground-truth temporal localization offset and a corresponding predicted offset, the temporal boundaries of the current candidate segments are adaptively adjusted based on the ground-truth offset through a regression loss, where SL_1 denotes the L1 norm and N denotes the size of the set C_h.
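The following PyTorch sketch shows one plausible reading of the two loss terms; binary cross-entropy for the matching loss and the definition of the offsets as target-minus-candidate boundaries are assumptions, while the IoU thresholds λ and γ and the smooth-L1 regression over the set C_h follow the text.

```python
import torch
import torch.nn.functional as F

def temporal_iou(cands, target):
    """Temporal IoU between candidate segments `cands` (N, 2) and one target (s, e)."""
    inter = (torch.min(cands[:, 1], target[1]) - torch.max(cands[:, 0], target[0])).clamp(min=0)
    union = (cands[:, 1] - cands[:, 0]) + (target[1] - target[0]) - inter
    return inter / union.clamp(min=1e-6)

def matching_and_regression_loss(scores, pred_offsets, cands, target, lam=0.3, gamma=0.5):
    """Matching loss over positive/negative candidates plus boundary regression.

    Candidates with IoU > lam are positives (binary cross-entropy assumed for the
    matching term); candidates with IoU > gamma form the set C_h and receive a
    smooth-L1 loss on their start/end offsets (offsets defined as target - candidate).
    """
    iou = temporal_iou(cands, target)
    labels = (iou > lam).float()
    match_loss = F.binary_cross_entropy(scores, labels)

    keep = iou > gamma                                     # candidate set C_h
    if keep.any():
        true_offsets = target.unsqueeze(0) - cands[keep]   # [delta_s, delta_e]
        reg_loss = F.smooth_l1_loss(pred_offsets[keep], true_offsets)
    else:
        reg_loss = scores.new_zeros(())
    return match_loss + reg_loss

cands = torch.tensor([[1.0, 4.0], [2.0, 6.0], [0.0, 8.0]])
scores = torch.tensor([0.2, 0.7, 0.4])
print(matching_and_regression_loss(scores, torch.zeros(3, 2), cands, torch.tensor([2.0, 5.0])))
```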

As a further improvement of the above technical solution, using a convolutional neural network and a long short-term memory network to extract features from the data set to obtain behavioral features and the temporal features corresponding to the behavioral features includes:

The raw inertial sensor data are treated as an image of size time × channels. A long short-term memory network (LSTM) and a convolutional neural network (CNN) are used: a one-dimensional convolution captures the temporal signal structure within the convolution kernel window, and the CNN extracts the key behavioral features of the inertial sensor signal itself. The one-dimensional convolution is computed as y_{i_0} = f(Σ_{d_0=1}^{D} Σ_{n=1}^{N} w_n^{d_0} x_{i_0+n-1}^{d_0}), where N denotes the kernel length, D denotes the depth of the sensor data and of the kernel, w_n^{d_0} denotes the n-th weight at kernel depth d_0, x_{i_0}^{d_0} denotes the i_0-th element of the sensor signal at depth d_0, y_{i_0} denotes the i_0-th feature obtained from the sensor signal by the convolution, and f(·) denotes the activation function;

The size of the features obtained after pooling is (L_{i_0} + 2P − N)/S + 1, where L_{i_0} denotes the feature length of the current layer i_0, P denotes the padding size and S denotes the stride. Three convolution and pooling operations generate temporally low-dimensional high-level features, and the CNN-processed behavioral features are then fed, in chronological order, into the long short-term memory network. The long short-term memory network contains two LSTM layers; each LSTM layer uses unidirectional connections with 128 hidden units, converting the behavioral features obtained in the preceding part into 128-dimensional temporal features and dynamically modeling the temporal information.
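A minimal PyTorch sketch of the Conv1D-plus-LSTM extractor described above is shown below; the two unidirectional LSTM layers with 128 hidden units and the three convolution-pooling stages follow the text, while kernel sizes, channel counts and the six-channel inertial input are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CNNLSTMExtractor(nn.Module):
    """Conv1D + two-layer LSTM feature extractor for inertial data.

    The input is treated as (channels x time); three Conv1d + MaxPool1d stages
    produce low-dimensional high-level features, and a unidirectional two-layer
    LSTM with 128 hidden units turns them into 128-d temporal features.
    Kernel sizes, channel counts and the 6-channel input are illustrative.
    """
    def __init__(self, in_channels=6):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv1d(in_channels, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool1d(2),
            nn.Conv1d(64, 128, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool1d(2),
        )
        self.lstm = nn.LSTM(input_size=128, hidden_size=128, num_layers=2, batch_first=True)

    def forward(self, x):               # x: (batch, channels, time)
        feats = self.cnn(x)             # (batch, 128, time / 8)
        feats = feats.transpose(1, 2)   # LSTM expects (batch, time, features)
        out, _ = self.lstm(feats)       # (batch, time / 8, 128) temporal features
        return out

print(CNNLSTMExtractor()(torch.randn(2, 6, 128)).shape)  # torch.Size([2, 16, 128])
```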

As a further improvement of the above technical solution, connecting the detection results of each frame to form the spatiotemporal tube of the action and preprocessing the video frames to obtain the data set corresponding to the human actions includes:

The unsegmented original video is represented as X = {x_n}_{n=1}^{w}, where x_n denotes the n-th frame of video X and w denotes the number of frames in X. All actions contained in the video can be represented by a set of instances Ψ_g = {(t_{s,i}, t_{e,i})}_{i=1}^{N_g}, where N_g denotes the number of ground-truth action instances in video X, t_{s,i} and t_{e,i} denote the start and end points of the i-th action instance, and ψ_g denotes an action instance;

Serialized features are constructed from the input video and text; multi-scale candidate temporal video segments are generated at each time point of the video sequence features according to preset temporal lengths, and a temporal co-attention interaction network performs feature interaction and fusion between the candidate temporal segments and the text sequence features, yielding multimodal fused data embedded in the same feature space to obtain the corresponding data set.

As a further improvement of the above technical solution, obtaining the spatial positions corresponding to the human actions, detecting the spatial candidate-box coordinates and action category of each frame, and using the correlation between consecutive frames to perform temporal action detection includes:

Given an unsegmented video sequence V, the video signal is first sampled at equal intervals at a fixed frame rate to obtain an image frame sequence, which is divided into M non-overlapping video unit segments {v_1, ..., v_j, ..., v_M} of equal length. A positional encoding function adds extra temporal position information to the video images to form the video unit features, where d and d_v denote the dimension of the encoded video features and of the extracted video unit features, respectively;
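The sketch below illustrates splitting a frame sequence into M equal units and adding temporal position information; the patent does not specify the positional encoding function or how per-unit features are pooled, so the sinusoidal encoding and mean pooling used here are assumptions.

```python
import numpy as np

def video_units_with_position(frame_feats, M):
    """Split per-frame features into M units and add temporal position information.

    `frame_feats` has shape (num_frames, d).  Mean pooling per unit and a standard
    sinusoidal positional encoding are assumptions; the patent only states that a
    positional encoding function adds temporal position information.
    """
    d = frame_feats.shape[1]
    units = np.array_split(frame_feats, M)                   # M non-overlapping units
    unit_feats = np.stack([u.mean(axis=0) for u in units])   # (M, d) pooled unit features
    pos = np.arange(M)[:, None] / np.power(10000.0, np.arange(d)[None, :] / d)
    pe = np.where(np.arange(d) % 2 == 0, np.sin(pos), np.cos(pos))
    return unit_feats + pe                                    # (M, d) position-aware features

print(video_units_with_position(np.random.randn(120, 64), M=10).shape)  # (10, 64)
```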

A unit-level co-attention interaction layer is used to build the interaction information between video and text. With the feature representations of the input video and text description denoted V_in and S_in, linear mappings transform the d-dimensional feature vectors into query vectors (Q_f, Q_s), key vectors (K_f, K_s) and value vectors (V_f, V_s), where W_sk, W_fk, W_sq, W_fq, W_sv and W_fv denote learnable weight matrices. The Q_f features from the video modality are used as queries, the K_s and V_s features from the text modality are used as keys and values, and the similarity weight matrix between them is computed to obtain the weighted video features; the contextual information of the text and video features is incorporated into the feature vector at the current position to obtain the corresponding temporal features.
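A hedged PyTorch sketch of this unit-level co-attention is given below; scaled dot-product attention and the residual addition of the attended text values back onto the video units are assumptions, while the roles of Q_f, K_s and V_s follow the text.

```python
import torch
import torch.nn as nn

class UnitCoAttention(nn.Module):
    """Unit-level co-attention between video units and text tokens.

    Video features supply the queries Q_f, text features supply the keys K_s and
    values V_s; scaled dot-product similarity weights the text values and the
    result is added back to the video stream (dimensions are illustrative).
    """
    def __init__(self, d=256):
        super().__init__()
        self.w_fq = nn.Linear(d, d)  # W_fq
        self.w_sk = nn.Linear(d, d)  # W_sk
        self.w_sv = nn.Linear(d, d)  # W_sv
        self.d = d

    def forward(self, v_in, s_in):   # v_in: (B, M, d) video units, s_in: (B, L, d) text
        q_f, k_s, v_s = self.w_fq(v_in), self.w_sk(s_in), self.w_sv(s_in)
        attn = torch.softmax(q_f @ k_s.transpose(1, 2) / self.d ** 0.5, dim=-1)  # (B, M, L)
        return v_in + attn @ v_s     # text context fused into each video unit

print(UnitCoAttention()(torch.randn(1, 20, 256), torch.randn(1, 12, 256)).shape)  # (1, 20, 256)
```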

As a further improvement of the above technical solution, obtaining the human action video to be recognized and annotating the actions in the action video to obtain video frames includes:

Inertial sensors are used to capture the motion characteristics of body parts. Each time an unannotated video is read in, the camera view is switched for annotation; after an action is annotated, screenshots at the start and end of the action are used to check whether the annotation is correct, and if any part is labeled incorrectly, the camera is fine-tuned or the segment is re-annotated.

In a second aspect, the present invention provides an action recognition system based on multimodal sequence fusion, comprising:

a first acquisition module, configured to obtain a human action video to be recognized and annotate the actions in the action video to obtain video frames, wherein the action annotation includes semantic segmentation and timeline segmentation labels of the actions;

a second acquisition module, configured to obtain the spatial positions corresponding to the human actions, detect the spatial candidate-box coordinates and action category of each frame, use the correlation between consecutive frames to perform temporal action detection and locate the time period in which an action occurs, connect the detection results of each frame to form the spatiotemporal tube of the action, and preprocess the video frames to obtain a data set corresponding to the human actions;

a feature extraction module, configured to use a convolutional neural network and a long short-term memory network to extract features from the data set to obtain behavioral features and the temporal features corresponding to the behavioral features;

a recognition module, configured to construct a network model according to the behavioral features and the temporal features, and to input multiple modalities of information into the network model for feature fusion and classification to complete human action recognition.

The present invention provides an action recognition method and system based on multimodal sequence fusion. A human action video to be recognized is obtained and the actions in the video are annotated to obtain video frames; the spatial positions corresponding to the human actions are obtained and the spatial candidate-box coordinates and action category of each frame are detected; the correlation between consecutive frames is used to perform temporal action detection and locate the time period in which an action occurs; the detection results of each frame are connected to form the spatiotemporal tube of the action, and the video frames are preprocessed to obtain a data set corresponding to the human actions; a convolutional neural network and a long short-term memory network extract features from the data set to obtain behavioral features and the corresponding temporal features; a network model is constructed from the behavioral features and temporal features, and multiple modalities of information are input into the network model for feature fusion and classification to complete human action recognition. Recognizing human behavior with multiple modalities and fusing them enhances the accuracy and robustness of the model in real scenes and effectively reduces the semantic loss that occurs when the feature vectors are manipulated in later stages; the products formed inside the fused feature tensor fully exploit the complementary information of the different modalities, thereby improving the stability and accuracy of recognition.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required by the embodiments are briefly introduced below. It should be understood that the following drawings show only certain embodiments of the present invention and should therefore not be regarded as limiting its scope. For a person of ordinary skill in the art, other related drawings can be obtained from these drawings without creative effort.

FIG. 1 is a flow chart of the action recognition method based on multimodal sequence fusion provided by the present invention;

FIG. 2 is a structural block diagram of the action recognition system based on multimodal sequence fusion provided by the present invention.

DETAILED DESCRIPTION

Embodiments of the present invention are described in detail below; examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements with the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are only intended to explain the present invention and cannot be understood as limiting it.

Referring to FIG. 1, the present invention provides an action recognition method based on multimodal sequence fusion, comprising the following steps:

S1: obtaining a human action video to be recognized, and annotating the actions in the action video to obtain video frames, wherein the action annotation includes semantic segmentation and timeline segmentation labels of the actions;

S2: obtaining the spatial positions corresponding to the human actions and detecting the spatial candidate-box coordinates and action category of each frame, using the correlation between consecutive frames to perform temporal action detection and locate the time period in which an action occurs, connecting the detection results of each frame to form the spatiotemporal tube of the action, and preprocessing the video frames to obtain a data set corresponding to the human actions;

S3: using a convolutional neural network and a long short-term memory network to extract features from the data set to obtain behavioral features and the temporal features corresponding to the behavioral features;

S4: constructing a network model according to the behavioral features and the temporal features, and inputting multiple modalities of information into the network model for feature fusion and classification to complete human action recognition.

In this embodiment, inputting multiple modalities of information into the network model for feature fusion and classification to complete human action recognition includes: treating the interaction information between the multimodal data as the common features contained in the multiple modalities, and using a tensor fusion algorithm to take the outer product of the feature vectors corresponding to the different modalities to obtain an information-association tensor over the multimodal feature elements. When computing the outer product, a constant 1 is appended to each feature vector so that the unimodal input features are preserved in the network model; with the three modal feature vectors a, g and v and ⊗ denoting the outer product, the fused tensor is [a; 1] ⊗ [g; 1] ⊗ [v; 1]. Tucker decomposition is then used to compress the original tensor, expressing the weight tensor as the product of four orthogonal matrices and a core tensor: the fourth-order weight tensor τ is decomposed as τ = (((τ_c ×_1 W_a) ×_2 W_g) ×_3 W_v) ×_4 W_0. The decomposed core tensor τ_c models the interaction of the three modalities; its dimensions constrain the number of parameters and control the complexity of the whole fusion while preserving the mapping from all feature vectors to the fused feature. The resulting trilinear model projects each feature vector into its own low-dimensional space through W_a, W_g and W_v of sizes t_a, t_g and t_v; the larger t_a, t_g and t_v are, the more parameters must be trained and the higher the model complexity, and finally the dimensionality of the fused feature vector is controlled through W_0.

It should be noted that obtaining the human action video to be recognized and annotating the actions in the video to obtain video frames includes: using inertial sensors to capture the motion characteristics of body parts; each time an unannotated video is read in, the camera view is switched for annotation, and after an action is annotated, screenshots at the start and end of the action are used to check whether the annotation is correct; if any part is labeled incorrectly, the camera is fine-tuned or the segment is re-annotated. Action recognition includes offline action recognition and online action recognition. Offline action recognition determines the category of the human actions occurring in a video after the whole video sequence has been observed; this task assigns an action category label to each video according to a preset action list. Online action recognition is oriented toward practical scenarios and requires real-time processing of online video streams. Action detection includes temporal action detection and spatiotemporal action detection. For an untrimmed video sequence, the temporal action detection task is to locate the start and end time points of the target action and the corresponding action category; spatiotemporal action detection additionally needs to predict the spatial position where the action occurs. Temporal action detection includes generating candidate action segments with precise temporal boundaries and assigning action categories to the candidate temporal segments. The general pipeline of the spatiotemporal action detection task is: first capture the spatial position of the human action, that is, detect the spatial candidate-box coordinates and action category score of each frame; then use the correlation between consecutive frames to perform temporal action detection and locate the time period in which the action occurs; and finally connect the detection results of each frame to form the spatiotemporal tube of the action.

It should be understood that short-term action prediction uses the partially observed video segment to predict, early in the execution of an action, the category of the action taking place, while long-term action prediction infers from the human actions observed at the current moment which actions are likely to occur in the future. For the temporal action detection task, the mean average precision (mAP) is usually used to measure algorithm performance: the average precision (AP) of the detection results is first computed for each action category, and the mean of these values gives the mAP. A predicted temporal action segment is considered a correct detection only when its temporal intersection over union (tIoU) with the corresponding ground-truth segment exceeds a set threshold and the action category is predicted correctly. The temporal IoU of a predicted segment and its ground-truth segment is computed as tIoU = ζ(T_p ∩ T_g) / ζ(T_p ∪ T_g), where T_p and T_g denote the predicted temporal interval and the ground-truth interval, and the function ζ(·) computes the length of an interval. Through human action acquisition, action detection and action prediction, the accuracy of action recognition is improved and the convenience of human-computer interaction is enhanced.
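As a small illustration of the evaluation criterion above, the following Python function (not from the patent) checks whether a detection counts as correct; the threshold value is illustrative.

```python
def is_correct_detection(pred_seg, pred_cls, gt_seg, gt_cls, thresh=0.5):
    """A detection is correct when its temporal IoU with the ground-truth segment
    exceeds `thresh` and the predicted class matches (threshold is illustrative)."""
    inter = max(0.0, min(pred_seg[1], gt_seg[1]) - max(pred_seg[0], gt_seg[0]))
    union = (pred_seg[1] - pred_seg[0]) + (gt_seg[1] - gt_seg[0]) - inter
    tiou = inter / union if union > 0 else 0.0
    return tiou > thresh and pred_cls == gt_cls

print(is_correct_detection((2.0, 7.5), "wave", (3.0, 8.0), "wave"))  # True (tIoU = 0.75)
```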

Optionally, using a tensor fusion algorithm to take the outer product of the feature vectors corresponding to multiple different modal data includes: after the n-modal fused feature tensor is linearly mapped, an (n+1)-dimensional weight tensor is obtained; applying Tucker decomposition to it yields n+1 orthogonal mapping matrices and a core tensor. A rank constraint is then introduced for further decomposition, in which the core tensor is expressed as the Hamiltonian product of a series of tensors, with one rank-decomposition factor for each of the n modalities.

In this embodiment, constructing a network model according to the behavioral features and the temporal features includes: after multimodal feature fusion, a multimodal semantic feature representation is obtained and fed into a neural network layer to produce the final cross-modal semantic feature representation h_i; at each time point i, a series of multi-scale temporal candidate segments is preset, where the j-th candidate segment at time point i has preset start and end time boundaries, W_j denotes the preset temporal width of the j-th segment, and the total number of candidate segments is fixed in advance. The confidence score of each candidate segment is evaluated with a sigmoid activation function σ as cs_i = σ(Conv1d(h_i)), where cs_i contains the scores of the candidate segments at time point i and each score represents the similarity between the video segment and the text description; at the same time, a predicted temporal boundary offset is computed for each candidate segment, and the predicted start and end offsets at time point i, applied to the preset boundaries, give the final predicted segment j at time point i.

It should be noted that the loss function consists of a loss term for the matching score between video segments and the text description and a loss term for the temporal boundary offsets. For each candidate temporal segment, the temporal intersection over union (IoU) with the target segment (s, e) is computed; if the IoU is smaller than a preset threshold λ it is set to 0, and if it is larger than λ the candidate segment is taken as a positive sample, otherwise as a negative sample. The matching loss is computed over the positive and negative candidate segments, where N_pos denotes the number of positive samples among the candidate temporal segments and N_neg denotes the number of negative samples. A boundary regression strategy is used to adjust the temporal localization offsets: the IoU between each candidate segment and the target segment is computed, the set C_h of candidate temporal segments whose IoU exceeds a set threshold γ is selected, and the temporal boundary offsets of these candidate segments are computed, where (s, e) denotes the start and end time points of the given text description and the start and end time points of the candidate temporal video segments in C_h are used as references. With δ = [δ_s, δ_e] denoting the ground-truth temporal localization offset and a corresponding predicted offset, the temporal boundaries of the current candidate segments are adaptively adjusted based on the ground-truth offset through a regression loss, where SL_1 denotes the L1 norm and N denotes the size of the set C_h.

It should be understood that Tucker decomposition is the multilinear form of principal component analysis: every tensor can be (non-uniquely) represented as a core tensor, i.e., the principal-component factor, multiplied by factor matrices along all modes. Compared with CP decomposition, which requires estimating the rank and approximating the initial tensor, Tucker decomposition yields more accurate tensor decomposition results, and feature selection for each modal feature vector can be achieved by adjusting the dimensions of the core tensor. To further reduce the computational complexity of the fusion model and balance the complexity and expressiveness of the interactive fusion modeling, a structured sparsity constraint is introduced according to the sparsity of the core tensor, decomposing the weight core tensor into multiple factors; the rank constraint acts as regularization during training to prevent overfitting and allows the mapping from input to output to be adjusted flexibly.

Optionally, using a convolutional neural network and a long short-term memory network to extract features from the data set to obtain behavioral features and the corresponding temporal features includes:

The raw inertial sensor data are treated as an image of size time × channels. A long short-term memory network (LSTM) and a convolutional neural network (CNN) are used: a one-dimensional convolution captures the temporal signal structure within the convolution kernel window, and the CNN extracts the key behavioral features of the inertial sensor signal itself, where N denotes the kernel length, D denotes the depth of the sensor data and of the kernel, w_n^{d_0} denotes the n-th weight at kernel depth d_0, x_{i_0}^{d_0} denotes the i_0-th element of the sensor signal at depth d_0, the convolution output is the i_0-th feature obtained from the sensor signal, and f(·) denotes the activation function;

The size of the features obtained after pooling depends on the feature length of the current layer i_0, the padding size P and the stride S. Three convolution and pooling operations generate temporally low-dimensional high-level features, and the CNN-processed behavioral features are then fed, in chronological order, into the long short-term memory network, which contains two LSTM layers; each layer uses unidirectional connections with 128 hidden units, converting the behavioral features obtained in the preceding part into 128-dimensional temporal features and dynamically modeling the temporal information.

In this embodiment, a convolutional neural network that treats the raw inertial sensor data as an image exploits the local patterns inherent in the sensor image and convolves the image with shared kernels, but it extracts only part of the features and does not further process the temporal information hidden in the signal, ignoring the continuity of human behavior. A long short-term memory network that feeds the raw inertial sensor signal directly into the LSTM lacks integration of the sensor data, which makes the algorithm run relatively slowly; although the gating mechanism of the LSTM can, to a certain extent, solve the vanishing-gradient problem of recurrent neural networks, it cannot handle longer time series. The present scheme therefore uses a two-layer LSTM to obtain the contextual temporal information between different signal frames and uses the gating mechanism to selectively retain the behavioral information contained in the CNN-extracted features, so as to better excite the inertial sensor signal features over time, obtain the spatiotemporal features relevant to behavior recognition, and realize spatial-temporal behavior feature learning.

Optionally, connecting the detection results of each frame to form the spatiotemporal tube of the action and preprocessing the video frames to obtain the data set corresponding to the human actions includes:

The unsegmented original video is represented as X = {x_n}_{n=1}^{w}, where x_n denotes the n-th frame of video X and w denotes the number of frames in X. All actions contained in the video can be represented by a set of instances Ψ_g = {(t_{s,i}, t_{e,i})}_{i=1}^{N_g}, where N_g denotes the number of ground-truth action instances in video X, t_{s,i} and t_{e,i} denote the start and end points of the i-th action instance, and ψ_g denotes an action instance;

Serialized features are constructed from the input video and text; multi-scale candidate temporal video segments are generated at each time point of the video sequence features according to preset temporal lengths, and a temporal co-attention interaction network performs feature interaction and fusion between the candidate temporal segments and the text sequence features, yielding multimodal fused data embedded in the same feature space to obtain the corresponding data set.

In this embodiment, obtaining the spatial positions corresponding to the human actions, detecting the spatial candidate-box coordinates and action category of each frame, and using the correlation between consecutive frames to perform temporal action detection includes: given an unsegmented video sequence V, the video signal is first sampled at equal intervals at a fixed frame rate to obtain an image frame sequence, which is divided into M non-overlapping video unit segments {v_1, ..., v_j, ..., v_M} of equal length; a positional encoding function adds extra temporal position information to the video images to form the video unit features, where d and d_v denote the dimension of the encoded video features and of the extracted video unit features, respectively. A unit-level co-attention interaction layer is used to build the interaction information between video and text: with the feature representations of the input video and text description denoted V_in and S_in, linear mappings transform the d-dimensional feature vectors into query vectors (Q_f, Q_s), key vectors (K_f, K_s) and value vectors (V_f, V_s), where W_sk, W_fk, W_sq, W_fq, W_sv and W_fv denote learnable weight matrices. The Q_f features from the video modality are used as queries, the K_s and V_s features from the text modality are used as keys and values, and the similarity weight matrix between them is computed to obtain the weighted video features; the contextual information of the text and video features is incorporated into the feature vector at the current position to obtain the corresponding temporal features.

In this embodiment, a modality refers to a particular way of representing information, including the various sensory channels through which things are perceived, and multimodality refers to the combination of two or more modalities. The reason for performing multimodal fusion is that different modalities view a problem from different representations and different angles; multimodal data contain various kinds of overlapping and complementary information, so multiple modalities perform better than a single modality. In the field of human behavior recognition, acceleration, angular velocity and RGB video image data are heterogeneous types of data, each with its own characteristics: inertial sensors can only capture the motion characteristics of body parts and cannot accurately recognize fine movements, such as details of hand actions, whereas RGB video is affected by occlusion and illumination, and when the human body is occluded, recognition can only rely on the inertial sensors. Deep-learning feature-level multimodal fusion methods include concatenation fusion and additive fusion; concatenation fusion splices multiple modal feature vectors together and thus increases the dimensionality of the overall feature vector.

Referring to FIG. 2, the present invention provides an action recognition system based on multimodal sequence fusion, comprising:

a first acquisition module, configured to obtain a human action video to be recognized and annotate the actions in the action video to obtain video frames, wherein the action annotation includes semantic segmentation and timeline segmentation labels of the actions;

a second acquisition module, configured to obtain the spatial positions corresponding to the human actions, detect the spatial candidate-box coordinates and action category of each frame, use the correlation between consecutive frames to perform temporal action detection and locate the time period in which an action occurs, connect the detection results of each frame to form the spatiotemporal tube of the action, and preprocess the video frames to obtain a data set corresponding to the human actions;

a feature extraction module, configured to use a convolutional neural network and a long short-term memory network to extract features from the data set to obtain behavioral features and the temporal features corresponding to the behavioral features;

a recognition module, configured to construct a network model according to the behavioral features and the temporal features, and to input multiple modalities of information into the network model for feature fusion and classification to complete human action recognition.

In this embodiment, human behavior recognition classifies and recognizes the collected user motion information and data by certain means and methods in order to judge the user's activity state or detect the user's behavior. Time-series data carry both spatial and temporal structure, and failing to make full use of the temporal features would be a great loss for a behavior recognition model. By obtaining the human action video to be recognized, annotating the actions in the video to obtain video frames, obtaining the spatial positions corresponding to the human actions, detecting the spatial candidate-box coordinates and action category of each frame, using the correlation between consecutive frames to perform temporal action detection and locate the time period in which an action occurs, connecting the detection results of each frame to form the spatiotemporal tube of the action, preprocessing the video frames to obtain a data set corresponding to the human actions, using a convolutional neural network and a long short-term memory network to extract behavioral features and the corresponding temporal features from the data set, constructing a network model from the behavioral features and temporal features, and inputting multiple modalities of information into the network model for feature fusion and classification to complete human action recognition, the method recognizes human behavior with multiple modalities and fuses them. This enhances the accuracy and robustness of the model in real scenes and effectively reduces the semantic loss that occurs when the feature vectors are manipulated in later stages; the products formed inside the fused feature tensor fully exploit the complementary information of the different modalities, thereby improving the stability and accuracy of recognition.

In all of the examples shown and described herein, any specific value should be interpreted as merely illustrative rather than limiting; other examples of the exemplary embodiments may therefore use different values.

It should be noted that similar reference numerals and letters denote similar items in the following figures; once an item has been defined in one figure, it need not be further defined or explained in subsequent figures.

The embodiments described above express only several implementations of the present invention; their description is relatively specific and detailed, but this must not be understood as limiting the scope of the invention. It should be pointed out that a person of ordinary skill in the art may make several variations and improvements without departing from the concept of the present invention, and these all fall within the scope of protection of the present invention.

Claims (10)

1. An action recognition method based on multimodal sequence fusion, characterized in that it comprises the following steps:
acquiring a human action video to be recognized and annotating the actions in the video to obtain video frames, wherein the action annotation comprises semantic segmentation and timeline segmentation labels of the actions;
acquiring the spatial position corresponding to the human action, detecting the spatial candidate-box coordinates and action category of the action in each frame, performing temporal action detection using the correlation between consecutive frames, locating the time period in which the action occurs, and linking the per-frame detection results to form the spatio-temporal tube of the action, so that the video frames are preprocessed into a data set corresponding to the human action;
performing feature extraction on the data set with a convolutional neural network and a long short-term memory network to obtain behavioral features and the temporal features corresponding to the behavioral features;
building a network model from the behavioral features and the temporal features, and feeding multiple modalities of information into the network model for feature fusion and classification so as to complete human action recognition.
2. The action recognition method based on multimodal sequence fusion according to claim 1, characterized in that feeding multiple modalities of information into the network model for feature fusion and classification so as to complete human action recognition comprises:
treating the interaction information between the multimodal data as the common features shared by the modalities, and using a tensor fusion algorithm to perform an outer-product (volume) operation on the feature vectors of the different modalities to obtain an information-association tensor over the multimodal feature elements;
when computing the outer product, appending a constant dimension 1 to each feature vector so that the unimodal input features are preserved in the network model, expressed as (Figure FDA0003987109550000011), where the modal features are (Figure FDA0003987109550000012) and (Figure FDA0003987109550000013), and (Figure FDA0003987109550000014) denotes the outer-product operation; the outer product of the three feature vectors a, g and v is expressed as (Figure FDA0003987109550000015);
compressing the original tensor by Tucker decomposition and expressing the weight tensor as the product of four orthogonal matrices and a core tensor, so that the four-way weight tensor τ is decomposed as τ = ((t_c ×_1 W_a) ×_2 W_g) ×_3 W_v ×_4 W_0; the decomposed core tensor t_c (Figure FDA0003987109550000016) is used for the interaction of the three modalities; subject to the constraint on the number of parameters, its dimensions control the complexity of the whole fused modality while preserving the mapping from all feature vectors to the fused features; the trilinear model is expressed as (Figure FDA0003987109550000021), where (Figure FDA0003987109550000022) and (Figure FDA0003987109550000023) denote the projection of each feature vector into its own low-dimensional space; the larger t_a, t_g and t_v are, the more parameters must be trained and the higher the complexity of the model; finally, (Figure FDA00039871095500000217) controls the dimension of the fused feature vector.
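One compact way to picture the outer-product fusion of claim 2, with each modality vector padded by a constant 1 so that unimodal terms survive in the fused tensor, is the sketch below. The vector sizes and the final flattening are illustrative assumptions, not the patent's exact formulation.

```python
# Sketch of tensor fusion: append 1 to each modality vector, take the 3-way outer
# product, then flatten. All dimensions are made up for the example.
import torch

def tensor_fusion(a: torch.Tensor, g: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """a, g, v: (batch, d_a), (batch, d_g), (batch, d_v) modality feature vectors."""
    pad1 = lambda x: torch.cat([x, torch.ones(x.size(0), 1)], dim=1)   # append constant 1
    a1, g1, v1 = pad1(a), pad1(g), pad1(v)
    # Outer product over the three padded vectors:
    # z[b, i, j, k] = a1[b, i] * g1[b, j] * v1[b, k]
    z = torch.einsum("bi,bj,bk->bijk", a1, g1, v1)
    return z.flatten(1)          # (batch, (d_a+1)*(d_g+1)*(d_v+1))

z = tensor_fusion(torch.randn(4, 8), torch.randn(4, 6), torch.randn(4, 10))
print(z.shape)  # torch.Size([4, 693])  i.e. 9 * 7 * 11
```

Because this fused tensor grows multiplicatively with the modality dimensions, the Tucker/rank-constrained compression described in claims 2 and 3 is what keeps the parameter count manageable.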
3. The action recognition method based on multimodal sequence fusion according to claim 2, characterized in that using the tensor fusion algorithm to perform the outer-product (volume) operation on the feature vectors of the different modalities comprises:
linearly mapping the n-modality fused feature tensor to obtain an (n+1)-way weight tensor (Figure FDA0003987109550000024); applying tensor Tucker decomposition to it to obtain n+1 orthogonal mapping matrices and a core tensor, expressed as (Figure FDA0003987109550000025); letting (Figure FDA0003987109550000026), then y = z^T W_0 (Figure FDA0003987109550000027);
further introducing a rank constraint into the decomposition, (Figure FDA0003987109550000028), where (Figure FDA0003987109550000029) denotes the Hadamard (element-wise) product over a series of tensors and (Figure FDA00039871095500000210) is the rank-decomposition factor of the i-th modality.
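The rank-constrained decomposition of claim 3 is in the spirit of low-rank multimodal fusion, where the large weight tensor is never materialized: each (1-padded) modality vector is projected by rank-many factors and the projections are combined element-wise, then summed over the rank dimension. The sketch below is one common reading of that idea; the rank, the output dimension and the initialization are assumptions.

```python
# Sketch of rank-constrained fusion: project each padded modality vector with R
# rank factors, combine modalities by Hadamard product, sum over the rank axis.
import torch
import torch.nn as nn

class LowRankFusion(nn.Module):
    def __init__(self, dims, out_dim: int = 64, rank: int = 4):
        super().__init__()
        # one (rank, d_m + 1, out_dim) factor per modality (the +1 is the constant pad)
        self.factors = nn.ParameterList(
            [nn.Parameter(torch.randn(rank, d + 1, out_dim) * 0.1) for d in dims]
        )
        self.rank_weights = nn.Parameter(torch.randn(rank))

    def forward(self, xs):
        fused = None
        for x, W in zip(xs, self.factors):
            x1 = torch.cat([x, torch.ones(x.size(0), 1)], dim=1)    # (B, d+1)
            proj = torch.einsum("bd,rdo->rbo", x1, W)               # (R, B, out)
            fused = proj if fused is None else fused * proj         # Hadamard across modalities
        return torch.einsum("r,rbo->bo", self.rank_weights, fused)  # weighted sum over rank

fusion = LowRankFusion(dims=[8, 6, 10])
y = fusion([torch.randn(4, 8), torch.randn(4, 6), torch.randn(4, 10)])
print(y.shape)  # torch.Size([4, 64])
```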
4. The action recognition method based on multimodal sequence fusion according to claim 1, characterized in that building a network model from the behavioral features and the temporal features comprises:
obtaining, after multimodal feature fusion, the multimodal semantic feature representation (Figure FDA00039871095500000211) (Figure FDA00039871095500000212), and feeding it into a neural-network layer to obtain the final cross-modal semantic feature representation (Figure FDA00039871095500000213);
presetting, at each time point i, a series of multi-scale candidate temporal segments (Figure FDA00039871095500000214), where (Figure FDA00039871095500000215) denotes the start and end time boundaries of the j-th candidate segment at time point i, W_j denotes the preset temporal width of the j-th segment, and (Figure FDA00039871095500000216) denotes the total number of candidate segments;
evaluating the confidence scores of the candidate segments through a sigmoid activation function (σ) as cs_i = σ(Conv1d(h_i)), where (Figure FDA0003987109550000031) denotes the scores of the (Figure FDA0003987109550000032) candidate segments at time point i, the score representing the similarity between the video segment and the text description; at the same time, computing for each candidate segment its corresponding predicted temporal boundary offsets as (Figure FDA0003987109550000033), where (Figure FDA0003987109550000034) denotes the predicted start and end temporal offsets at time point i; finally, the predicted segment j at time point i is expressed as (Figure FDA0003987109550000035).
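A minimal sketch of the kind of proposal head described in claim 4: at every time step, one Conv1d branch produces a sigmoid confidence score per anchor width, and a second branch predicts start/end offsets that refine each candidate. The anchor widths, feature size and the way offsets are applied here are illustrative assumptions.

```python
# Sketch: per-time-step candidate segments of preset widths, scored with a sigmoid
# Conv1d head, plus predicted start/end offsets that refine each candidate.
import torch
import torch.nn as nn

class ProposalHead(nn.Module):
    def __init__(self, feat_dim: int = 256, widths=(8, 16, 32)):
        super().__init__()
        self.widths = widths
        k = len(widths)
        self.score = nn.Conv1d(feat_dim, k, kernel_size=3, padding=1)        # cs_i per anchor
        self.offset = nn.Conv1d(feat_dim, 2 * k, kernel_size=3, padding=1)   # start/end offsets

    def forward(self, h):                        # h: (batch, feat_dim, T) fused features
        B, _, T = h.shape
        cs = torch.sigmoid(self.score(h))        # (B, K, T) confidence per anchor width
        off = self.offset(h).view(B, len(self.widths), 2, T)
        t = torch.arange(T, dtype=h.dtype)
        props = []
        for j, w in enumerate(self.widths):      # anchor (t - w/2, t + w/2) refined by offsets
            start = t - w / 2 + off[:, j, 0]
            end = t + w / 2 + off[:, j, 1]
            props.append(torch.stack([start, end, cs[:, j]], dim=-1))
        return torch.stack(props, dim=1)         # (B, K, T, 3): start, end, score

head = ProposalHead()
out = head(torch.randn(2, 256, 50))
print(out.shape)  # torch.Size([2, 3, 50, 3])
```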
5. The action recognition method based on multimodal sequence fusion according to claim 4, characterized in that the loss function consists of a loss for the matching score between the video segment and the text description and a loss for the temporal boundary offsets; the temporal intersection-over-union IoU between each candidate temporal segment (Figure FDA0003987109550000036) and the target segment (s, e) is computed; if it is smaller than a preset threshold λ the IoU is set to 0, and if it is larger than the threshold λ the candidate segment is determined to be a positive sample, otherwise a negative sample; the matching loss is expressed as (Figure FDA0003987109550000037), where N_pos denotes the number of positive candidate temporal segments and N_neg the number of negative samples;
a boundary-regression strategy is adopted to adjust the temporal localization offsets: the IoU between each candidate segment and the target segment is computed, the set C_h of candidate temporal segments whose IoU exceeds a set threshold γ is selected, and the temporal boundary offsets of these candidate segments are computed as (Figure FDA0003987109550000038), where (s, e) denotes the start and end time points of the given text description and (Figure FDA0003987109550000039) the start and end time points of the corresponding candidate temporal video segments in C_h;
δ = [δ_s, δ_e] denotes the ground-truth temporal localization offsets and (Figure FDA00039871095500000310) the predicted temporal localization offsets; the temporal boundaries of the current candidate segments are adaptively adjusted on the basis of the ground-truth offsets, (Figure FDA00039871095500000311), where SL_1 denotes the L_1 norm and N the size of the set C_h.
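The two loss terms of claim 5 can be pictured as follows: candidates are labelled positive or negative by their temporal IoU against the ground-truth segment, the matching score is trained with a class-balanced binary cross-entropy, and the boundary offsets of high-IoU candidates are regressed. The thresholds, the balancing weights and the use of a smooth-L1 regression criterion are assumptions of this sketch rather than the patent's exact definitions.

```python
# Sketch of the matching + boundary-regression losses driven by temporal IoU.
import torch
import torch.nn.functional as F

def temporal_iou(cand: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """cand: (N, 2) [start, end]; target: (2,) [s, e]. Returns (N,) IoU."""
    inter = (torch.minimum(cand[:, 1], target[1]) -
             torch.maximum(cand[:, 0], target[0])).clamp(min=0)
    union = (cand[:, 1] - cand[:, 0]) + (target[1] - target[0]) - inter
    return inter / union.clamp(min=1e-6)

def proposal_losses(cand, scores, pred_off, target, lam=0.5, gamma=0.7):
    iou = temporal_iou(cand, target)
    pos, neg = iou > lam, iou <= lam                  # positive / negative candidates
    labels = pos.float()
    # matching loss: binary cross-entropy balanced by the positive / negative counts
    w = torch.where(pos, 1.0 / pos.sum().clamp(min=1), 1.0 / neg.sum().clamp(min=1))
    l_match = F.binary_cross_entropy(scores, labels, weight=w, reduction="sum")
    # boundary regression only on candidates whose IoU exceeds gamma
    keep = iou > gamma
    if keep.any():
        true_off = target.unsqueeze(0) - cand[keep]   # ground-truth (delta_s, delta_e)
        l_reg = F.smooth_l1_loss(pred_off[keep], true_off)
    else:
        l_reg = scores.new_zeros(())
    return l_match, l_reg

cand = torch.tensor([[0.0, 10.0], [4.0, 14.0], [20.0, 30.0]])
scores = torch.tensor([0.8, 0.6, 0.1])
pred_off = torch.zeros(3, 2)
print(proposal_losses(cand, scores, pred_off, torch.tensor([5.0, 12.0])))
```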
6. The action recognition method based on multimodal sequence fusion according to claim 1, characterized in that performing feature extraction on the data set with a convolutional neural network and a long short-term memory network to obtain behavioral features and the temporal features corresponding to the behavioral features comprises:
treating the raw inertial-sensor data as an image of size time × channels, and using a long short-term memory network LSTM and a convolutional neural network CNN; a one-dimensional convolution operation captures the temporal signal structure inside the convolution-kernel window, and the convolutional neural network extracts the key behavioral features of the inertial-sensor signal itself; the one-dimensional convolution is computed as (Figure FDA0003987109550000041), where N denotes the kernel length, D the depth of the sensor data and of the kernel, (Figure FDA0003987109550000042) the n-th weight at depth d_0 of the one-dimensional kernel, (Figure FDA0003987109550000043) the i_0-th element of the sensor signal at depth d_0, (Figure FDA0003987109550000044) the i_0-th feature obtained from the sensor by the convolution operation, and f(*) the activation function;
the size of the features obtained after pooling is expressed as (Figure FDA0003987109550000045), where (Figure FDA0003987109550000046) denotes the feature length of the current layer i_0, P the padding size and S the stride; three rounds of convolution and pooling generate temporally low-dimensional high-level features, which are then fed in chronological order into the long short-term memory network; the long short-term memory network comprises two LSTM layers, each unidirectionally connected with 128 hidden units, converting the behavioral features obtained in the previous part into 128-dimensional temporal features and dynamically modeling the temporal information.
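For the feature-size bookkeeping in claim 6, the usual 1-D convolution/pooling relation is L_out = floor((L_in − N + 2P) / S) + 1, with N the kernel length, P the padding and S the stride. The short check below verifies that relation against an actual Conv1d/MaxPool1d stack and then feeds the pooled features into a two-layer, 128-unit LSTM as quoted in the claim; the concrete channel counts, kernel sizes and strides are otherwise illustrative assumptions.

```python
# Sketch: verify the 1-D feature-length formula and feed the pooled features,
# in time order, into a two-layer LSTM with 128 hidden units.
import torch
import torch.nn as nn

def out_len(l_in: int, n: int, p: int = 0, s: int = 1) -> int:
    """L_out = floor((L_in - N + 2P) / S) + 1 for a 1-D conv or pooling layer."""
    return (l_in - n + 2 * p) // s + 1

conv = nn.Conv1d(in_channels=6, out_channels=64, kernel_size=5, padding=2)
pool = nn.MaxPool1d(kernel_size=2, stride=2)
lstm = nn.LSTM(input_size=64, hidden_size=128, num_layers=2, batch_first=True)

x = torch.randn(1, 6, 120)                 # e.g. 6 inertial channels, 120 time steps
t_conv = out_len(120, n=5, p=2, s=1)       # 120 (padding preserves the length)
t_pool = out_len(t_conv, n=2, p=0, s=2)    # 60

y = pool(torch.relu(conv(x)))              # (1, 64, 60)
assert y.shape[-1] == t_pool
seq, _ = lstm(y.transpose(1, 2))           # (1, 60, 128) temporal features
print(t_conv, t_pool, seq.shape)
```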
7. The action recognition method based on multimodal sequence fusion according to claim 1, characterized in that linking the per-frame detection results to form the spatio-temporal tube of the action so that the video frames are preprocessed into a data set corresponding to the human action comprises:
representing the unsegmented raw video as (Figure FDA0003987109550000047), where x_n denotes the n-th frame of video X and w the number of frames in video X; the set of all actions contained in the video is represented by a set of instances (Figure FDA0003987109550000048), where N_g denotes the number of ground-truth action instances in video X, t_{s,i} and t_{e,i} denote the start and end points of action instance (Figure FDA0003987109550000049), and ψ_g denotes the action instance;
constructing serialized features from the input video and text, generating multi-scale candidate temporal video segments at each time point of the video sequence features according to preset temporal lengths, and using a temporal co-attention interaction network to perform feature interaction and fusion between the candidate temporal segments and the text sequence features, yielding multimodal fused data embedded in the same feature space so as to obtain the corresponding data set.
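The preprocessing of claim 7 enumerates, at every time point of the video sequence, candidate clips of several preset lengths before they are fused with the text features. A plain-Python sketch of that enumeration is shown below; the clip widths and the clamping of each clip to the video extent are assumptions of the example.

```python
# Sketch: enumerate multi-scale candidate temporal clips centred on every time step,
# clipped to the video extent; each candidate is (start, end, centre, width).
def candidate_clips(num_steps: int, widths=(4, 8, 16)):
    clips = []
    for i in range(num_steps):
        for w in widths:
            start = max(0.0, i - w / 2)
            end = min(float(num_steps - 1), i + w / 2)
            clips.append((start, end, i, w))
    return clips

clips = candidate_clips(num_steps=20)
print(len(clips))       # 60 candidates: 20 time steps x 3 widths
print(clips[:3])        # [(0.0, 2.0, 0, 4), (0.0, 4.0, 0, 8), (0.0, 8.0, 0, 16)]
```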
8. The action recognition method based on multimodal sequence fusion according to claim 1, characterized in that acquiring the spatial position corresponding to the human action, detecting the spatial candidate-box coordinates and action category of the action in each frame, and performing temporal action detection using the correlation between consecutive frames comprises:
presetting an unsegmented video sequence V, first sampling the video signal at equal intervals at a fixed frame rate to obtain an image-frame sequence and dividing it into M non-overlapping video clip units of equal length {v_1, ..., v_j, ..., v_M}, and using a positional-encoding function to add extra temporal position information to the video images; the video unit features (Figure FDA0003987109550000051) are expressed as (Figure FDA0003987109550000052) (Figure FDA0003987109550000053), where (Figure FDA0003987109550000054), and d and d_v denote the video encoding feature dimension and the extracted video-unit feature dimension;
using a unit co-attention interaction layer to build the interaction information between video and text, presetting the feature representations of the input video and of the text description as V_in and S_in, and transforming the d-dimensional feature vectors by linear mappings into query vectors (Q_f, Q_s), key vectors (K_f, K_s) and value vectors (V_f, V_s), expressed as (Figure FDA0003987109550000055), where W_sk, W_fk, W_sq, W_fq, W_sv and W_fv denote learnable weight matrices; taking the Q_f features from the video modality as the query vectors and the K_s and V_s features from the text modality as the key and value vectors respectively, computing the similarity weight matrix between them to obtain the weighted video features (Figure FDA0003987109550000056), and incorporating the contextual information of the text and video features into the feature vector at the current position to obtain the corresponding temporal features.
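The unit co-attention step of claim 8 is a cross-attention in which the video side supplies the queries and the text side the keys and values. Below is a minimal sketch of that single step; the feature dimension, the scaling factor and the single-head formulation are assumptions, and a full model would typically add the symmetric text-attends-to-video direction plus residual and normalization layers.

```python
# Sketch of one cross-modal attention step: video unit features query the text features.
import math
import torch
import torch.nn as nn

class VideoTextCoAttention(nn.Module):
    def __init__(self, d: int = 256):
        super().__init__()
        self.w_fq = nn.Linear(d, d)   # video -> queries  (Q_f)
        self.w_sk = nn.Linear(d, d)   # text  -> keys     (K_s)
        self.w_sv = nn.Linear(d, d)   # text  -> values   (V_s)
        self.d = d

    def forward(self, v_in, s_in):    # v_in: (B, M, d) video units, s_in: (B, L, d) text tokens
        q, k, v = self.w_fq(v_in), self.w_sk(s_in), self.w_sv(s_in)
        attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(self.d), dim=-1)  # (B, M, L)
        return attn @ v               # text-conditioned video features, (B, M, d)

coattn = VideoTextCoAttention()
out = coattn(torch.randn(2, 16, 256), torch.randn(2, 7, 256))
print(out.shape)  # torch.Size([2, 16, 256])
```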
9. The action recognition method based on multimodal sequence fusion according to claim 1, characterized in that acquiring the human action video to be recognized and annotating the actions in the video to obtain video frames comprises:
using inertial sensors to capture the motion characteristics of the body parts; each time an unannotated video is read in, switching cameras for annotation; after an action has been annotated, judging from the screenshots at the start and end of the action whether the annotation is correct, and if any annotation error is found, fine-tuning the camera or re-annotating.
10. An action recognition system based on multimodal sequence fusion using the action recognition method based on multimodal sequence fusion according to any one of claims 1-8, characterized in that it comprises:
a first acquisition module, configured to acquire the human action video to be recognized and annotate the actions in the video to obtain video frames, wherein the action annotation comprises semantic segmentation and timeline segmentation labels of the actions;
a second acquisition module, configured to acquire the spatial position corresponding to the human action, detect the spatial candidate-box coordinates and action category of the action in each frame, perform temporal action detection using the correlation between consecutive frames, locate the time period in which the action occurs, and link the per-frame detection results to form the spatio-temporal tube of the action, so that the video frames are preprocessed into a data set corresponding to the human action;
a feature extraction module, configured to perform feature extraction on the data set with a convolutional neural network and a long short-term memory network to obtain behavioral features and the temporal features corresponding to the behavioral features;
a recognition module, configured to build a network model from the behavioral features and the temporal features, and to feed multiple modalities of information into the network model for feature fusion and classification so as to complete human action recognition.
CN202211568552.3A 2022-12-08 2022-12-08 Action recognition method and system based on multi-modal sequence fusion Pending CN115937975A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211568552.3A CN115937975A (en) 2022-12-08 2022-12-08 Action recognition method and system based on multi-modal sequence fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211568552.3A CN115937975A (en) 2022-12-08 2022-12-08 Action recognition method and system based on multi-modal sequence fusion

Publications (1)

Publication Number Publication Date
CN115937975A true CN115937975A (en) 2023-04-07

Family

ID=86550146

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211568552.3A Pending CN115937975A (en) 2022-12-08 2022-12-08 Action recognition method and system based on multi-modal sequence fusion

Country Status (1)

Country Link
CN (1) CN115937975A (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116503957A (en) * 2023-06-26 2023-07-28 成都千嘉科技股份有限公司 Gas household operation behavior identification method
CN116503957B (en) * 2023-06-26 2023-09-15 成都千嘉科技股份有限公司 Gas household operation behavior identification method
CN117409538A (en) * 2023-12-13 2024-01-16 吉林大学 Wireless anti-fall alarm system and method for nursing care
CN117953543A (en) * 2024-03-26 2024-04-30 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Character interaction detection method based on multiple texts, terminal and readable storage medium
CN118366076A (en) * 2024-04-09 2024-07-19 浙江安得仕科技有限公司 Video vector fusion analysis method and system based on deep learning
CN118570877A (en) * 2024-07-03 2024-08-30 广东工业大学 A method and system for giant panda behavior recognition based on deep learning
CN118781526A (en) * 2024-09-10 2024-10-15 深圳市博科思智能有限公司 Video analysis method and system for monitoring terminal
CN118821029A (en) * 2024-09-18 2024-10-22 山东网信安全科技有限公司 A video network asset identity credible identification method and system
CN118821062A (en) * 2024-09-19 2024-10-22 山东大学 Multi-sensor fusion human intention recognition method, system, electronic device and medium
CN118917297A (en) * 2024-09-30 2024-11-08 南昌虚拟现实研究院股份有限公司 Method and device for acquiring action annotation data set

Similar Documents

Publication Publication Date Title
CN115937975A (en) Action recognition method and system based on multi-modal sequence fusion
Özyer et al. Human action recognition approaches with video datasets—A survey
Subetha et al. A survey on human activity recognition from videos
Abbas et al. Video scene analysis: an overview and challenges on deep learning algorithms
Devanne et al. 3-d human action recognition by shape analysis of motion trajectories on riemannian manifold
Caputo et al. SHREC 2021: Skeleton-based hand gesture recognition in the wild
Zhang et al. From discriminant to complete: Reinforcement searching-agent learning for weakly supervised object detection
Ibraheem et al. Survey on various gesture recognition technologies and techniques
Xiang et al. Expression recognition using fuzzy spatio-temporal modeling
Colaco et al. Facial keypoint detection with convolutional neural networks
Tzirakis et al. Time-series clustering with jointly learning deep representations, clusters and temporal boundaries
de Araujo Zeni et al. Real-time gender detection in the wild using deep neural networks
CN101354787A (en) A method for extracting features of target motion trajectory in intelligent visual surveillance retrieval
Hazourli et al. Deep multi-facial patches aggregation network for facial expression recognition
Zerrouki et al. Deep learning for hand gesture recognition in virtual museum using wearable vision sensors
Chen et al. A multi-scale fusion convolutional neural network for face detection
Sheeba et al. Hybrid features-enabled dragon deep belief neural network for activity recognition
Doždor et al. TY-Net: Transforming YOLO for hand gesture recognition
Hossain et al. A hybrid deep learning framework for daily living human activity recognition with cluster-based video summarization
Arya et al. Enhancing Human Pose Estimation: A Data-Driven Approach with MediaPipe BlazePose and Feature Engineering Analysing
CN116561649A (en) Method and system for diver's motion state recognition based on multi-source sensor data
Zerrouki et al. Exploiting deep learning-based LSTM classification for improving hand gesture recognition to enhance visitors’ museum experiences
Ghaderi et al. Weakly supervised pairwise Frank–Wolfe algorithm to recognize a sequence of human actions in RGB-D videos
Chong et al. Modeling video-based anomaly detection using deep architectures: Challenges and possibilities
Moayedi et al. Human action recognition: Learning sparse basis units from trajectory subspace

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination