CN117173758A - Learning attention state assessment method based on multidimensional feature fusion network


Info

Publication number
CN117173758A
CN117173758A (application CN202211662783.0A)
Authority
CN
China
Prior art keywords
feature
learner
module
graph
attention state
Prior art date
Legal status
Pending
Application number
CN202211662783.0A
Other languages
Chinese (zh)
Inventor
田斌
李少义
黎曦
罗芷萱
侯常辉
刘婷婷
刘海
肖振华
Current Assignee
Nanchang Institute of Technology
Original Assignee
Nanchang Institute of Technology
Priority date
Filing date
Publication date
Application filed by Nanchang Institute of Technology filed Critical Nanchang Institute of Technology
Priority to CN202211662783.0A
Publication of CN117173758A
Legal status: Pending


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Image Analysis (AREA)

Abstract

To address the difficulty a learner faces in assessing his or her own attention state in real time, the invention discloses a learning attention state assessment method based on a multidimensional feature fusion network. The method comprises the following steps: 1) acquire learner video with a binocular imaging device (a short-wave infrared camera and a lidar scanner) mounted on the desk and divide it into multiple frames of images, while a hand-worn wearable device collects the learner's blood oxygen saturation and heart rate signals; 2) locate the face region and facial feature points in the learner's SWIR image and segment the 3D point cloud set of the head region; 3) input the face-region SWIR image and the head 3D point cloud set into the corresponding feature extraction networks to obtain feature topology graphs, fuse them with a self-attention weighting module, and input the result into a Cauchy label distribution regression module to obtain the learner's head pose angles; at the same time, extract blood oxygen saturation and heart rate variability features and determine the learner's fatigue level; 4) comprehensively evaluate the attention state from the learner's head pose angles, the changes in facial feature points and the fatigue level, and remind the learner if attention is not focused; 5) compile statistics on the attention state over the learning session and feed back a statistical analysis report. By comprehensively evaluating the learner's attention state and providing statistical feedback, the invention helps the learner improve concentration and develop good learning habits.

Description

Learning attention state assessment method based on multidimensional feature fusion network
Technical Field
The invention relates to the field of computer vision and behavior analysis, in particular to a learning attention state evaluation method based on a multidimensional feature fusion network.
Background
As learning resources become ever easier and more comprehensively available, improving personal ability through self-study is becoming the trend of future learning. In an unsupervised environment, however, and especially at home, many learners are easily distracted and learn inefficiently. In recent years, artificial intelligence technology has been widely applied in many fields thanks to its convenience and efficiency. In a self-study environment a learner cannot always notice and correct his or her own behaviour in time, so artificial intelligence, and in particular pose recognition technology, can be used to supervise the learner's state in real time and judge whether the learner's attention is focused, thereby better helping the learner develop good learning habits.
Head pose is an important cue for a learner's attention. By analysing changes in the learner's head pose angles during learning, combined with changes in facial feature points, the learner's attention can be judged effectively: if the head deflection angle points outside the desk or screen area, or if the learner frequently yawns or even closes his or her eyes, the system can detect this in time and remind the learner to concentrate. In addition, blood oxygen saturation and heart rate variability features reflect the learner's degree of fatigue well, and fatigue directly affects concentration. However, current head pose estimation still faces several challenges:
the learner has the problems that hands are blocked, hair styles are blocked, heads are blocked by clothes and the like easily in the learning process, and in addition, the problem of insufficient illumination of indoor scenes easily occurs. These all result in poor quality images that are acquired and complete information cannot be obtained. There is therefore a need for an image acquisition device that can acquire head pose information in multiple dimensions and is immune to illumination variations.
In the data sets currently used for training head pose angle estimation, the distribution of training samples is extremely unbalanced: there are not enough large-pose samples, and many images carry mislabelled head pose angles. As a result, robust network parameters cannot be trained.
Most existing head pose estimation methods regress the head pose angle from RGB images or from head 3D point cloud data alone. The accuracy of two-dimensional image-based methods is hard to improve further, while methods based on three-dimensional point clouds are usually too computationally expensive. A lightweight head pose estimation method that combines two-dimensional and three-dimensional features is therefore needed.
Disclosure of Invention
To meet this need for improvement over the prior art, the invention uses a binocular imaging device consisting of a short-wave infrared camera and a lidar scanner and provides a learning attention state assessment method based on a multidimensional feature fusion network. Combined with the learner's fatigue level, the method monitors in real time whether the head deflects towards the area outside the desk or screen, which indicates distraction, prompts the learner to concentrate in time, and generates a concentration report for the learning session to help the learner develop good learning habits.
The technical scheme adopted for solving the technical problems is as follows: a learning attention state evaluation method based on a multidimensional feature fusion network comprises the following steps:
acquiring learner video captured by a binocular imaging device (a short-wave infrared camera and a lidar scanner) on the desk and dividing it into multiple frames of images; acquiring the learner's blood oxygen saturation and heart rate signals with a hand-worn wearable device;
locating the face region and facial feature points in the learner's SWIR image, and segmenting the 3D point cloud set of the head region;
inputting the face-region SWIR image and the head 3D point cloud set into the corresponding feature extraction networks to obtain feature topology graphs, fusing them with a self-attention weighting module, and inputting the result into a Cauchy label distribution regression module to obtain the learner's head pose angles; extracting blood oxygen saturation and heart rate variability features and determining the learner's fatigue level;
comprehensively evaluating the attention state according to the learner's head pose angles, the changes in facial feature point positions and the fatigue level, and reminding the learner if attention is not focused;
compiling statistics on the attention state during the learning process and feeding back a statistical analysis report.
According to the scheme, the face region and facial feature point positioning module works as follows:
Step 1.1.1: each frame of the SWIR image of the subject is resized to 624×624 pixels and input into a lightweight Mask R-CNN network pre-trained on a face data set to obtain the face region (I_x, I_y, m, n);
Step 1.2.1: the cropped face-region SWIR image is input into the global coarse feature extraction network RG-Net, whose structure can be expressed as {conv1-res1-res2-res3-glDSC-fc}, where conv1 denotes a convolution layer, res a residual connection layer, glDSC a global channel-separable convolution and fc a fully connected layer; the network regresses the global coarse feature point coordinate vector P_0;
Step 1.2.2: the output feature map of the res1 layer of RG-Net is taken, and a p×q patch centred on each coarse feature point (x_j, y_j) is cropped to obtain the first-level refinement feature map F_R1; F_R1 is input into the local refinement network FL-Net to extract a feature vector and regress the first-level refined facial feature point coordinate vector P_1;
Step 1.2.3: the output feature map of the conv1 layer of RG-Net is taken, and a p×q patch centred on each coarse feature point (x_j, y_j) is cropped to obtain the second-level refinement feature map F_R2; F_R2 is input into the local refinement network FL-Net to extract a feature vector and regress the second-level refined facial feature point coordinate vector P_2^T, which is the final sparse facial feature point coordinate vector.
The head pose two-dimensional feature extraction model comprises a channel-separable convolution module, a pixel-space Transformer module, a fusion feature topology graph construction module and an adaptive graph convolution module. The channel-separable convolution module extracts local features from the pixel space of the preprocessed SWIR face-region image. The pixel-space Transformer extracts global pixel-space feature relationships from the local feature map. The adaptive graph convolution module updates the values of the graph vertices to obtain a head pose fusion feature topology graph of new dimension.
According to the scheme, the channel-separable convolution module works as follows:
Step 2.1.1: a batch of cropped 328×328 SWIR face-region images I_swir ∈ R^(N×H×W×C) is input into a dual-branch channel-separable convolution network to extract local image features;
Step 2.1.2: branch I has the structure {SC_MAX(16)-SC_1(32)-SC_MAX(32)}, where the SC_1 module has the structure [SC, BN, RL]. SC denotes channel-separable convolution, which extracts local features for each channel by point-wise convolution. BN denotes batch normalization of the batch of input images, applied to each of the C channels separately: for a channel with element (pixel-value) set Γ = {Γ_1, ..., Γ_{N×H×W}} and mean Γ̄, the batch-normalized value of an element Γ_i can be expressed as BN(Γ_i) = a·(Γ_i − Γ̄)/sqrt(Var(Γ) + ζ) + b, where ζ is a very small positive number that prevents the denominator (the standard deviation) from being 0, and a, b are trainable network parameters that scale and shift the normalized result. The RL activation function replaces negative elements with zero, making the feature map values easier to converge. SC_MAX applies local max pooling (Patch_Max) on top of SC_1, yielding the head pose local feature map I_s1_1 ∈ R^(N×H′×W′×C′);
Step 2.1.3: branch II has the structure {SC_AVE(16)-SC_2(32)-SC_AVE(32)}, where the SC_2 module has the structure [SC, BN, TH]. The TH activation function normalizes element values to the range (−1, 1), making the network easier to converge. SC_AVE applies local average pooling on top of SC_2, yielding the head pose local feature map I_s2_1 ∈ R^(N×H′×W′×C′).
According to the scheme, the pixel-space Transformer module is trained as follows:
Step 2.2.1: the local feature maps I_s1_1 and I_s2_1 are input into a dual-branch, two-stage pixel-space Transformer network to extract global pixel-space features and generate fusion feature maps;
Step 2.2.2: I_s1_1 is input into branch I, whose first stage has the structure {SC_MAX(32)-Transformer-Patch_Max} and whose second stage has the structure {SC_MAX(32)-Transformer}. The pixel-space Transformer layer is a cascade of three pixel-space Transformer encoders and extracts the global feature relationships of the pixel space;
Step 2.2.3: the pixel-space Transformer encoder stretches the output feature map I_sc ∈ R^(N×H″×W″×C′) of the channel-separable convolution layer SC into a three-dimensional embedding vector I_emb ∈ R^(N×A′×C′), where A′ = H″ × W″;
Step 2.2.4: for the embedding vector of each image (i ∈ [1, N]) a position code I_P is added to every element, i.e. every pixel point, with indices m ∈ [0, A′−1] and n ∈ [0, (C′−1)/2]; the embedding vector I_emb is updated to I_emb + I_P, which is input into the multi-head self-attention mapping module;
Step 2.2.5: the multi-head self-attention mapping module contains 8 self-attention heads. Each head obtains a self-mapping weight matrix from the input; the input is dot-multiplied with this weight matrix and passed through a nonlinear transformation to give the self-attention map of that head. The outputs of the 8 heads are combined to form the final output I_A of the multi-head self-attention mapping module;
Step 2.2.6: the output I_A of the multi-head self-attention mapping module passes through a residual normalization layer and a fully connected layer to give the output of the pixel-space Transformer encoder. Three pixel-space Transformer encoders are cascaded to finally obtain the head pose fusion feature map MAP_1 ∈ R^(N×H″′×W″′×C′);
Step 2.2.7: I_s2_1 is input into branch II, whose structure is similar to branch I but extracts different feature maps based on local averaging; its first stage is {SC_AVE(32)-Transformer-Patch_Ave} and its second stage is {SC_AVE(32)-Transformer}, finally yielding the head pose fusion feature map MAP_2 ∈ R^(N×H″′×W″′×C′).
According to the above scheme, the fusion feature topology graph construction module comprises construction of the fusion feature graph vertices V_M and of the topological connection matrix T. MAP_1 and MAP_2 are multiplied element-wise to give the overall fusion feature map MAP, which is mapped to a low-dimensional fusion feature vector M; N denotes the number of images in a batch, and a fusion feature topology graph is built for each frame separately. The value of a fusion feature graph vertex V_M is the fusion feature vector M of the single image, and the fusion feature topology graph shares the topological connection matrix T with the 3D point cloud topology graph. The fusion feature topology graph is constructed as G_2 = (V_M, T).
The head point cloud segmentation module works as follows: according to the two-dimensional coordinate information (I_x, I_y, m, n) provided by the face region detection box, each frame of the point cloud image is compared with the corresponding dense point cloud set pic, and the points falling in the range [I_x, I_y] to [I_x+m, I_y+n] are selected, giving the dense point cloud set of the head region pic_1 = {(x_1, y_1, z_1), ..., (x_n, y_n, z_n)}.
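A minimal NumPy sketch of this bounding-box filtering step (assuming the point cloud is already registered so that its first two coordinates align with the SWIR image plane; variable names are illustrative):

```python
import numpy as np

def segment_head_cloud(points, face_box):
    """Keep points whose projected (x, y) falls inside the face detection box.

    points   : (P, 3) array of (x, y, z) coordinates aligned with the SWIR image plane
    face_box : (I_x, I_y, m, n) - top-left corner and width/height of the face region
    """
    ix, iy, m, n = face_box
    mask = (points[:, 0] >= ix) & (points[:, 0] <= ix + m) & \
           (points[:, 1] >= iy) & (points[:, 1] <= iy + n)
    return points[mask]                 # pic_1, the dense head-region point cloud

cloud = np.random.rand(10000, 3) * 624  # stand-in for one frame's dense point cloud
head_cloud = segment_head_cloud(cloud, (200, 150, 180, 220))
```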
The head pose three-dimensional feature extraction model comprises a facial feature point 3D point cloud topology graph construction module and an adaptive graph convolution module. The facial feature point 3D point cloud topology graph construction module comprises construction of the 3D point cloud graph vertices V_D and of the topological connection matrix T. The adaptive graph convolution module extracts the weight relationship between every pair of topology graph vertices and updates the vertex values, yielding a head pose 3D point cloud topology graph of new dimension.
According to the scheme, the facial feature point 3D point cloud topology graph construction module works as follows:
Step 3.1.1: according to the two-dimensional coordinates P_2^T of the facial feature points, the corresponding 3D point cloud coordinates are selected from the point cloud set pic_1; the value of a 3D point cloud vertex V_D is the 3D coordinate of a facial key point, pic_key = (x_key_i, y_key_i, z_key_i), i = 1, ..., 25;
Step 3.1.2: for each graph vertex the 5 vertices closest in Euclidean space are found with a KD-Tree search and connected, building the topological connection matrix T ∈ R^(N×N), where N is the number of feature points; T(i, j) = 1 means the two graph vertices are connected, otherwise T(i, j) = 0;
Step 3.1.3: the 3D point cloud topology graph is constructed as G_1 = (V_D, T).
According to the scheme, the adaptive graph convolution module is trained as follows:
Step 4.1.1: the network structure of the adaptive graph convolution module is adaptive graph convolution layer - batch normalization layer - RL activation layer - 1-dimensional convolution layer - batch normalization layer - RL activation layer; it updates the graph vertex values v_i of the input feature topology graphs G to 192-dimensional feature values;
Step 4.1.2: the adaptive graph convolution layer selects, for each vertex V_n of the feature graph G, the K vertices nearest to it in its neighbourhood to form vertex pairs. For each vertex pair M channels are constructed and each channel computes a feature value independently; the feature values of the K vertex pairs are concatenated, and channel-wise max pooling gives the updated graph vertex, where K is set to 6 and M to 192.
The blood oxygen saturation and electrocardiogram (ECG) feature extraction module works as follows:
Step 5.1.1: for the blood oxygen saturation SpO2, compute the mean square deviation of the sampled values within one period, σ_sp = sqrt((1/N) Σ_{i=1}^{N} (sp_i − s̄p)²), where N is the number of samples, sp_i is the i-th sampled value and s̄p is the mean of the samples in the period;
Step 5.1.2: for the ECG signal, compute the standard deviation σ_RR of the intervals between adjacent R waves of consecutive heartbeat signals, where an interval is the spacing between two adjacent peaks; from the spectrogram over the period compute the spectral densities of the high-frequency band Θ_HF and the ultra-low-frequency band Θ_SLF and take their ratio γ = Θ_HF / Θ_SLF;
Step 5.1.3: analyse the changes of σ_sp, σ_RR and γ together. If Δσ_sp < 0.005, Δσ_RR < 10 ms and Δγ < 0.2, the fatigue level is 1: consciousness is awake and thinking is active. If Δσ_sp ∈ [0.005, 0.01), Δσ_RR ∈ [10 ms, 35 ms) and Δγ ∈ [0.2, 0.8), the fatigue level is 2: consciousness is slightly blurred and thinking is relaxed. If Δσ_sp ≥ 0.01, Δσ_RR ≥ 35 ms and Δγ ≥ 0.8, the fatigue level is 3: consciousness is blurred and thinking cannot be concentrated.
The self-attention weighting module works as follows:
Step 6.1.1: the self-attention weighting module comprises a self-attention layer, a fully connected layer and a softmax regression layer. The updated three-dimensional point cloud topology graph and the two-dimensional fusion feature topology graph obtained from the previous modules are input into the self-attention layer and updated;
Step 6.1.2: the fully connected layer maps the updated graphs to vectors of dimension 1×N′, and the softmax layer finally computes the weighting parameters α_1 and α_2 of the two graphs, giving the final weighted fusion feature topology graph.
The Cauchy label distribution regression module works as follows:
Step 7.1.1: the weighted fusion feature topology graph is mapped to a multidimensional feature vector through a fully connected layer, the precise head pose angles are regressed, and the mean absolute error (MAE) with respect to the true angles is computed as the loss function Loss_M;
Step 7.1.2: for each training image I_i the actual angle labels are converted into Cauchy label distributions; at the same time the module trains the network to generate three groups of parameters δ, η and ζ, giving the predicted Cauchy label probability distributions (P_A(I_i; δ), P_B(I_i; η), P_C(I_i; ζ));
Step 7.1.3: the spatial distance Loss_θ and the KL divergence between the predicted and actual Cauchy label probability distributions are computed as the loss function Loss_G, which is weighted with the loss function Loss_M to give the final loss function Loss_total = Loss_G + 0.06·Loss_M.
According to the scheme, the optimal network parameters are obtained in advance by training on a training set with this loss function. Feeding the learner's short-wave infrared image and 3D point cloud data into the pre-trained multidimensional feature fusion self-attention network yields the learner's real-time head pose angles Yaw, Pitch and Roll, from which it is judged whether the head points into the inattention zone. Combined with the positions of the facial feature points and the learner's fatigue level, the learner's concentration is evaluated comprehensively, and the learner is reminded if attention is not focused.
Overall, compared with the prior art, the invention has the following beneficial effects:
(1) The invention acquires short-wave infrared video images and 3D point cloud data separately, obtaining head pose information in multiple dimensions without being affected by illumination changes. Two-dimensional and three-dimensional head pose information is considered jointly, so a more accurate head pose angle can be regressed.
(2) The multidimensional feature fusion self-attention network uses the spatial information of facial key points to build topology graph structures, constructing one head pose topology graph from the two-dimensional features and another from the three-dimensional features. The two-dimensional head pose feature extraction combines convolution, which captures local information, with a pixel-space Transformer, which captures global information, yielding more comprehensive local-global two-dimensional fusion features. The Cauchy label distribution regression module fully exploits the similarity between adjacent head poses, which alleviates the lack of large-pose samples in the training set.
(3) To supplement the head pose angle in characterizing the learner's concentration, blood oxygen saturation and ECG signals are collected, and the changes of the corresponding parameters σ_sp, σ_RR and γ are analysed jointly to qualitatively judge the learner's fatigue level.
Drawings
FIG. 1 is a flow chart of a learning attention state assessment method based on a multidimensional feature fusion network according to an embodiment of the invention;
FIG. 2 is a schematic diagram of data acquisition in a home environment;
FIG. 3 is a schematic diagram of a multidimensional feature fusion network according to an embodiment of the invention.
Detailed Description
The present invention will be described in further detail below with reference to the drawings and embodiments, in order to make the objects, technical solutions and advantages of the invention clearer. It should be understood that the specific embodiments described here are for illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the invention described below may be combined with one another as long as they do not conflict.
As shown in FIG. 1, the embodiment of the invention is a learning attention state assessment method based on a multidimensional feature fusion network, which comprises the following steps:
step 1: and acquiring a learner video resource acquired by a binocular imaging device (a short-wave infrared camera and a laser radar scanner) on the office table, and dividing the learner video resource into multiple frames of images according to time sequence. Simultaneously, the blood oxygen saturation and heart rate signals of the learner are acquired through the hand wearable equipment.
Step 2: and (3) locating the face region and the facial feature points of the SWIR image of the learner, and simultaneously dividing the head region 3D point cloud set.
Step 3: inputting the SWIR image of the face area and the head 3D point cloud set into a corresponding head gesture two-dimensional, three-dimensional feature extraction network to obtain a feature topological graph, and inputting the feature topological graph into a Cauchy tag distribution regression module to obtain the head gesture angle of a learner after the feature topological graph is fused by a self-attention weighting module. And simultaneously extracting the blood oxygen saturation and heart rate variation characteristics, and judging the fatigue level of the learner.
According to the scheme, the blood oxygen saturation and ECG feature extraction module works as follows: taking 5 minutes as one period, compute the mean square deviation of the SpO2 samples within the period, σ_sp = sqrt((1/N) Σ_{i=1}^{N} (sp_i − s̄p)²), where N is the number of samples, sp_i the i-th sampled value and s̄p the mean of the samples in the period. At the same time, detect all peaks of the time-domain ECG waveform within the period by wavelet transform, compute the interval between every two adjacent peaks, and obtain the standard deviation σ_RR of the intervals between adjacent R waves of consecutive heartbeat signals. Convert the time-domain signal of the period to the frequency domain with a fast Fourier transform, analyse the spectrogram, compute the spectral densities of the high-frequency band Θ_HF and the ultra-low-frequency band Θ_SLF, and obtain the second parameter γ = Θ_HF / Θ_SLF.
Analyse the changes of σ_sp, σ_RR and γ together: if Δσ_sp < 0.005, Δσ_RR < 10 ms and Δγ < 0.2, the fatigue level is 1, consciousness is awake and thinking is active; if Δσ_sp ∈ [0.005, 0.01), Δσ_RR ∈ [10 ms, 35 ms) and Δγ ∈ [0.2, 0.8), the fatigue level is 2, consciousness is slightly blurred and thinking is relaxed; if Δσ_sp ≥ 0.01, Δσ_RR ≥ 35 ms and Δγ ≥ 0.8, the fatigue level is 3, consciousness is blurred and thinking cannot be concentrated.
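The period-level signal features described above could be computed roughly as follows (a sketch under clear assumptions: SciPy's generic peak finder stands in for the wavelet-based R-peak detector mentioned in the text, the spectrum is taken on the mean-removed ECG exactly as the text literally describes, and the sampling rate and band edges are illustrative values, not taken from the patent):

```python
import numpy as np
from scipy.signal import find_peaks, periodogram

def spo2_sigma(spo2_samples):
    """Mean square deviation of SpO2 samples within one 5-minute period."""
    sp = np.asarray(spo2_samples, dtype=float)
    return float(np.sqrt(np.mean((sp - sp.mean()) ** 2)))

def ecg_features(ecg, fs=250.0, hf_band=(0.15, 0.4), ulf_band=(0.0, 0.003)):
    """sigma_RR (std of R-R intervals, in ms) and gamma = HF / ULF spectral energy ratio."""
    # simple peak detection stands in for the wavelet-based R-wave detector of the patent
    peaks, _ = find_peaks(ecg, distance=int(0.4 * fs), height=np.percentile(ecg, 90))
    rr_ms = np.diff(peaks) / fs * 1000.0
    sigma_rr = float(np.std(rr_ms))

    freqs, psd = periodogram(ecg - np.mean(ecg), fs=fs)    # FFT-based spectrogram of the period
    hf = psd[(freqs >= hf_band[0]) & (freqs < hf_band[1])].sum()
    ulf = psd[(freqs > ulf_band[0]) & (freqs < ulf_band[1])].sum()
    gamma = float(hf / max(ulf, 1e-12))
    return sigma_rr, gamma
```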
As shown in FIG. 2, the learner studies at home while a short-wave infrared camera and a lidar scanner capture a video sequence of the learner's face. The multi-frame SWIR images and 3D point cloud images of the learner acquired in this scene provide an important data source for the head pose estimation module.
As shown in FIG. 3, in this embodiment the multidimensional feature fusion self-attention network comprises a face region and facial feature point positioning module, a head point cloud segmentation module, a head pose two-dimensional feature extraction module, a head pose three-dimensional feature extraction module, a self-attention weighting module and a Cauchy label distribution regression module.
According to the scheme, the face region and facial feature point positioning module works as follows:
Step 3.1.1: each frame of the SWIR image of the subject is resized to 624×624 pixels and input into a lightweight Mask R-CNN network pre-trained on a face data set to obtain the face region (I_x, I_y, m, n);
Step 3.1.2: each frame of the SWIR image is cropped according to the face region (I_x, I_y, m, n) and input into the sparse facial feature point extraction network, which consists of the global coarse feature point extraction network RG-Net and the cascaded local refinement network FL-Net;
Step 3.1.3: the cropped face-region SWIR image is input into RG-Net, whose structure can be expressed as {conv1-res1-res2-res3-glDSC-fc}, where conv1 denotes a convolution layer, res a residual connection layer, glDSC a global channel-separable convolution and fc a fully connected layer; the network regresses the global coarse feature point coordinate vector P_0;
Step 3.1.4: the output feature map of the res1 layer of RG-Net is taken, and a p×q patch centred on each coarse feature point (x_j, y_j) is cropped to obtain the first-level refinement feature map F_R1. F_R1 is input into the local refinement network FL-Net, which first reduces the multichannel feature map to a two-dimensional vector by convolution, then applies normalization and a ReLU nonlinear transformation, and finally regresses the first-level feature vector through a fully connected layer, giving the first-level refined facial feature point coordinate vector P_1;
Step 3.1.5: the output feature map of the conv1 layer of RG-Net is taken, and a p×q patch centred on each coarse feature point (x_j, y_j) is cropped to obtain the second-level refinement feature map F_R2, which is input into the local refinement network FL-Net to obtain the second-level feature vector and the second-level refined facial feature point coordinate vector P_2^T, the final sparse facial feature point coordinate vector. The extraction procedure can be expressed as:

P_l = P_{l−1} + FL_l(ψ(RG(I)_l, P_{l−1}))    (3)

where P_0 is the output of the global coarse feature point extraction network RG-Net, l denotes the level, FL_l denotes the local refinement network FL-Net cascaded l times, RG(I)_l denotes the output feature map of the l-th layer of RG-Net, and ψ(·) denotes feature reuse, i.e. constructing the p×q feature map centred on the coarse feature point (x_j, y_j).
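The cascaded coarse-to-fine update of equation (3) can be sketched as follows (illustrative only: rg_net and the fl_nets are stand-in callables with assumed interfaces rather than the patent's actual networks, the patch size p = q = 16 is an assumption, and how the layer index maps to res1/conv1 is simplified):

```python
import torch

def crop_patches(feature_map, points, p=16, q=16):
    """psi(.): cut a p x q patch of the feature map centred on each coarse feature point."""
    _, _, h, w = feature_map.shape
    patches = []
    for (x, y) in points.tolist():
        x0 = int(max(0, min(round(x) - p // 2, w - p)))
        y0 = int(max(0, min(round(y) - q // 2, h - q)))
        patches.append(feature_map[:, :, y0:y0 + q, x0:x0 + p])
    return torch.cat(patches, dim=0)               # (num_points, C, q, p)

def cascade_refine(image, rg_net, fl_nets):
    """Equation (3): P_l = P_{l-1} + FL_l(psi(RG(I)_l, P_{l-1}))."""
    points, layer_maps = rg_net(image)             # P_0 (num_points, 2) and per-layer feature maps
    for level, fl_net in enumerate(fl_nets, start=1):
        patches = crop_patches(layer_maps[level], points)
        points = points + fl_net(patches)          # residual refinement of the coordinates
    return points                                   # sparse facial feature point coordinates P_2^T
```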
The head pose two-dimensional feature extraction model comprises a channel-separable convolution module, a pixel-space Transformer module, a fusion feature topology graph construction module and an adaptive graph convolution module. The channel-separable convolution module converts the SWIR image into a multichannel local feature map, extracting the local features of the pixel space of the preprocessed SWIR face-region image. The pixel-space Transformer extracts global pixel-space feature relationships from the multichannel local feature map and generates a pixel-space fusion feature map. The fusion feature topology graph construction module constructs the fusion feature graph vertices V_M and the topological connection matrix T. The adaptive graph convolution module extracts the weight relationship between every pair of topology graph vertices and updates the vertex values, yielding a head pose fusion feature topology graph of new dimension.
According to the scheme, the channel-separable convolution module works as follows:
Step 3.2.1: the located face region window is adjusted and resized to 328×328, giving a batch of cropped SWIR face-region images I_swir ∈ R^(N×H×W×C), which is input into the dual-branch channel-separable convolution network to extract local image features;
Step 3.2.2: I_swir is input into branch I, whose structure is {SC_MAX(16), SC_1(32), SC_MAX(32)}, where the SC_1 module has the structure [SC, BN, RL]. SC denotes channel-separable convolution, which extracts local features by point-wise convolution per channel; BN denotes batch normalization of the batch of input images, computed as in step 2.1.2 above; the RL activation function replaces negative elements with zero, making the feature map values easier to converge. SC_MAX applies local max pooling Patch_Max on top of SC_1, yielding the head pose local feature map I_s1_1 ∈ R^(N×H′×W′×C′);
Step 3.2.3: I_swir is input into branch II, whose structure is {SC_AVE(16), SC_2(32), SC_AVE(32)}, where the SC_2 module has the structure [SC, BN, TH]. The TH activation function normalizes element values to (−1, 1), making the network easier to converge. SC_AVE applies local average pooling Patch_Ave on top of SC_2, yielding the head pose local feature map I_s2_1 ∈ R^(N×H′×W′×C′).
According to the scheme, the pixel-space Transformer module is trained as follows:
Step 3.3.1: the local feature maps I_s1_1 and I_s2_1 are input into the dual-branch, two-stage pixel-space Transformer network to extract global pixel-space features and generate fusion feature maps;
Step 3.3.2: I_s1_1 is input into branch I, whose first stage has the structure {SC_MAX(32)-Transformer-Patch_Max} and whose second stage has the structure {SC_MAX(32)-Transformer}. The pixel-space Transformer layer is a cascade of three pixel-space Transformer encoders and extracts the global feature relationships of the pixel space;
Step 3.3.3: the pixel-space Transformer encoder stretches the output feature map I_sc ∈ R^(N×H″×W″×C′) of the channel-separable convolution layer SC into a three-dimensional embedding vector I_emb ∈ R^(N×A′×C′), where A′ = H″ × W″;
Step 3.3.4: for the embedding vector of each image (i ∈ [1, N]) a position code I_P is added to every element, i.e. every pixel point, with indices m ∈ [0, A′−1] and n ∈ [0, (C′−1)/2]; the embedding vector I_emb is updated to I_emb + I_P, which is input into the multi-head self-attention mapping module;
Step 3.3.5: the multi-head self-attention mapping module contains 8 self-attention heads. The self-mapping weight matrix of each head is obtained from the input; the vector R_v and the key-value vector P_v are obtained by dot-multiplying the input I_emb + I_P with this weight matrix, and a nonlinear transformation gives the self-attention map of each head. The outputs of the 8 heads are combined to form the final output I_A of the multi-head self-attention mapping module;
Step 3.3.6: the output I_A of the multi-head self-attention mapping module passes through the residual normalization layer and the fully connected layer to give the output of the pixel-space Transformer encoder. The computation can be summarized as

I_TF = Norm(f(max(0, Norm(I_emb + I_A))) + Norm(I_emb + I_A))    (5)

where Norm normalizes the A′×C′ pixels of each layer to a standard normal distribution and f(·) denotes a linear transformation. Three pixel-space Transformer encoders are cascaded to finally obtain the head pose fusion feature map MAP_1 ∈ R^(N×H″′×W″′×C′);
Step 3.3.7: I_s2_1 is input into branch II, whose structure is similar to branch I but extracts different feature maps based on local averaging; its first stage is {SC_AVE(32)-Transformer-Patch_Ave} and its second stage is {SC_AVE(32)-Transformer}, finally yielding the head pose fusion feature map MAP_2 ∈ R^(N×H″′×W″′×C′).
According to the above scheme, the fusion feature topology graph construction module comprises construction of the fusion feature graph vertices V_M and of the topological connection matrix T. MAP_1 and MAP_2 are multiplied element-wise to give the overall fusion feature map MAP, which is mapped to a low-dimensional fusion feature vector M; N denotes the number of images in a batch, and a fusion feature topology graph is built for each frame separately. The value of a fusion feature graph vertex V_M is the fusion feature vector M of the single image, and the fusion feature topology graph shares the topological connection matrix T with the 3D point cloud topology graph. The fusion feature topology graph is constructed as G_2 = (V_M, T).
The head point cloud segmentation module works as follows: according to the two-dimensional coordinate information (I_x, I_y, m, n) provided by the face region detection box, each frame of the point cloud image is compared with the corresponding dense point cloud set pic, and the points falling in the range [I_x, I_y] to [I_x+m, I_y+n] are selected, giving the dense point cloud set of the head region pic_1 = {(x_1, y_1, z_1), ..., (x_n, y_n, z_n)}.
The head pose three-dimensional feature extraction model comprises a facial feature point 3D point cloud topology graph construction module and an adaptive graph convolution module. The facial feature point 3D point cloud topology graph construction module comprises construction of the 3D point cloud graph vertices V_D and of the topological connection matrix T. The adaptive graph convolution module extracts the weight relationship between every pair of topology graph vertices and updates the vertex values, yielding a head pose 3D point cloud topology graph of new dimension.
According to the scheme, the facial feature point 3D point cloud topology graph construction module works as follows:
Step 3.4.1: the 3D point cloud topology graph construction module comprises construction of the 3D point cloud graph vertices V_D and of the topological connection matrix T. According to the two-dimensional coordinates P_2^T of the facial feature points, the corresponding 3D point cloud coordinates are selected from the point cloud set pic_1; the value of a 3D point cloud vertex V_D is the 3D coordinate of a facial key point, pic_key = (x_key_i, y_key_i, z_key_i), i = 1, ..., 25;
Step 3.4.2: for each graph vertex the 5 vertices closest in Euclidean space are found with a KD-Tree search and connected, building the topological connection matrix T ∈ R^(N×N), where N is the number of feature points; T(i, j) = 1 means the two graph vertices are connected, otherwise T(i, j) = 0;
Step 3.4.3: the 3D point cloud topology graph is constructed as G_1 = (V_D, T).
According to the scheme, the adaptive graph convolution module is trained as follows:
Step 3.5.1: the network structure of the adaptive graph convolution module is adaptive graph convolution layer - batch normalization layer - RL activation layer - 1-dimensional convolution layer - batch normalization layer - RL activation layer. The adaptive graph convolution layer extracts the weight relationship between every pair of topology graph vertices and updates the vertex values accordingly; the 1-dimensional convolution layer further extracts the relationship along the sequence, and the batch normalization and RL activation make the network easier to converge. Finally, the graph vertex values v_i of the input feature topology graphs G are updated to 192-dimensional feature values;
Step 3.5.2: the adaptive graph convolution layer selects, for each vertex V_n of the feature graph G, the K vertices nearest to it in its neighbourhood to form vertex pairs. For each vertex pair M channels are constructed and each channel computes its feature value independently, using vector concatenation (where [A, B] denotes the concatenation of vectors A and B), the dot product ⊙, an MLP layer, and the nonlinear RL activation RL(·), which converts negative elements to 0;
Step 3.5.3: the feature value of each vertex is updated to an M-dimensional feature vector; the K vertex-pair feature values are concatenated and channel-wise max pooling gives the updated graph vertex, where K is set to 6 and M to 192.
According to the scheme, the self-attention weighting module works as follows:
Step 3.6.1: the self-attention weighting module comprises a self-attention layer, a fully connected layer and a softmax regression layer. The updated three-dimensional point cloud topology graph and the two-dimensional fusion feature topology graph obtained from the previous modules are input into the self-attention layer and updated;
Step 3.6.2: the fully connected layer, a linear transformation f(·), maps each updated graph to a vector of dimension 1×N′; the softmax function then computes the weighting parameters α_1 and α_2 of the two graphs, and the final weighted fusion feature topology graph is obtained as their weighted combination.
According to the scheme, the Cauchy label distribution regression module works as follows:
Step 3.6.3: the weighted fusion feature topology graph is mapped to a multidimensional feature vector through a fully connected layer, the precise head pose angles are regressed, and the mean absolute error (MAE) with respect to the true angles is computed as the loss function Loss_M;
Step 3.6.4: since the similarity between head poses differs along the Yaw, Pitch and Roll directions for the same change in head pose angle, the deflection range {−90°, ..., 0, ..., 90°} of the three directions is divided into 46, 100 and 62 segments respectively, i.e. the angles are encoded into the corresponding label sets A = {A_1, ..., A_45}, B = {B_1, ..., B_99} and C = {C_1, ..., C_61};
Step 3.6.5: for each training image I_i the actual angle labels are converted into Cauchy label distributions. For the yaw direction the distribution is centred on t_y, the code value corresponding to the true yaw angle, with the label width parameter δ_1 set to 4; for the pitch direction it is centred on t_p, the code value of the true pitch angle, with δ_2 set to 10; and for the roll direction it is centred on t_r, the code value of the true roll angle, with δ_3 set to 6. At the same time, the module trains the network to generate three groups of parameters δ, η and ζ, corresponding to the three label sets A, B and C respectively, giving the predicted Cauchy label probability distributions (P_A(I_i; δ), P_B(I_i; η), P_C(I_i; ζ));
Step 3.6.6: the spatial distance Loss_θ and the KL divergence between the predicted Cauchy label probability distributions and the actual Cauchy label distributions are computed as the loss function Loss_G, which is weighted with the loss function Loss_M to give the final loss function Loss_total = Loss_G + 0.06·Loss_M.
According to the scheme, the optimal network parameters are obtained in advance by training on a training set with this loss function; inputting the learner's short-wave infrared image and 3D point cloud data into the pre-trained multidimensional feature fusion self-attention network then yields the final head pose angles (Yaw, Pitch, Roll).
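As an illustration, the Cauchy label distribution for one angle direction and the combined loss could look like this (a sketch: the Cauchy density over the label bins, normalized to sum to 1, is an assumed concrete form of the label distribution described above, with δ_1 = 4 for yaw; the KL term stands in for Loss_G, and 0.06 is the weight given in the text):

```python
import torch
import torch.nn.functional as F

def cauchy_label_distribution(true_bin, num_bins, scale):
    """Soft label over angle bins: a Cauchy curve centred on the true bin, normalized to sum to 1."""
    i = torch.arange(num_bins, dtype=torch.float32)
    density = scale / (torch.pi * ((i - true_bin) ** 2 + scale ** 2))
    return density / density.sum()

def total_loss(pred_logits, true_bin, pred_angle, true_angle, scale=4.0, mae_weight=0.06):
    """Loss_total = Loss_G + 0.06 * Loss_M, with Loss_G taken here as the KL divergence between
    predicted and target Cauchy label distributions and Loss_M the MAE of the regressed angle."""
    target = cauchy_label_distribution(true_bin, pred_logits.numel(), scale)
    log_pred = F.log_softmax(pred_logits, dim=0)
    loss_g = F.kl_div(log_pred, target, reduction="sum")
    loss_m = (pred_angle - true_angle).abs()
    return loss_g + mae_weight * loss_m

logits = torch.randn(46)                       # predicted yaw label scores over 46 bins
loss = total_loss(logits, true_bin=23, pred_angle=torch.tensor(2.5), true_angle=torch.tensor(4.0))
```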
Step 4: judge the fatigue level by combining the blood oxygen saturation and ECG signal changes, and comprehensively evaluate the learner's attention state according to the head pose angles and facial feature point positions of the learner at different moments. Judge whether the head pose falls within the non-concentration zone: if so, the learner is not concentrating at that moment; otherwise the learner is concentrating.
Table 1: Learner attention state comprehensive evaluation rules
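Because Table 1 is not reproduced in this text, the following sketch only illustrates the shape such a rule could take; every threshold below (the head-angle limits, the eye-closure and yawn criteria, the fatigue cut-off) is a hypothetical placeholder, not a value taken from the patent:

```python
def attention_state(yaw, pitch, roll, eyes_closed, yawning, fatigue_level,
                    yaw_limit=40.0, pitch_limit=30.0):
    """Coarse rule-based evaluation combining head pose, facial cues and fatigue.
    All limits here are placeholders; the patent's actual rules are given in Table 1."""
    head_off_target = abs(yaw) > yaw_limit or abs(pitch) > pitch_limit
    drowsy_face = eyes_closed or yawning
    if head_off_target or drowsy_face or fatigue_level >= 3:
        return "not focused"            # would trigger a reminder to the learner
    return "focused"

state = attention_state(yaw=55.0, pitch=5.0, roll=2.0,
                        eyes_closed=False, yawning=False, fatigue_level=1)
```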
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (10)

1. A learning attention state assessment method based on a multidimensional feature fusion network, characterized by comprising the following steps:
acquiring learner video captured by a binocular imaging device (a short-wave infrared camera and a lidar scanner) on the desk and dividing it into multiple frames of images; acquiring the learner's blood oxygen saturation and heart rate signals with a hand-worn wearable device;
locating the face region and facial feature points in the learner's SWIR image, and segmenting the 3D point cloud set of the head region;
inputting the face-region SWIR image and the head 3D point cloud set into the corresponding feature extraction networks to obtain feature topology graphs, fusing them with a self-attention weighting module, and inputting the result into a Cauchy label distribution regression module to obtain the learner's head pose angles; extracting blood oxygen saturation and heart rate variability features and determining the learner's fatigue level;
comprehensively evaluating the attention state according to the learner's head pose angles, the changes in facial feature point positions and the fatigue level, and reminding the learner if attention is not focused;
compiling statistics on the attention state during the learning process and feeding back a statistical analysis report.
2. The learning attention state assessment method based on a multidimensional feature fusion network as claimed in claim 1, wherein the face region and facial feature point positioning module works as follows:
Step 1.1.1: each frame of the SWIR image of the subject is resized to 624×624 pixels and input into a pre-trained lightweight Mask R-CNN network to obtain the face region (I_x, I_y, m, n);
Step 1.2.1: the cropped face-region SWIR image is input into the global coarse feature point extraction network RG-Net with the structure {conv1-res1-res2-res3-glDSC-fc}, where res denotes a residual connection layer, glDSC a global channel-separable convolution and fc a fully connected layer, and the final global coarse feature point coordinate vector P_0 is regressed;
Step 1.2.2: the res1-layer output feature map of RG-Net is taken, a p×q patch centred on each coarse feature point (x_j, y_j) is cropped and input into the local refinement network FL-Net to extract a feature vector, and the first-level refined facial feature point coordinate vector P_1 is regressed;
Step 1.2.3: the conv1-layer output feature map of RG-Net is taken, a p×q patch centred on each coarse feature point (x_j, y_j) is cropped and input into the local refinement network FL-Net to extract a feature vector, and the second-level refined facial feature point coordinate vector P_2^T, i.e. the sparse facial feature point coordinate vector, is regressed.
3. The learning attention state assessment method based on a multidimensional feature fusion network as claimed in claim 1, wherein the head pose two-dimensional feature extraction model comprises a channel-separable convolution module, a pixel-space Transformer module, a fusion feature topology graph construction module and an adaptive graph convolution module.
4. The learning attention state assessment method based on a multidimensional feature fusion network as claimed in claim 3, wherein the channel-separable convolution module is trained as follows:
Step 2.1.1: the SWIR face-region images I_swir ∈ R^(N×H×W×C) are input into the dual-branch channel-separable convolution network to extract local features of the two-dimensional images;
Step 2.1.2: branch I has the structure {SC_MAX(16)-SC_1(32)-SC_MAX(32)}, where the SC_1 module has the structure [SC, BN, RL]; SC denotes channel-separable convolution, which extracts local features by point-wise convolution per channel, BN denotes batch normalization of the batch of input images, and the RL activation function replaces negative elements with zero; SC_MAX applies local max pooling on top of SC_1 to obtain the head pose local feature map I_s1_1 ∈ R^(N×H′×W′×C′);
Step 2.1.3: branch II has the structure {SC_AVE(16)-SC_2(32)-SC_AVE(32)}, where the SC_2 module has the structure [SC, BN, TH]; the TH activation function normalizes elements to (−1, 1), and SC_AVE applies local average pooling on top of SC_2 to finally obtain the head pose local feature map I_s2_1 ∈ R^(N×H′×W′×C′).
5. The learning attention state assessment method based on a multidimensional feature fusion network as claimed in claim 3, wherein the pixel-space Transformer module is trained as follows:
Step 2.2.1: I_s1_1 is input into branch I, whose first stage is {SC_MAX(32)-Transformer-Patch_Max} and whose second stage is {SC_MAX(32)-Transformer}; the pixel-space Transformer layer is a cascade of three pixel-space Transformer encoders and extracts the global feature relationships of the pixel space;
Step 2.2.2: the pixel-space Transformer encoder stretches the output feature map I_sc ∈ R^(N×H″×W″×C′) of the SC layer into a three-dimensional vector I_emb and adds the position code I_P; I_emb + I_P is input into the multi-head self-attention mapping module, whose final output I_A is obtained by dot-multiplying the input with the self-mapping weight matrices and applying a nonlinear transformation; I_A passes through the residual normalization layer and the fully connected layer to give the output of the Transformer encoder; three pixel-space Transformer encoders are cascaded to finally obtain the head pose fusion feature map MAP_1 ∈ R^(N×H″′×W″′×C′);
Step 2.2.3: I_s2_1 is input into branch II, which is similar in structure to branch I but extracts different feature maps based on local averaging Patch_Ave, finally obtaining the head pose fusion feature map MAP_2 ∈ R^(N×H″′×W″′×C′).
6. The learning attention state assessment method based on a multidimensional feature fusion network as claimed in claim 4, wherein the fusion feature topology graph construction module comprises construction of the fusion feature graph vertices V_M and of the topological connection matrix T; MAP_1 and MAP_2 are multiplied element-wise and mapped through a fully connected layer to a low-dimensional fusion feature vector M; the value of a head pose fusion feature graph vertex V_M is the fusion feature vector M of a single image, and the head pose fusion feature topology graph shares the topological connection matrix T with the 3D point cloud topology graph.
7. The learning attention state assessment method based on a multidimensional feature fusion network as claimed in claim 1, wherein the head pose three-dimensional feature extraction model comprises a facial feature point 3D point cloud topology graph construction module and an adaptive graph convolution module.
8. The learning attention state assessment method based on a multidimensional feature fusion network as claimed in claim 7, wherein the facial feature point 3D point cloud topology graph construction module works as follows:
Step 3.1.1: the 3D point cloud topology graph construction module comprises construction of the 3D point cloud graph vertices V_D and of the topological connection matrix T; according to the two-dimensional coordinates P_2^T of the facial feature points, the corresponding 3D point cloud coordinates are selected from the point cloud set pic_1, and the value of a 3D point cloud vertex V_D is the 3D coordinate of a facial key point;
Step 3.1.2: for each graph vertex the 5 vertices closest in Euclidean space are found with a KD-Tree search and connected, building the topological connection matrix T ∈ R^(N×N), where N is the number of feature points; T(i, j) = 1 means the two graph vertices are connected, otherwise T(i, j) = 0.
9. The learning attention state assessment method based on a multidimensional feature fusion network as claimed in claim 1, wherein the adaptive graph convolution module works as follows:
Step 4.1.1: for each vertex V_n of the feature graph G, the K vertices nearest to it in its neighbourhood are selected to form vertex pairs, and the feature values are updated;
Step 4.1.2: for each vertex pair M channels are constructed and each channel computes its feature value independently, using vector concatenation (where [A, B] denotes the concatenation of vectors A and B), the dot product ⊙, an MLP layer, and the nonlinear RL activation RL(·), which converts negative elements to 0; the feature values of all vertex pairs are concatenated and channel-wise max pooling gives the updated M-dimensional graph vertex.
10. The learning attention state assessment method based on a multidimensional feature fusion network as claimed in claim 1, wherein the blood oxygen saturation and ECG feature extraction module computes the mean square error σ_sp of the SpO2 sampling points within one period, the standard deviation σ_RR of the intervals between the R waves of adjacent heartbeat signals, and the ratio γ of the high-frequency to ultra-low-frequency energy spectral density within adjacent R-wave periods.
Application CN202211662783.0A, filed 2022-12-23: Learning attention state assessment method based on multidimensional feature fusion network (status: Pending)

Priority Applications (1)

CN202211662783.0A, priority and filing date 2022-12-23: Learning attention state assessment method based on multidimensional feature fusion network


Publications (1)

CN117173758A, published 2023-12-05

Family

ID=88935740



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination